Mapping unmapped XML elements

Not all APIs return JSON; XML is still around, and I’m still learning things about XML serialization.

Building with WCF back in the day, I learned about IExtensibleDataObject, a way to receive and return unmapped XML elements.
This helps forward compatibility of your DataContracts: a client won’t discard unknown properties.

Store extra data encountered by the XmlObjectSerializer during deserialization of a DataContract

IExtensibleDataObject

We came across exactly this problem, not with a SOAP API, but with a REST API that returns XML. As a client we only used a subset of the XML contract, but we had to PUT back all elements in an update, or properties would be reset on the server.

I remembered IExtensibleDataObject, but it was no help since we didn’t use the DataContract attributes/XmlObjectSerializer approach; instead our classes are marked with XmlElementAttribute and the like.

Just before we went down the custom-solution route, a colleague came across XmlAnyElementAttribute.

This member contains objects that represent any XML element that has no corresponding member in the object being serialized or deserialized

XmlAnyElementAttribute

How nice is this: you add a property decorated with this attribute to capture all unmapped/unknown elements, and you can PUT them back to the server without losing anything.
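For illustration, a minimal sketch (the class and property names are made up, not from our actual contract):

// using System.Xml;
// using System.Xml.Serialization;

public class Customer
{
    [XmlElement("Name")]
    public string Name { get; set; }

    // Any element without a matching mapped property ends up here during deserialization,
    // and is written back out unchanged when the object is serialized again
    [XmlAnyElement]
    public XmlElement[] ExtraElements { get; set; }
}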

Theme music for this blog post


AutoML your way off the Titanic

ML.NET looks like the easiest way to get started with machine learning in .NET since numl.
The visual Model Builder makes ML virtually dummy-proof for non-data scientists by way of “automated ML”, or AutoML for short. I decided to test-drive the command-line version on the Kaggle starter Titanic challenge.

Setup

I assume you have installed everything needed to get started with ML.NET on the command line, and that you have registered on Kaggle and joined the Titanic challenge.

One simple command

mlnet auto-train --task binary-classification --dataset "train.csv" --label-column-name Survived --ignore-columns PassengerId
  • --task binary-classification
    • We’re trying to predict whether a passenger survived or did not survive, so this is binary classification
  • --dataset "train.csv"
    • The training data downloaded from Kaggle
  • --label-column-name Survived
    • The Survived column contains the labels we want to predict
  • --ignore-columns PassengerId
    • In the true spirit of AutoML I’m assuming we haven’t done any exploratory analysis of the data, but I don’t want to make it harder than necessary either. Previewing the data on Kaggle, we can safely assume PassengerId values are unique identifiers that give no insight into whether a passenger survived

The command will run for 30 minutes, unless you specify --max-exploration-time. More time means more time to fine-tune the resulting model, which could lead to better results.

This is roughly what I assume happens automatically (a hand-rolled sketch of steps 2 and 3 follows the list):

  1. Select a binary classification algorithm
  2. Train the model on 80% of the given training data
    1. This means tuning the selected algorithm until, for the given data, (most of) its predictions match the corresponding Survived values
    2. Possibly it uses 100% of the training data and performs cross-validation instead
  3. Test the resulting model on the remaining 20% of the training data
    1. The model hasn’t seen this data, so this is a true test of the quality of the tuned algorithm
  4. Repeat steps 1-3 with the other binary classification algorithms available in ML.NET
  5. Compare the results of step 3 and pick the best model
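
Done by hand, steps 2 and 3 would look something like the sketch below. This is only an illustration, assuming the generated ModelInput maps Survived as bool and the numeric columns as float; it uses just the numeric columns and a single hand-picked trainer, whereas AutoML explores featurization, trainers and hyperparameters for us:

MLContext mlContext = new MLContext(seed: 0);
var data = mlContext.Data.LoadFromTextFile<ModelInput>("train.csv",
    separatorChar: ',', hasHeader: true, allowQuoting: true);

// 2. Hold out 20% of the training data for testing
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

// Minimal hand-rolled pipeline: fill missing ages, concatenate numeric columns, one trainer
var pipeline = mlContext.Transforms.ReplaceMissingValues("Age")
    .Append(mlContext.Transforms.Concatenate("Features", "Pclass", "Age", "SibSp", "Parch", "Fare"))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Survived"));

var model = pipeline.Fit(split.TrainSet);

// 3. Evaluate on the 20% the model hasn't seen
var metrics = mlContext.BinaryClassification.Evaluate(
    model.Transform(split.TestSet), labelColumnName: "Survived");
Console.WriteLine($"Accuracy: {metrics.Accuracy:0.####}");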

Ta-daa

|                                              Top 5 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
|1    LinearSvmBinary                        0,8462   0,8937   0,9075    0,8182       0,5          5             |
|2    SgdCalibratedBinary                    0,8462   0,8671   0,8678    0,8125       0,5         24             |
|3    SgdCalibratedBinary                    0,8462   0,8664   0,8637    0,8125       0,7         72             |
|4    LinearSvmBinary                        0,8462   0,8625   0,8656    0,8125       0,9         73             |
|5    SgdCalibratedBinary                    0,8462   0,8671   0,8649    0,8125       0,7         75             |
------------------------------------------------------------------------------------------------------------------
Generated trained model for consumption: ...\SampleBinaryClassification\SampleBinaryClassification.Model\MLModel.zip    
Generated C# code for model consumption: ...\SampleBinaryClassification\SampleBinaryClassification.ConsoleApp
Check out log file for more information: ...\SampleBinaryClassification\logs\debug_log.txt

It says LinearSvmBinary gave the best bang for the buck, and the model was saved to MLModel.zip.

We can run this model on test.csv and submit the results to Kaggle for scoring; we just need a few code changes:

  1. Open the solution SampleBinaryClassification.sln
  2. Update train.csv to test.csv in SampleBinaryClassification.ConsoleApp.Program.DATA_FILEPATH
  3. Copy ModelInput to a new class TestModelInput, remove the Survived property, and renumber the LoadColumn attributes accordingly (see the sketch after the Program code below)
    1. You’re with me, right? test.csv doesn’t have a Survived column, and the input class needs to match the data we’re loading
  4. Add the property public float PassengerId { get; set; } to ModelOutput; this is required for the Kaggle submission
  5. Because the trained model expects ModelInput, we need to map each TestModelInput back to a ModelInput, and we want to predict all rows at once, so update your Program like so:
// requires using System.IO; and using System.Linq; at the top of Program.cs
static void Main(string[] args)
{
    MLContext mlContext = new MLContext();

    // Load test.csv using the class without the Survived column
    var data = mlContext.Data.LoadFromTextFile<TestModelInput>(DATA_FILEPATH,
        separatorChar: ',', hasHeader: true, allowQuoting: true);

    // Map each TestModelInput back to the ModelInput the trained model expects
    var modelInputs = mlContext.Data.LoadFromEnumerable(
        mlContext.Data.CreateEnumerable<TestModelInput>(data, reuseRowObject: false)
            .Select(p => new ModelInput
            {
                PassengerId = p.PassengerId,
                Pclass = p.Pclass,
                Name = p.Name,
                Sex = p.Sex,
                Age = p.Age,
                SibSp = p.SibSp,
                Parch = p.Parch,
                Ticket = p.Ticket,
                Fare = p.Fare,
                Cabin = p.Cabin,
                Embarked = p.Embarked
            }));

    var predictionPipeline = mlContext.Model.Load(MODEL_FILEPATH, out DataViewSchema predictionPipelineSchema);
    var predictions = predictionPipeline.Transform(modelInputs);
    var survivalPredictions = mlContext.Data.CreateEnumerable<ModelOutput>(predictions, reuseRowObject: false);

    // Write the submission file in the format Kaggle expects: PassengerId,Survived
    File.WriteAllLines("kaggle_submission.csv",
        new string[] { "PassengerId,Survived" }
        .Concat(survivalPredictions.Select(p => $"{p.PassengerId},{(p.Prediction ? 1 : 0)}")));

    Console.WriteLine("=============== End of process, hit any key to finish ===============");
    Console.ReadKey();
}
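
For reference, this is roughly what TestModelInput ends up looking like (a sketch; the column order is taken from the test.csv header, so double-check it against your generated ModelInput):

public class TestModelInput
{
    // test.csv columns: PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
    [LoadColumn(0)] public float PassengerId { get; set; }
    [LoadColumn(1)] public float Pclass { get; set; }
    [LoadColumn(2)] public string Name { get; set; }
    [LoadColumn(3)] public string Sex { get; set; }
    [LoadColumn(4)] public float Age { get; set; }
    [LoadColumn(5)] public float SibSp { get; set; }
    [LoadColumn(6)] public float Parch { get; set; }
    [LoadColumn(7)] public string Ticket { get; set; }
    [LoadColumn(8)] public float Fare { get; set; }
    [LoadColumn(9)] public string Cabin { get; set; }
    [LoadColumn(10)] public string Embarked { get; set; }
}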

Run the program, find kaggle_submission.csv in the bin folder, and submit it to Kaggle.

This gives me a score of 0.78468, which is not my best score so far (0.80382), but not bad for one command and a few code changes, I think.

Theme music for this blog post


RTFAQ – Azure App Service request timeout limit

I had to do a one-time POST to an Azure App Service to trigger some post-release task, which would be performed on the request thread.
The task could take a while, but because this was a one-time thing, setting up proper background processing wasn’t worth it.

Everything worked fine on my machine, until it had to run on Azure. The post-release request got aborted after a couple of minutes, while the server kept processing the request.

Increasing the HttpClient.Timeout property didn’t help.
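
For the record, this is the kind of change that had no effect (a sketch), since the limit described below is enforced on the Azure side, not by the client:

// Raising the client-side timeout well beyond the default 100 seconds;
// the request still gets cut off by Azure after ~230 seconds
var client = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };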

After finding the right keywords to search for, it turned out the explanation was hiding in plain sight in the FAQ all along:

Why does my request time out after 230 seconds?

Azure Load Balancer has a default idle timeout setting of four minutes.

https://docs.microsoft.com/en-us/azure/app-service/faq-availability-performance-application-issues#why-does-my-request-time-out-after-230-seconds

This is a hard limit you can’t exceed. I guess Azure does this to protect its environment, or at least to have a known upper bound it can build on when deciding to scale or to trigger DDoS protection.

Theme music for this blog post


Proxy as a service

Azure Functions is a lightweight way to quickly expose an API, “serverless” style.

Now I wanted to proxy another API to clients (to avoid exposing the API key and to have a future extension point), so I reached for Azure Functions again. Just create an HTTP trigger that acts as the proxy and forwards calls to the other API, right?

Apparently this use case is already covered by Azure Functions Proxies.

You literally just declare the proxy and it works, nice.
Add a proxies.json file to the root of your Azure Functions app (assuming you’re writing it in code; if you use the portal, just go to the Proxies node and follow the self-explanatory UI):

{
  "$schema": "http://json.schemastore.org/proxies",
  "proxies": {
    "name_your_proxy": {
      "matchCondition": {
        "methods": [ "POST" ],
        "route": "/api/proxy_call"
      },
      "backendUri": "https://the-api-you-want-to-proxy.com",
      "requestOverrides": {
        "backend.request.headers.example-secret": "dummy-secret" 
      }
    }
  }
}

Publish this and you can call your Azure Functions app on /api/proxy_call, which forwards the call to https://the-api-you-want-to-proxy.com while adding the header with the secret you don’t want to expose to your clients. Simple!
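
A quick sketch of a client call (the function app host name and payload are placeholders, and this lives in an async method):

// using System.Net.Http; using System.Text;
// POST to the proxy route; the example-secret header is added by the proxy on the way through,
// so this client never needs to know it
var client = new HttpClient();
var response = await client.PostAsync(
    "https://your-function-app.azurewebsites.net/api/proxy_call",
    new StringContent("{}", Encoding.UTF8, "application/json"));
Console.WriteLine(response.StatusCode);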

Theme music for this blog post


target _blank security issue

I think this is an old issue, but I only just learned about it via the lint error react/jsx-no-target-blank in a React project.

Apparently a window opened via a link with target="_blank" gets a reference to the originating window through window.opener.
When the new window is another (malicious) site, it can use that reference to tamper with the page that opened it, for example by navigating it to a phishing page.

Adding rel="noopener noreferrer" to your anchor tag mitigates this risk.

Check this quick and much better explanation:
https://mathiasbynens.github.io/rel-noopener/

Theme music for this blog post


EF6 VS EF Core inheritance

I lost some time to this subtle gotcha, so maybe this will save you some.

In EF6 you can configure inheritance with a single type.

An inheritance hierarchy with a single type might seem pointless, but I’ve seen it used to automatically filter data, which I found pretty clever: to ignore records marked as deleted, the discriminator column was used to only map records with deleted set to false.

Say you want to map SomeEntity only to rows that have the value "SomeEntityType" in the Discriminator column:

public class SomeContext : DbContext
{
   protected override void OnModelCreating(DbModelBuilder modelBuilder)
   {
      modelBuilder.Entity<SomeEntity>()
         .Map(m => m.Requires("Discriminator")
            .HasValue("SomeEntityType")
         );
   }
}

Now, porting the EF6 implementation as-is to EF Core syntax won’t work:

// EF Core: incorrect
modelBuilder.Entity<SomeEntity>()
   .HasDiscriminator<string>("Discriminator")
   .HasValue<SomeEntity>("SomeEntityType");

The gotcha is that it not only doesn’t work, it doesn’t crash or warn you either. It just returns all SomeEntity rows, ignoring the discriminator: nothing gets added to the WHERE clause in the generated T-SQL.

I spent some time troubleshooting this, until I carefully read the docs again:

EF will only setup inheritance if two or more inherited types are explicitly included in the model

https://docs.microsoft.com/en-us/ef/core/modeling/inheritance

So in the incorrect mapping only the base type gets set up, and no discriminator gets applied (the faulty configuration is apparently just ignored).

You actually need to have a separate base and derived type, and map them accordingly, like so:

// EF Core: correct
modelBuilder.Entity<SomeEntityBase>()
   .HasDiscriminator<string>("Discriminator")
   .HasValue<SomeEntityBase>("NotSomeEntityType")
   .HasValue<SomeEntity>("SomeEntityType");

A bit more verbose, but in normal inheritance scenarios you would have had the base type anyway.
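
For completeness, the kind of minimal class hierarchy this implies (a sketch; only SomeEntityBase and SomeEntity come from the mapping above, the Id property is made up):

public class SomeEntityBase
{
    public int Id { get; set; }
    // ...shared mapped properties
}

// Exists mainly so EF Core sets up TPH inheritance and applies the discriminator filter
public class SomeEntity : SomeEntityBase
{
}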

Theme music for this blog post
