AutoML your way off the Titanic

ML.NET looks like the easiest way to get started with machine learning in .NET since numl.
The visual Model Builder makes ML practically dummy-proof for non-data scientists by way of “automated ML”, or AutoML for short. I decided to test-drive the command-line version on the Kaggle starter Titanic challenge.

Setup

I assume you have installed everything to get started with ML.NET on the command line. Plus you have registered on Kaggle and joined the Titanic challenge.

One simple command

mlnet auto-train --task binary-classification --dataset "train.csv" --label-column-name Survived --ignore-columns PassengerId
  • --task binary-classification
    • We’re trying to predict whether a passenger survived or did not survive: two possible outcomes, so this is binary classification
  • --dataset "train.csv"
  • --label-column-name Survived
    • The Survived column contains the labels we want to predict
  • --ignore-columns PassengerId
    • In the true spirit of AutoML I’m assuming we haven’t done any exploratory analysis of the data, but I don’t want to make it harder than necessary either. When previewing the data on Kaggle we can safely assume PassengerId is a unique identifier that gives no insight into whether a passenger survived or not

The command will run for 30 minutes, unless you specify --max-exploration-time. More time means more time to fine-tune the resulting model, which could lead to better results.
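For example, to let AutoML explore for an hour instead, the flag takes a value in seconds:

```shell
# Same command as before, but explore models for 60 minutes instead of the default 30
mlnet auto-train --task binary-classification --dataset "train.csv" --label-column-name Survived --ignore-columns PassengerId --max-exploration-time 3600
```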

This is what I assume happens automatically, roughly:

  1. Select a binary classification algorithm
  2. Train the model on 80% of the given training data
    1. This means tuning the selected algorithm until, for the given data, (most of) its predictions match the corresponding Survived column
    2. Possibly it uses 100% of the training data and performs cross-validation instead
  3. Test the resulting model on the remaining 20% of the training data
    1. The model hasn’t seen this data, so this is a true test on the quality of the tuned algorithm
  4. Repeat steps 1-3 with the other binary classification algorithms available in ML.NET
  5. Compare the results of step 3 and pick the best model
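To make the speculation concrete, the loop above could be sketched with the ML.NET API roughly like this. This is my assumption of the process, not the CLI’s actual internals, and featurization stands in for whatever feature pipeline AutoML generates:

```csharp
// Sketch of the assumed AutoML loop: split, train, test, compare.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);   // steps 2 & 3: 80/20 split

var candidates = new IEstimator<ITransformer>[]                       // steps 1 & 4: candidate algorithms
{
    featurization.Append(mlContext.BinaryClassification.Trainers.LinearSvm(labelColumnName: "Survived")),
    featurization.Append(mlContext.BinaryClassification.Trainers.SgdCalibrated(labelColumnName: "Survived"))
};

ITransformer bestModel = null;
double bestAccuracy = 0;
foreach (var candidate in candidates)
{
    var model = candidate.Fit(split.TrainSet);                        // step 2: train on 80%
    var metrics = mlContext.BinaryClassification.EvaluateNonCalibrated(
        model.Transform(split.TestSet), labelColumnName: "Survived"); // step 3: test on the unseen 20%
    if (metrics.Accuracy > bestAccuracy)                              // step 5: keep the best
    {
        bestAccuracy = metrics.Accuracy;
        bestModel = model;
    }
}
```

In reality AutoML also sweeps hyperparameters per trainer (hence the #Iteration column in the output below), but the split-train-test-compare skeleton is the same.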

Ta-daa

|                                              Top 5 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
|1    LinearSvmBinary                        0,8462   0,8937   0,9075    0,8182       0,5          5             |
|2    SgdCalibratedBinary                    0,8462   0,8671   0,8678    0,8125       0,5         24             |
|3    SgdCalibratedBinary                    0,8462   0,8664   0,8637    0,8125       0,7         72             |
|4    LinearSvmBinary                        0,8462   0,8625   0,8656    0,8125       0,9         73             |
|5    SgdCalibratedBinary                    0,8462   0,8671   0,8649    0,8125       0,7         75             |
------------------------------------------------------------------------------------------------------------------
Generated trained model for consumption: ...\SampleBinaryClassification\SampleBinaryClassification.Model\MLModel.zip    
Generated C# code for model consumption: ...\SampleBinaryClassification\SampleBinaryClassification.ConsoleApp
Check out log file for more information: ...\SampleBinaryClassification\logs\debug_log.txt

It says LinearSvmBinary gave the best bang for the buck, and the model was saved to MLModel.zip.

We can run this model on test.csv and submit the results to Kaggle for scoring; we just need to apply a few code changes:

  1. Open the solution SampleBinaryClassification.sln
  2. Update train.csv to test.csv in SampleBinaryClassification.ConsoleApp.Program.DATA_FILEPATH
  3. Copy ModelInput to TestModelInput, remove the property Survived, and renumber the LoadColumn attributes accordingly
    1. You’re with me, right? test.csv doesn’t have a Survived column, and the input class needs to match the data we’re loading
  4. Add the property public float PassengerId { get; set; } to ModelOutput; this is required for the Kaggle submission
  5. Because the trained model expects ModelInput, we need to map each TestModelInput back to a ModelInput. Plus we want to perform multiple predictions, so update your Program like so:
  1.         static void Main(string[] args)
            {
                MLContext mlContext = new MLContext();
    
                var data = mlContext.Data.LoadFromTextFile<TestModelInput>(DATA_FILEPATH,
                    separatorChar: ',', hasHeader: true, allowQuoting: true);
    
                // Map each TestModelInput back to the ModelInput the trained model expects
                var modelInputs = mlContext.Data.LoadFromEnumerable(
                    mlContext.Data.CreateEnumerable<TestModelInput>(data, reuseRowObject: false).Select(p => new ModelInput
                    {
                        PassengerId = p.PassengerId,
                        Pclass = p.Pclass,
                        Name = p.Name,
                        Sex = p.Sex,
                        Age = p.Age,
                        SibSp = p.SibSp,
                        Parch = p.Parch,
                        Ticket = p.Ticket,
                        Fare = p.Fare,
                        Cabin = p.Cabin,
                        Embarked = p.Embarked
                    }));
    
                var predictionPipeline = mlContext.Model.Load(MODEL_FILEPATH, out DataViewSchema predictionPipelineSchema);
                var predictions = predictionPipeline.Transform(modelInputs);
                var survivalPredictions = mlContext.Data.CreateEnumerable<ModelOutput>(predictions, reuseRowObject: false);
    
                File.WriteAllLines("kaggle_submission.csv",
                    new string[] { "PassengerId,Survived" }
                    .Concat(survivalPredictions.Select(p => $"{p.PassengerId},{(p.Prediction ? 1 : 0)}")));
    
                Console.WriteLine("=============== End of process, hit any key to finish ===============");
                Console.ReadKey();
            }
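For reference, the TestModelInput from step 3 could look like this. It is a sketch based on the column order of test.csv, assuming the property types mirror the generated ModelInput:

```csharp
using Microsoft.ML.Data;

public class TestModelInput
{
    // Same as ModelInput, minus Survived, with LoadColumn renumbered to match test.csv
    [LoadColumn(0)] public float PassengerId { get; set; }
    [LoadColumn(1)] public float Pclass { get; set; }
    [LoadColumn(2)] public string Name { get; set; }
    [LoadColumn(3)] public string Sex { get; set; }
    [LoadColumn(4)] public float Age { get; set; }
    [LoadColumn(5)] public float SibSp { get; set; }
    [LoadColumn(6)] public float Parch { get; set; }
    [LoadColumn(7)] public string Ticket { get; set; }
    [LoadColumn(8)] public float Fare { get; set; }
    [LoadColumn(9)] public string Cabin { get; set; }
    [LoadColumn(10)] public string Embarked { get; set; }
}
```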

Run the program, find kaggle_submission.csv in the bin folder, and submit it to Kaggle.

This gives me a score of 0.78468, which is not my best score so far (0.80382), but not bad for one command and a few code changes, I think.

Theme music for this blog post


RTFAQ – Azure App Service request timeout limit

I had to do a one-time POST to an Azure App Service to trigger a post-release task, which would be performed on the request thread.
The task could take long, but because this was a one-time thing, setting up a proper background-processing mechanism wasn’t worth it.

Everything worked fine on my machine, until it had to run on Azure. The post-release request got aborted after a couple of minutes, while the server kept processing the request.

Increasing the HttpClient.Timeout property didn’t help.
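In other words, a client-side change along these lines (variable names are illustrative) makes no difference:

```csharp
// HttpClient happily waits longer, but the Azure load balancer
// still cuts the connection after 230 seconds regardless.
var client = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };
var response = await client.PostAsync(postReleaseUrl, content);
```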

After finding the right keywords to search this problem, it turns out the explanation was hiding in plain sight in the FAQ all along:

Why does my request time out after 230 seconds?

Azure Load Balancer has a default idle timeout setting of four minutes.

https://docs.microsoft.com/en-us/azure/app-service/faq-availability-performance-application-issues#why-does-my-request-time-out-after-230-seconds

This is a hard limit you can’t exceed. I guess Azure does this to protect its environment, or at least to have a known upper bound to build on when deciding to scale or to trigger DDoS protection.

Theme music for this blog post



Proxy as a service

Azure Functions is a lightweight way to quickly expose an API, the “serverless” way.

Now I wanted to proxy another API for clients (to avoid exposing the API key, and to have a future extension point), so I reached for Azure Functions again. Just create an HTTP trigger that acts as the proxy and forwards calls to the other API, right?

Apparently this use case is already handled by Azure Functions Proxies.

You literally just declare the proxy and it works, nice.
Add a proxies.json to the root of your Azure Functions app (assuming you are programming it; if you use the portal, just go to the Proxies node and follow the self-explanatory UI):

{
  "$schema": "http://json.schemastore.org/proxies",
  "proxies": {
    "name_your_proxy": {
      "matchCondition": {
        "methods": [ "POST" ],
        "route": "/api/proxy_call"
      },
      "backendUri": "https://the-api-you-want-to-proxy.com",
      "requestOverrides": {
        "backend.request.headers.example-secret": "dummy-secret" 
      }
    }
  }
}

Publish this and you can call your Azure Functions app on /api/proxy_call, which forwards the call to https://the-api-you-want-to-proxy.com while adding the header with the secret you don’t want to expose to your clients. Simple!
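A call could then look like this (URL and payload made up for illustration):

```shell
# The client never sees example-secret; the proxy adds it on the way to the backend
curl -X POST "https://your-functions-app.azurewebsites.net/api/proxy_call" -d '{ "hello": "proxy" }'
```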

Theme music for this blog post


target _blank security issue

I think this is an old issue, but I only just learned about it via the lint error react/jsx-no-target-blank in a React project.

Apparently a new window opened via a link keeps a reference to the originating window through window.opener.
When that new window is another (malicious) site, it can use that reference to, for example, redirect the page that linked to it (a trick known as tabnabbing).

Adding rel="noopener noreferrer" to your anchor tag mitigates this risk.
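In plain HTML (the href is a placeholder):

```html
<!-- Without rel="noopener noreferrer", the opened page can reach back via window.opener -->
<a href="https://example.com" target="_blank" rel="noopener noreferrer">external link</a>
```

The jsx-no-target-blank lint rule asks for exactly this attribute on JSX anchors as well.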

Check out this much better explanation:
https://mathiasbynens.github.io/rel-noopener/

Theme music for this blog post
