AutoML your way off the Titanic

ML.NET looks like the easiest way to get started with machine learning in .NET since numl.
The visual Model Builder makes ML close to dummy-proof for non-data scientists by way of “automated ML”, or AutoML for short. I decided to test-drive the command-line version on Kaggle’s Titanic starter challenge.

Setup

I assume you have installed everything you need to run ML.NET from the command line, and that you have registered on Kaggle and joined the Titanic challenge.

One simple command

mlnet auto-train --task binary-classification --dataset "train.csv" --label-column-name Survived --ignore-columns PassengerId
  • --task binary-classification
    • We’re trying to predict whether a passenger survived or not, so this is binary classification
  • --dataset "train.csv"
  • --label-column-name Survived
    • The Survived column labels the rows
  • --ignore-columns PassengerId
    • In the true spirit of AutoML I’m assuming we haven’t done any exploratory research on the data, but I don’t want to make it harder than necessary either. Previewing the data on Kaggle, we can safely assume PassengerId is a unique identifier that gives no insight into whether a passenger survived

The command runs for 30 minutes unless you specify --max-exploration-time (a value in seconds). More time means more room to fine-tune the resulting model, which could lead to better results.

This is what I assume happens automatically, roughly:

  1. Select a binary classification algorithm
  2. Train the model on 80% of the given training data
    1. This means tuning the selected algorithm until, for the given data, most of its predictions match the Survived column
    2. Possibly it uses 100% of the training data and performs cross-validation instead
  3. Test the resulting model on the remaining 20% of the training data
    1. The model hasn’t seen this data, so this is a true test of the quality of the tuned algorithm
  4. Repeat steps 1-3 with the other binary classification algorithms available in ML.NET
  5. Compare the results of step 3 across all algorithms and pick the best model
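
To make the steps above a bit more concrete, here is a minimal sketch of the same hold-out idea written by hand against the ML.NET API. It is my guess at the general shape, not what the CLI actually runs: AutoML also handles featurization and hyperparameter sweeping, which I skip here. The ToyRow class, the made-up data and the two candidate trainers are assumptions chosen purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for illustration; this is not the Titanic schema.
public class ToyRow
{
    public bool Label { get; set; }

    [VectorType(2)]
    public float[] Features { get; set; }
}

public static class HoldoutSketch
{
    // Returns the name and hold-out accuracy of the best candidate trainer.
    public static (string Name, double Accuracy) PickBest()
    {
        var mlContext = new MLContext(seed: 1);

        // Made-up data: the label is true when the two features sum to more than 1.
        var random = new Random(1);
        var rows = Enumerable.Range(0, 500).Select(_ =>
        {
            float a = (float)random.NextDouble();
            float b = (float)random.NextDouble();
            return new ToyRow { Label = a + b > 1f, Features = new[] { a, b } };
        }).ToList();

        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Steps 2 and 3: train on 80% of the rows, hold back 20% for testing.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

        // Steps 1 and 4: the candidate algorithms (an arbitrary pair, not AutoML's list).
        var candidates = new Dictionary<string, IEstimator<ITransformer>>
        {
            ["SdcaLogisticRegression"] = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(),
            ["LbfgsLogisticRegression"] = mlContext.BinaryClassification.Trainers.LbfgsLogisticRegression(),
        };

        // Step 5: keep the candidate with the highest accuracy on the unseen 20%.
        var best = (Name: "", Accuracy: -1.0);
        foreach (var candidate in candidates)
        {
            ITransformer model = candidate.Value.Fit(split.TrainSet);
            var metrics = mlContext.BinaryClassification.Evaluate(model.Transform(split.TestSet));
            Console.WriteLine($"{candidate.Key}: accuracy {metrics.Accuracy:F4}, AUC {metrics.AreaUnderRocCurve:F4}");
            if (metrics.Accuracy > best.Accuracy)
                best = (candidate.Key, metrics.Accuracy);
        }
        return best;
    }

    public static void Main()
    {
        var best = PickBest();
        Console.WriteLine($"Best: {best.Name} ({best.Accuracy:F4})");
    }
}
```

Picking on accuracy alone is a simplification; the CLI's table also reports AUC, AUPRC and F1-score for each model.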

Ta-daa

|                                              Top 5 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
|1    LinearSvmBinary                        0,8462   0,8937   0,9075    0,8182       0,5          5             |
|2    SgdCalibratedBinary                    0,8462   0,8671   0,8678    0,8125       0,5         24             |
|3    SgdCalibratedBinary                    0,8462   0,8664   0,8637    0,8125       0,7         72             |
|4    LinearSvmBinary                        0,8462   0,8625   0,8656    0,8125       0,9         73             |
|5    SgdCalibratedBinary                    0,8462   0,8671   0,8649    0,8125       0,7         75             |
------------------------------------------------------------------------------------------------------------------
Generated trained model for consumption: ...\SampleBinaryClassification\SampleBinaryClassification.Model\MLModel.zip    
Generated C# code for model consumption: ...\SampleBinaryClassification\SampleBinaryClassification.ConsoleApp
Check out log file for more information: ...\SampleBinaryClassification\logs\debug_log.txt

All five models tie on accuracy, but LinearSvmBinary tops the list with the highest AUC, AUPRC and F1-score, so it gave the best bang for the buck. The CLI saved that model in MLModel.zip.

We can run this model on test.csv and submit the results to Kaggle for scoring; we just need a few code changes:

  1. Open the solution SampleBinaryClassification.sln
  2. Update train.csv to test.csv in SampleBinaryClassification.ConsoleApp.Program.DATA_FILEPATH
  3. Copy ModelInput to TestModelInput, remove the Survived property, and renumber the LoadColumn attributes accordingly
    1. You’re with me, right? test.csv doesn’t have a Survived column, and the class needs to match the data we’re loading
  4. Add the property public float PassengerId { get; set; } to ModelOutput; the Kaggle submission requires it
  5. Because the trained model expects ModelInput, we need to map TestModelInput back to it. We also want to make multiple predictions at once, so update your Program like so:
    1.         static void Main(string[] args)
              {
                  // Program.cs also needs using directives for System.IO and System.Linq.
                  MLContext mlContext = new MLContext();
      
                  // Load test.csv using the schema without the Survived column.
                  var data = mlContext.Data.LoadFromTextFile<TestModelInput>(DATA_FILEPATH,
                      separatorChar: ',', hasHeader: true, allowQuoting: true);
      
                  // Map every row onto ModelInput, the type the trained model expects.
                  var modelInputs = mlContext.Data.LoadFromEnumerable(
                      mlContext.Data.CreateEnumerable<TestModelInput>(data, true).Select(p => new ModelInput
                      {
                          PassengerId = p.PassengerId,
                          Pclass = p.Pclass,
                          Name = p.Name,
                          Sex = p.Sex,
                          Age = p.Age,
                          SibSp = p.SibSp,
                          Parch = p.Parch,
                          Ticket = p.Ticket,
                          Fare = p.Fare,
                          Cabin = p.Cabin,
                          Embarked = p.Embarked
                      }));
      
                  // Load the trained model and score all rows in one go.
                  var predictionPipeline = mlContext.Model.Load(MODEL_FILEPATH, out DataViewSchema predictionPipelineSchema);
                  var predictions = predictionPipeline.Transform(modelInputs);
                  var survivalPredictions = mlContext.Data.CreateEnumerable<ModelOutput>(predictions, reuseRowObject: false);
      
                  // Write the submission file in the format Kaggle expects.
                  File.WriteAllLines("kaggle_submission.csv",
                      new string[] { "PassengerId,Survived" }
                      .Concat(survivalPredictions.Select(p => $"{p.PassengerId},{(p.Prediction ? 1 : 0)}")));
      
                  Console.WriteLine("=============== End of process, hit any key to finish ===============");
                  Console.ReadKey();
              }
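
For steps 3 and 4, the two classes could look roughly like the sketch below. The column order follows the header of Kaggle's test.csv. The property types are my assumption (only PassengerId as float comes from the step above), so copy whatever types your generated ModelInput declares. The Prediction and Score properties mirror what I would expect the generated ModelOutput to contain for a binary classifier.

```csharp
using Microsoft.ML.Data;

// Same shape as the generated ModelInput, minus Survived, with LoadColumn
// indexes renumbered to match test.csv. Property types are assumptions.
public class TestModelInput
{
    [ColumnName("PassengerId"), LoadColumn(0)]
    public float PassengerId { get; set; }

    [ColumnName("Pclass"), LoadColumn(1)]
    public float Pclass { get; set; }

    [ColumnName("Name"), LoadColumn(2)]
    public string Name { get; set; }

    [ColumnName("Sex"), LoadColumn(3)]
    public string Sex { get; set; }

    [ColumnName("Age"), LoadColumn(4)]
    public float Age { get; set; }

    [ColumnName("SibSp"), LoadColumn(5)]
    public float SibSp { get; set; }

    [ColumnName("Parch"), LoadColumn(6)]
    public float Parch { get; set; }

    [ColumnName("Ticket"), LoadColumn(7)]
    public string Ticket { get; set; }

    [ColumnName("Fare"), LoadColumn(8)]
    public float Fare { get; set; }

    [ColumnName("Cabin"), LoadColumn(9)]
    public string Cabin { get; set; }

    [ColumnName("Embarked"), LoadColumn(10)]
    public string Embarked { get; set; }
}

public class ModelOutput
{
    // Maps the model's PredictedLabel column onto the Prediction property.
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    public float Score { get; set; }

    // Added for the Kaggle submission: the input column passes through the
    // pipeline, so it can be read back next to the prediction.
    public float PassengerId { get; set; }
}
```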

Run the program, find kaggle_submission.csv in the bin folder, and submit it to Kaggle.

This gives me a score of 0.78468, which is not my best score so far (0.80382), but not bad for one command and a few code changes, I think.
