Machine Learning 101 with Microsoft ML.NET (part 2/3)

Daniel Costea
Senior Software Developer @ EU Agency

Model builder pipeline

At the end of the first part of this series we had a preprocessing pipeline ready to load the data and concatenate the features selected for training into one special column called Features, plus a target column named Label that holds the category we want to predict. If our dataset has no column named Label, we have to annotate the target field as follows (for a different problem, of course, the target field would be a different one):

[ColumnName("Label")]
public string Source { get; set; }

ML Context

Before building the training pipeline, let me introduce MLContext, the catalog container of ML.NET. Inside this catalog we can find all the trainers, data loaders, data transformers and predictors, which we can use for a large variety of tasks such as regression, classification, anomaly detection, recommendation, clustering, forecasting, image classification and object detection. Many of them are shipped in additional NuGet packages.

The seed parameter of the MLContext constructor is used internally by splitters and by some trainers to make model training deterministic (this is very useful for unit tests).
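As a minimal sketch, creating the context with a fixed seed could look like this. It assumes the Microsoft.ML NuGet package is referenced; the seed value 1 is arbitrary and chosen only for illustration.

```csharp
using Microsoft.ML;

// Any fixed integer works; the same seed gives repeatable splits and training runs.
var mlContext = new MLContext(seed: 1);
```

Omitting the seed makes the context non-deterministic, which is usually fine outside of tests.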

What is an ML trainer?

Every kind of ML.NET task offers multiple training algorithms (trainers), which can be found in the corresponding trainer catalog. For example, Stochastic Dual Coordinate Ascent, which we use in this article, is available as Sdca (for regression), SdcaNonCalibrated and SdcaLogisticRegression (for binary classification), and SdcaNonCalibrated and SdcaMaximumEntropy (for multiclass classification).

Let's go back to the preprocessing pipeline created in the first part of this article:

var featureColumns = new string[] { 
  "Temperature", 
  "Luminosity", 
  "Infrared", 
  "Distance" };

var preprocessingPipeline = 
  mlContext.Transforms.Conversion
  .MapValueToKey("Label")
  .Append(mlContext.Transforms
    .Concatenate("Features", 
    featureColumns));

We can now continue by extending the pipeline and selecting a trainer.

var trainingPipeline = 
  preprocessingPipeline
  .Append(mlContext
    .MulticlassClassification
    .Trainers
    .SdcaNonCalibrated("Label", 
    "Features"));

We are not done until we add the post-processing step, which in our case simply maps the key back to a value in order to make the prediction human readable (see the mapping from value to key in the preprocessing pipeline above).

var postprocessingPipeline = 
  trainingPipeline
    .Append(
     mlContext.Transforms
       .Conversion
       .MapKeyToValue
      ("PredictedLabel"));

Is the training pipeline enough to create the model? Of course not: we need to feed in some data in order to get the model trained.

Please note that until we call the Fit method on our pipeline, no data is read from the loader except its schema. That's because the DataView is lazily evaluated (you can think of it as a LINQ IEnumerable).

var model = postprocessingPipeline.Fit(trainingData);
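The LINQ analogy can be made concrete with a small sketch that does not use ML.NET at all. ReadRows is a hypothetical stand-in for a data source that counts the rows it actually reads: composing the query reads nothing, and only enumerating it (the counterpart of calling Fit) pulls the rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int rowsRead = 0;

// Hypothetical data source: increments the counter for every row that is really read.
IEnumerable<int> ReadRows()
{
    for (int i = 0; i < 3; i++)
    {
        rowsRead++;
        yield return i;
    }
}

// Composing the query reads nothing yet (comparable to building a pipeline).
IEnumerable<int> query = ReadRows().Select(x => x * 2);
int readBeforeEnumeration = rowsRead;

// Enumerating the query pulls the rows (comparable to calling Fit).
List<int> materialized = query.ToList();
int readAfterEnumeration = rowsRead;

Console.WriteLine($"{readBeforeEnumeration} -> {readAfterEnumeration}"); // prints "0 -> 3"
```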

Weights and biases

At its very core, a trainer is a generic algorithm that tunes its weights and biases by training on data. In general, the more data we have, the better the tuning.

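// Here "model" is assumed to be the prediction transformer produced by the
// trainer (for example trainingPipeline.Fit(trainingData).LastTransformer),
// not the whole chain that ends with MapKeyToValue.
VBuffer<float>[] weights = default;
model.Model.GetWeights(ref weights,
  out int numClasses);

var biases = model.Model.GetBiases();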

Model Builder and Automated ML

Maybe the first concern for a developer when it comes to Machine Learning is choosing a good trainer, and this should not be a task reserved for data scientists. ML.NET includes an excellent technology called Automated ML (AutoML), which comes in different flavors:

ML.NET Model Builder (we can find it as a Machine Learning visual tool in Visual Studio 2017+), used to experiment with different Machine Learning tasks on datasets and to generate ready-to-use C# code
ML.NET CLI tool (we can install it as a global tool). Once the tool is installed, we can choose a Machine Learning task and a dataset to generate the ML.NET model, as well as ready-to-use C# code (pretty similar to the Model Builder).

mlnet regression --dataset "sensors_data.csv" --train-time 600

Plain C#/F# code (which we can integrate in our applications)

// ExperimentTime is the maximum experiment time, in seconds
var experimentResult = mlContext.Auto()
  .CreateMulticlassClassificationExperiment(
     ExperimentTime)
  .Execute(trainingDataView);

var bestRun = experimentResult.BestRun;
var model = bestRun.Model;

Measure the model quality

Validate the model

I have already mentioned model performance a few times, but how can we measure it?

A well-known validation technique is k-fold cross-validation, which can also be used to tune the hyperparameters. A hyperparameter is a parameter that controls the learning process; we can find it among the parameters of the trainer's method and change it there. Model parameters are internal to the model and are learned directly from the training data, whereas hyperparameters are not.

var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(trainingData, postprocessingPipeline, numberOfFolds: 5, labelColumnName: "Label");

Cross-validation checks the model for consistency. It basically:

Divides the dataset into k sub-datasets (folds)
Trains the model on k - 1 folds and leaves one fold out for testing
Evaluates the model on the testing fold
Repeats the previous steps with another set of k - 1 folds, so that each fold is used for testing exactly once
Computes the average metrics, standard deviation and confidence interval
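The steps above can be sketched in plain C#, independently of ML.NET. KFoldSplit and CrossValidate are hypothetical helper names, the folds are assigned round-robin without shuffling (a real implementation would normally shuffle the rows first), and the training and scoring logic is supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Splits row indices 0..rowCount-1 into k folds (round-robin) and, for each fold,
// returns the training indices (the other k - 1 folds) and the testing indices.
static List<(int[] Train, int[] Test)> KFoldSplit(int rowCount, int k)
{
    var splits = new List<(int[] Train, int[] Test)>();
    for (int fold = 0; fold < k; fold++)
    {
        int[] test = Enumerable.Range(0, rowCount).Where(i => i % k == fold).ToArray();
        int[] train = Enumerable.Range(0, rowCount).Where(i => i % k != fold).ToArray();
        splits.Add((train, test));
    }
    return splits;
}

// Trains and scores once per fold, then averages the k scores.
static double CrossValidate(int rowCount, int k, Func<int[], int[], double> trainAndScore)
    => KFoldSplit(rowCount, k).Average(s => trainAndScore(s.Train, s.Test));
```

With 10 rows and k = 5, every split has 8 training rows and 2 testing rows, and every row lands in a testing fold exactly once.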

The image below shows a graphical representation of 5-fold cross-validation.

Image 1 - cross-validation

Please note that cross-validation is pretty time-consuming, since it trains the model k times. This should not worry us too much, since it is not done in production. More importantly, it is a good choice when only a limited amount of data is available, because cross-validation re-uses the data from the training dataset.

Our macro-accuracy and micro-accuracy are around 95%, which is very good. If the accuracy is bad, we can presume that our model is underfitting and that we need more features carrying relevant information, or maybe just more data. If the model performs badly when validated on the training dataset, we cannot expect it to perform well on the testing dataset.

Dimensionality Reduction

Our intuition tells us that, from the prediction quality perspective, we need as many features as possible, but from the speed perspective we need as few features as we can get away with to solve our problem. Indeed, sometimes a faster Machine Learning model is preferable to a more accurate one. The Correlation Matrix (described in more detail in the first part of this article) is just one way to decide which features to keep for classification and regression models. Another way is PFI (Permutation Feature Importance).

PFI (Permutation Feature Importance)

By randomly shuffling the values of one feature at a time, we measure the importance of that feature as the increase in the model's prediction error after the permutation. A larger change indicates a more important feature, so we can build our model on a subset of the most meaningful features. In this way we can potentially reduce noise and training time.

When we add a feature that is strongly correlated with an existing one, the importance of the existing feature is expected to decrease, because the model can get the same information from either of them.

In the above image, we can see that the Day and Hour features, and maybe even Distance, can be removed without notable impact on the model performance.

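// "model" is assumed here to be the chain that ends with the trainer
// (trainingPipeline.Fit(...)), so that LastTransformer is the predictor.
var transformedData = model.Transform(trainingData);
var linearPredictor = model.LastTransformer;
var permutationMetrics = mlContext
  .MulticlassClassification
  .PermutationFeatureImportance(linearPredictor, 
    transformedData);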
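To make the idea behind PFI concrete, here is a minimal sketch in plain C# that does not depend on ML.NET. The names Accuracy and PermutationImportance are hypothetical, the predictor is any function supplied by the caller, and importance is measured as the drop in accuracy (equivalent to the increase in error) after shuffling a single column.

```csharp
using System;
using System.Linq;

// Fraction of rows for which the predictor returns the expected label.
static double Accuracy(float[][] rows, int[] labels, Func<float[], int> predict)
    => rows.Select((row, i) => predict(row) == labels[i] ? 1.0 : 0.0).Average();

// Importance of feature f = baseline accuracy minus accuracy after shuffling column f.
static double[] PermutationImportance(
    float[][] rows, int[] labels, Func<float[], int> predict, int seed)
{
    var random = new Random(seed);
    double baseline = Accuracy(rows, labels, predict);
    int featureCount = rows[0].Length;
    var importance = new double[featureCount];

    for (int f = 0; f < featureCount; f++)
    {
        // Shuffle the values of column f only (Fisher-Yates); other columns stay intact.
        float[] column = rows.Select(r => r[f]).ToArray();
        for (int i = column.Length - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            (column[i], column[j]) = (column[j], column[i]);
        }

        float[][] permuted = rows
            .Select((r, i) => { var copy = (float[])r.Clone(); copy[f] = column[i]; return copy; })
            .ToArray();

        importance[f] = baseline - Accuracy(permuted, labels, predict);
    }
    return importance;
}
```

A feature that the predictor never looks at gets an importance of exactly 0, because shuffling it cannot change any prediction.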

Evaluate the model

Evaluation uses the same kind of metrics as validation, but its scope is very different. The evaluation is done on the testing dataset (unseen data), which was put aside when we split the original dataset. The testing dataset didn't take part in the training process at all. We can tell that a model is overfitting if it performs well on the training dataset but badly on the testing dataset.

var predictions = model.Transform(testingData);
var metrics = mlContext
  .MulticlassClassification
  .Evaluate(
   predictions, "Label", "Score", "PredictedLabel");

Normalization

The first impulse is to consider the impact (or the weight, if you like) of each feature on the model as being the same. Irrespective of their type (texts, categories, singles, doubles), for Machine Learning they are all numbers and, obviously, values from different ranges will have different impacts. Therefore, the values have to be normalized (brought into the same range). If we do not normalize our features, we risk training a bad model.
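As an illustration of what bringing values into the same range means, here is a sketch of the textbook min-max formula, (x - min) / (max - min), in plain C#. MinMaxScale is a hypothetical helper; ML.NET's own NormalizeMinMax transform has additional options and may not match this formula value for value.

```csharp
using System;
using System.Linq;

// Rescales one feature column to the [0, 1] range: (x - min) / (max - min).
static float[] MinMaxScale(float[] values)
{
    float min = values.Min();
    float max = values.Max();
    float range = max - min;

    // A constant column carries no information; map it to zeros to avoid dividing by zero.
    if (range == 0f) return new float[values.Length];

    return values.Select(v => (v - min) / range).ToArray();
}
```

For example, the column 10, 20, 30 becomes 0, 0.5, 1, no matter whether the original values were degrees, lux or centimeters.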

Some trainers don't require explicit normalization (they normalize the features implicitly), but other trainers do. Fortunately, we don't have to know the algorithm in order to decide that, because the official documentation describes the characteristics of each trainer.

Trainer Characteristics

Machine learning task	Multiclass classification
Is normalization required?	Yes
Is caching required?	No
Required NuGet in addition to Microsoft.ML	None
Exportable to ONNX	Yes

var preprocessingPipeline = mlContext
  .Transforms
  .Conversion.MapValueToKey("Label")
  .Append(mlContext
    .Transforms
    .CustomMapping
      (CustomMappings.IncomeMapping, 
      nameof(CustomMappings.IncomeMapping)))
  .Append(mlContext.Transforms
    .Concatenate("Features", featureColumns))
    .Append(mlContext.Transforms
       .NormalizeMinMax("Features"));

Evaluation Metrics

Micro-Accuracy aggregates the contributions of all classes to compute the average metric.

The closer to 1.00, the better.

In a multi-class classification task, micro-accuracy is preferable to macro-accuracy if you suspect there might be class imbalance.

Macro-Accuracy is the average accuracy at class level. The accuracy for each class is computed and the macro-accuracy is the average of these accuracies.

The closer to 1.00, the better.
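The difference between the two accuracies is easiest to see on a confusion matrix. The sketch below is plain C# with hypothetical helper names; it assumes confusion[actual][predicted] holds the number of rows of class "actual" that were predicted as class "predicted".

```csharp
using System;
using System.Linq;

// Micro-accuracy: correct predictions over all rows, so frequent classes weigh more.
static double MicroAccuracy(int[][] confusion)
{
    double correct = confusion.Select((row, c) => row[c]).Sum();
    double total = confusion.Sum(row => row.Sum());
    return correct / total;
}

// Macro-accuracy: per-class accuracy (correct / rows of that class), averaged with equal weight.
static double MacroAccuracy(int[][] confusion)
    => confusion.Select((row, c) => (double)row[c] / row.Sum()).Average();
```

With an imbalanced, made-up matrix such as { 90, 10 } and { 5, 5 }, the micro-accuracy is 95 / 110 (about 0.86), while the macro-accuracy is (0.9 + 0.5) / 2 = 0.7, because the small class counts as much as the large one.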

Log-Loss measures the performance of a classification model where the prediction input is a probability value between 0.00 and 1.00.

The closer to 0.00, the better.

The goal of our Machine Learning models is to minimize this value.

Log-Loss reduction can be interpreted as the advantage of the classifier over a random prediction.

It ranges from -inf to 1.00, where 1.00 indicates perfect predictions and 0.00 indicates mean predictions.

For example, if the value equals 0.20, it can be interpreted that "the probability of a correct prediction is 20% better than random guessing".
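Written as formulas, the two log-loss metrics can be sketched as follows. Here m is the number of evaluated rows, p_i is the probability the model assigns to the true class of row i, and the "prior" log-loss is assumed to be the log-loss of a baseline that only predicts the class frequencies; the symbol names are chosen for this illustration.

```latex
\mathrm{LogLoss} = -\frac{1}{m}\sum_{i=1}^{m}\ln\left(p_i\right)
\qquad
\mathrm{LogLossReduction} =
  \frac{\mathrm{LogLoss}_{\mathrm{prior}} - \mathrm{LogLoss}_{\mathrm{model}}}
       {\mathrm{LogLoss}_{\mathrm{prior}}}
```

For instance, a prior log-loss of 1.0 and a model log-loss of 0.8 give a reduction of (1.0 - 0.8) / 1.0 = 0.20.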

Confusion Matrix

In addition to micro-accuracy, macro-accuracy and log-loss, we can compute the Confusion Matrix, a table that describes the performance of the model per category. Using the testing dataset, we make predictions, compare the predicted results with the actual results for each category, and arrange the counts in a Confusion Matrix.

var metrics = mlContext.MulticlassClassification
  .Evaluate
   (predictions, "Label", "Score", "PredictedLabel");

Console.WriteLine(metrics
  .ConfusionMatrix.GetFormattedConfusionTable());

In the previous diagram we can see that the model predicted the FlashLight class 68 times: 66 times correctly and twice incorrectly (once when the actual source was Day and once when it was Lighter). This results in a precision of 66 / 68 = 0.9706, or 97.06% if you like.

Recall (or sensitivity) is another way to measure the quality of our model: it is the share of the rows that actually belong to a class which the model recognized as such. For example, Infrared was predicted once when the actual source was FlashLight, which gives Infrared a precision of 98% for a recall of 100%. As another example, every Day prediction is correct (100% precision), but the recall of Day is only 96.55% (which is not so bad!).
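Precision and recall can both be read from the confusion matrix, as in the following plain C# sketch. The helper names are hypothetical, and confusion[actual][predicted] is assumed to hold the row counts: precision looks down a column (everything predicted as a class), recall looks along a row (everything that actually belongs to a class).

```csharp
using System;
using System.Linq;

// Of all rows predicted as class c, how many really are class c?
static double Precision(int[][] confusion, int c)
    => (double)confusion[c][c] / confusion.Sum(row => row[c]);

// Of all rows that really are class c, how many were predicted as class c?
static double Recall(int[][] confusion, int c)
    => (double)confusion[c][c] / confusion[c].Sum();
```

With illustrative counts such as { 66, 0, 0 }, { 1, 28, 0 } and { 1, 0, 30 } (made up, except that the first column mirrors the 66 correct and 2 incorrect predictions discussed above), Precision for class 0 is 66 / 68 and Recall for class 1 is 28 / 29.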

Save the model

After experimenting with different sets of features, trainers and parameters while building a Machine Learning model, we choose the one that best fits our needs. Most likely we need the model in a production scenario, so we have to persist it to a physical file, either in the native (ML.NET) format or in the ONNX format (a portable format developed by Microsoft and Facebook and adopted by other big players too). Unsurprisingly, if we take a look at the files of the native ML.NET model, we realize it's a zip of plain text files containing numbers (weights and biases, or coefficients).

Saving the model in native format:

mlContext.Model.Save(model,
  trainingData.Schema, 
  MODEL_PATH);

Saving the model in ONNX format:

using (var stream = System.IO.File.Create(
  ONNX_MODEL_PATH))
{
  mlContext.Model.ConvertToOnnx(
    model, trainingData, stream);
}

Model Re-train

In terms of time, training a model is very expensive. Training with more data makes the model better, but what should we do once the model is built? Should we rebuild it from scratch with a larger dataset? Of course not. We can throw more data (for retrainable trainers only!) into an existing model and re-train it.

For re-training a model, we need to extract the model parameters from the original model (which represent the starting point of the new model) and call the Fit method with the new dataset and the original parameters.

Loading the original model should be as follows:

var model = mlContext.Model.Load("model.zip", out var modelSchema);

We can extract the original parameters and prepare the new dataset:

var originalModelParameters =
  ((ISingleFeaturePredictionTransformer<object>)
  model)
    .Model as LinearMulticlassModelParameters;

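// The preprocessing pipeline is an estimator, so it has to be fitted
// before it can transform the new data.
var transformedNewData = preprocessingPipeline
  .Fit(newData)
  .Transform(newData);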

And then we re-train the model:

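// Check the ML.NET list of retrainable trainers before relying on this call:
// the Fit(data, parameters) overload exists only on them (for multiclass tasks,
// LbfgsMaximumEntropy is one, taking MaximumEntropyModelParameters).
var retrainedModel = mlContext
  .MulticlassClassification
  .Trainers.SdcaNonCalibrated("Label", "Features")
  .Fit(transformedNewData, originalModelParameters);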

Now that we have a Machine Learning model we are ready to consume it and predict some data.