When you use machine learning, everything revolves around the machine learning model. Whether you train your own model or receive one to consume in your production code, you need some insight into how it was trained: the label (or target feature), the input and output data models, and the scenario that was used for training. Along with these details, it's very important to know the accuracy of the model. You may have MLOps tooling to take care of these details, but maybe you don't.
Or consider another scenario: you may want to extend that MLOps functionality by generating the input and output data models (strongly typed C# classes with properties and machine-learning-specific data annotations), or by generating boilerplate code to validate or consume the machine learning model in more complex scenarios such as web services, Blazor, and console applications.
If you are a data scientist, you may want to start from scratch when training a machine learning model. However, if you are a developer, you may prefer Model Builder, a great visual tool that helps you create an ML.NET model starting from your data. Along with the machine learning model, the input and output data models are generated, as well as some boilerplate code to train and consume the model.
If you want to build your ML.NET model from the command line, you can use the ML.NET CLI tool, which covers pretty much all the scenarios available in Model Builder. You can install it as a global .NET tool:
dotnet tool install -g mlnet
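Once installed, training a model takes a single command. For example, a classification scenario could look something like this (the dataset file and column name below are just placeholders):

mlnet classification --dataset "sensors.csv" --label-col "Label" --train-time 60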
So why would we need yet another approach to get the input and output data models, or some boilerplate code to train and consume a machine learning model?
We are about to find out later in the article, but let me first introduce source generators.
Roslyn is the set of open-source compilers and code analysis APIs for .NET. Roslyn source generators (available with C# 9) enable compile-time metaprogramming: code that is created at compile time and added to the compilation.
By definition, metaprogramming is a programming technique in which computer programs have the ability to treat other programs as their data. This means a program can be designed to read, generate, analyze, or transform other programs, and even modify itself while running.
From a C# perspective, source generators enable us to:
hook into the compilation pipeline that is executed for our projects
analyze source files to be compiled
generate new C# source files that are added to the compilation
From a performance perspective, the key point is that this is compile-time metaprogramming. You probably know another kind of metaprogramming, reflection, but reflection runs at runtime.
Let's see what happens in the compiler (from a very high-level perspective):
It reads the C# source code file from the disk.
It parses the text from the C# file and turns it into an object model.
It builds the concrete syntax tree from the object model (note that the tree contains everything, including keywords and white space, so you can go from a syntax tree back to the source code).
The syntax tree is sent into the compilation phase.
The compilation now has your syntax tree and it has your symbolic information.
The syntax tree and the symbolic information are sent to a source generator.
The source generator emits new code.
Please note that there is no mechanism to delete or replace any existing or already generated source code.
You can inspect the syntax tree programmatically from your source generator class, but if you just want to see what it looks like, check out https://sharplab.io/, or you can even build a syntax tree yourself using Roslyn Quoter.
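If you prefer to poke at a syntax tree from code, here is a minimal sketch using the Roslyn APIs (it assumes a console project referencing the Microsoft.CodeAnalysis.CSharp package):

using System;
using Microsoft.CodeAnalysis.CSharp;

// Parse a small snippet into a concrete syntax tree and walk its nodes.
var tree = CSharpSyntaxTree.ParseText(
    "class Sensor { public float Temperature { get; set; } }");
var root = tree.GetRoot();

foreach (var node in root.DescendantNodes())
{
    Console.WriteLine(node.Kind());
}

// The tree is lossless: ToFullString() returns the original source, trivia included.
Console.WriteLine(root.ToFullString());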
You are ready to integrate and consume the model in your code.
But, bad luck: you have only the ML.NET model (which is in fact a *.zip file) and no code associated with it (or generated for it). It's hard to keep track of everything when things are spread over multiple folders and projects. So what can you do?
Of course, you can take a peek inside the ML.NET model *.zip for the schema file. The schema includes all the feature names and types, and you can use this information to write the C# classes for the input and output data models by hand.
But for a native ML.NET model, the schema is a binary file, and extracting what you need from a binary file by hand is not very elegant. You will most likely prefer a programmatic approach. This article presents a practical use case: starting from an ML.NET model *.zip file and generating the missing parts, namely the input and output data models and some boilerplate code to train and consume the machine learning model in various ways.
As I mentioned before, the ML.NET model is a *.zip file. The archive contains files holding the weights and biases resulting from training, as well as the binary schema file.
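The generator you'll see later opens that schema entry through a StreamHelper.GetZipFileStream call. A much-simplified sketch of such a helper could look like the code below (this is not the actual repository code, and locating the schema entry by name is an assumption):

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class StreamHelper
{
    // Opens the ML.NET model archive and returns a stream over the entry
    // that appears to be the binary schema file (the name filter is an assumption).
    public static Stream GetZipFileStream(string zipFilePath)
    {
        // Note: the archive is intentionally kept open while the entry stream is read.
        var archive = ZipFile.OpenRead(zipFilePath);
        var schemaEntry = archive.Entries.First(e =>
            e.FullName.IndexOf("Schema", StringComparison.OrdinalIgnoreCase) >= 0);
        return schemaEntry.Open();
    }
}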
We want to understand the schema header so we can extract the names and types of the features, along with any associated annotations.
The following table describes the structure of the schema header.
Offsets | Type | Name and Description
---|---|---
0 | ulong | Signature: The magic number of this file.
8 | ulong | Version: Indicates the version of the data file.
16 | ulong | CompatibleVersion: Indicates the minimum reader version that can interpret this file, possibly with some data loss.
24 | long | TableOfContentsOffset: The offset to the column table of contents structure.
32 | long | TailOffset: The eight-byte tail signature starts at this offset. So, the entire dataset stream should be considered to have a byte length of eight plus this value.
40 | long | RowCount: The number of rows in this data file.
48 | int | ColumnCount: The number of columns in this data file.
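Based on the table above, reading the header sequentially with a BinaryReader is straightforward. Here is an illustrative sketch (not the exact code from the repository), assuming schemaStream is positioned at the start of the schema file:

using System.IO;
using System.Text;

static (long TableOfContentsOffset, int ColumnCount) ReadSchemaHeader(Stream schemaStream)
{
    using var reader = new BinaryReader(schemaStream, Encoding.UTF8, leaveOpen: true);
    ulong signature = reader.ReadUInt64();           // offset 0: magic number
    ulong version = reader.ReadUInt64();             // offset 8: data file version
    ulong compatibleVersion = reader.ReadUInt64();   // offset 16: minimum reader version
    long tableOfContentsOffset = reader.ReadInt64(); // offset 24: column TOC offset
    long tailOffset = reader.ReadInt64();            // offset 32: tail signature offset
    long rowCount = reader.ReadInt64();              // offset 40: number of rows
    int columnCount = reader.ReadInt32();            // offset 48: number of columns
    return (tableOfContentsOffset, columnCount);
}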
This is enough to generate the input and output data models, but we still need the scenario that was used for training the model. To figure out the training scenario, whether it's binary classification, multiclass classification, regression, or something else, we also need to know which feature is the Label (the target feature).
Since I was not able to find information about the scenario and the label inside the ML.NET model itself, I added two custom metadata attributes, called Scenario and Label, to the AdditionalFiles item in the consuming project's .csproj file.
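In the consuming project file, that can look something like the snippet below (the model file name and the metadata values are just examples); the CompilerVisibleItemMetadata items are what make the custom metadata visible to the generator:

<ItemGroup>
  <AdditionalFiles Include="MLModel1.zip" Scenario="Classification" Label="Label" />
  <CompilerVisibleItemMetadata Include="AdditionalFiles" MetadataName="Scenario" />
  <CompilerVisibleItemMetadata Include="AdditionalFiles" MetadataName="Label" />
</ItemGroup>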
Ideally, the schema could be read from an ML.NET model with the following piece of code:
ITransformer model = mlContext.Model
.Load("C:/Temp/MLModel0.zip", out var modelSchema);
Unfortunately, this was not successful because some dependencies of the Microsoft.ML library do not load correctly, so I had to follow the approach described in this article, namely reading the model schema directly from the *.zip file.
A source generator, whether consumed as a project reference, a DLL, or a NuGet package, is a .NET Standard 2.0 library with dependencies on the Microsoft.CodeAnalysis.Analyzers and Microsoft.CodeAnalysis.CSharp packages. Source generators require C# 9, so you need to set the LangVersion property to "preview" or "9.0".
From the coding perspective, a source generator is a class annotated with the [Generator] attribute that implements the ISourceGenerator interface, which defines the Initialize and Execute methods.
[Generator]
public class DataModelsGenerator : ISourceGenerator
{
    public void Initialize(GeneratorInitializationContext context)
    {
    }

    public void Execute(GeneratorExecutionContext context)
    {
    }
}
The project that uses the source generator references the source generator library. The reference includes the OutputItemType="Analyzer" and ReferenceOutputAssembly="false" attributes, which means the generator DLL is not added to the build output and instead works like an analyzer. If you are familiar with Roslyn analyzers, you know exactly what that means.
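Such a reference in the consuming project's .csproj might look like this (the generator project name and path are just placeholders):

<ItemGroup>
  <ProjectReference Include="..\DataModelsGenerator\DataModelsGenerator.csproj"
                    OutputItemType="Analyzer"
                    ReferenceOutputAssembly="false" />
</ItemGroup>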
By default, the generated C# classes are not emitted as physical files, but we can fix that by setting a few properties, including the output path for the emitted files, as follows:
<LangVersion>preview</LangVersion>
<EmitCompilerGeneratedFiles>true</EmitCompilerGeneratedFiles>
<CompilerGeneratedFilesOutputPath>$(BaseIntermediateOutputPath)GeneratedMLNETFiles</CompilerGeneratedFilesOutputPath>
Up to this point, we have the skeleton of a project using a source generator. Let's move on to adding the functionality that reads the machine learning model schema from the ML.NET *.zip file and generates the C# input and output data models based on the schema information, such as feature names and types.
[Generator]
public class DataModelsGenerator : ISourceGenerator
{
    const string ModelInput = nameof(ModelInput);
    const string ModelOutput = nameof(ModelOutput);
    const string Predictor = nameof(Predictor);
    const string Program = nameof(Program);

    public void Initialize(GeneratorInitializationContext context) { }

    public void Execute(GeneratorExecutionContext context)
    {
        // Read the Scenario metadata attached to the *.zip additional file
        (Scenario? scenario, _) = GetAdditionalFileOptions(context);

        // Locate the ML.NET model (*.zip) among the additional files
        var zipFiles = context.AdditionalFiles
            .Where(f => Path.GetExtension(f.Path)
                .Equals(".zip", StringComparison.OrdinalIgnoreCase));
        var zipFile = zipFiles.ToArray()[0].Path;

        // Open the schema stream from the *.zip and extract the features
        Stream zip = StreamHelper.GetZipFileStream(zipFile);
        using var reader = new BinaryReader(zip, Encoding.UTF8);
        var features = StreamHelper.ExtractFeatures(reader);

        // Generate the ModelInput class from the extracted features
        StringBuilder modelInputBuilder = SyntaxHelper
            .ModelInputBuilder(features, ModelInput);
        SourceText sourceText1 = SourceText
            .From(modelInputBuilder.ToString(), Encoding.UTF8);
        context.AddSource($"{ModelInput}.cs", sourceText1);

        // Generate the ModelOutput class based on the training scenario
        StringBuilder modelOutputBuilder = SyntaxHelper
            .ModelOutputBuilder(ModelOutput, scenario.Value);
        SourceText sourceText2 = SourceText
            .From(modelOutputBuilder.ToString(), Encoding.UTF8);
        context.AddSource($"{ModelOutput}.cs", sourceText2);

        // Generate the Predictor boilerplate for consuming the model
        StringBuilder clientBuilder = SyntaxHelper
            .PredictorBuilder(Predictor, zipFile);
        SourceText sourceText3 = SourceText
            .From(clientBuilder.ToString(), Encoding.UTF8);
        context.AddSource($"{Predictor}.cs", sourceText3);

        // Generate the web API boilerplate (Program with a /predict endpoint)
        StringBuilder webapiBuilder = SyntaxHelper
            .ProgramBuilder(Program, zipFile);
        SourceText sourceText4 = SourceText
            .From(webapiBuilder.ToString(), Encoding.UTF8);
        context.AddSource($"{Program}.cs", sourceText4);
    }

    private (Scenario?, AdditionalText) GetAdditionalFileOptions(
        GeneratorExecutionContext context)
    {
        var file = context.AdditionalFiles.First();
        if (Path.GetExtension(file.Path)
            .Equals(".zip", StringComparison.OrdinalIgnoreCase))
        {
            // Read the custom Scenario metadata declared on the AdditionalFiles item
            context.AnalyzerConfigOptions.GetOptions(file)
                .TryGetValue("build_metadata.additionalfiles.Scenario",
                    out string scenarioValue);
            Enum.TryParse(scenarioValue, true, out Scenario scenario);
            return (scenario, file);
        }
        return (null, null);
    }
}
The StreamHelper.* and SyntaxHelper.* methods are not included in the code above, but you can find the whole solution on GitHub: https://github.com/dcostea/Dragonfly
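To give you a flavor of what such a helper can look like, here is a much-simplified sketch of a ModelInputBuilder-style method (not the actual repository code; the Feature type and its properties are assumptions):

using System.Collections.Generic;
using System.Text;

public class Feature
{
    public string Name { get; set; }
    public string CSharpType { get; set; }
}

public static class SyntaxHelper
{
    // Builds the source for the ModelInput class: one property per feature,
    // each decorated with a LoadColumn attribute carrying its column index.
    public static StringBuilder ModelInputBuilder(IReadOnlyList<Feature> features, string className)
    {
        var builder = new StringBuilder();
        builder.AppendLine("using Microsoft.ML.Data;");
        builder.AppendLine("namespace GeneratedDataModels");
        builder.AppendLine("{");
        builder.AppendLine($"    public class {className}");
        builder.AppendLine("    {");
        for (int i = 0; i < features.Count; i++)
        {
            builder.AppendLine($"        [LoadColumn({i})]");
            builder.AppendLine($"        public {features[i].CSharpType} {features[i].Name} {{ get; set; }}");
        }
        builder.AppendLine("    }");
        builder.AppendLine("}");
        return builder;
    }
}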
Now that we have the source generator class, we can reference and use it in our project.
Here is the freshly generated C# code, built from the information in the binary schema file inside the ML.NET *.zip model.
using System;
using Microsoft.ML.Data;

namespace GeneratedDataModels
{
    public class ModelInput
    {
        [LoadColumn(0)]
        public float Temperature { get; set; }
        [LoadColumn(1)]
        public float Luminosity { get; set; }
        [LoadColumn(2)]
        public float Infrared { get; set; }
        [LoadColumn(3)]
        public float Distance { get; set; }
        [LoadColumn(4)]
        public string CreatedAt { get; set; }
        [LoadColumn(5)]
        public string Label { get; set; }
    }
}
using System;
using Microsoft.ML.Data;

namespace GeneratedDataModels
{
    public class ModelOutput
    {
        [ColumnName("PredictedLabel")]
        public string Prediction { get; set; }
        public float[] Score { get; set; }
    }
}
Now it is a piece of cake to generate some boilerplate code for consuming the ML.NET model, just by starting from the ML.NET *.zip file alone!
using System;
using Microsoft.ML;

namespace GeneratedDataModels
{
    public class Predictor
    {
        public static ModelOutput Predict(ModelInput sampleData)
        {
            MLContext mlContext = new MLContext(seed: 1);
            ITransformer model = mlContext.Model
                .Load("C:/Temp/MLModel1.zip", out var modelSchema);
            var predictor = mlContext.Model
                .CreatePredictionEngine<ModelInput, ModelOutput>(model);
            var predicted = predictor.Predict(sampleData);
            return predicted;
        }
    }
}
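Consuming the generated predictor then boils down to a few lines (the sample values below are made up):

var sample = new ModelInput
{
    Temperature = 24.5f,
    Luminosity = 105f,
    Infrared = 12.5f,
    Distance = 40f,
    CreatedAt = "01/01/2021 10:30:00"
};

ModelOutput result = Predictor.Predict(sample);
Console.WriteLine($"Predicted label: {result.Prediction}");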
Or maybe you would like it in a web API service flavor:
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.ML;
using System.Text.Json;
using System.Threading.Tasks;

namespace GeneratedDataModels
{
    class Program
    {
        static void Main(string[] args)
        {
            WebHost.CreateDefaultBuilder()
                .ConfigureServices(services =>
                {
                    // Register a pooled prediction engine for the ML.NET model
                    services.AddPredictionEnginePool<ModelInput, ModelOutput>()
                        .FromFile("C:/Temp/MLModel1.zip");
                })
                .Configure(app =>
                {
                    app.UseHttpsRedirection();
                    app.UseRouting();
                    app.UseEndpoints(routes =>
                    {
                        routes.MapPost("/predict", PredictHandler);
                    });
                })
                .Build()
                .Run();
        }

        static async Task PredictHandler(HttpContext http)
        {
            // Resolve the prediction engine pool, deserialize the input,
            // run the prediction, and write it back as JSON
            var predEngine = http.RequestServices
                .GetRequiredService<PredictionEnginePool<ModelInput, ModelOutput>>();
            var input = await JsonSerializer
                .DeserializeAsync<ModelInput>(http.Request.Body);
            var prediction = predEngine.Predict(input);
            await http.Response.WriteAsJsonAsync(prediction);
        }
    }
}
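Once the service is running, any HTTP client can call the /predict endpoint. Here's a small sketch using HttpClient and the System.Net.Http.Json extensions (the URL and port are assumptions for a local run):

using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

class PredictClient
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var sample = new ModelInput
        {
            Temperature = 24.5f,
            Luminosity = 105f,
            Infrared = 12.5f,
            Distance = 40f
        };

        // POST the sample to the /predict endpoint and read the prediction back.
        var response = await http.PostAsJsonAsync("https://localhost:5001/predict", sample);
        var prediction = await response.Content.ReadFromJsonAsync<ModelOutput>();
        Console.WriteLine(prediction?.Prediction);
    }
}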
OK, I know I may have exaggerated with the volume of code above, but I want you to see that source generators are wonderfully helpful for writing repetitive, boring code, and all of it happens at compile time! Indeed, you can think of a source generator as a kind of code analyzer.
We don't have to stop here: code can also be generated to validate the ML.NET model, to measure the quality of the model, or to suggest model training pipelines.
Another use case could be creating your own model builder starting from the header of your dataset. Your imagination is the limit.