The Fundamentals of ML.NET

What we gonna do?

You can train a model in 15 lines of Python. But can you keep that model running reliably in production — with fresh data, automated retraining, and zero Python runtime in your deployment pipeline? That gap between a working notebook and a working service is the domain of data engineering.

ML.NET is Microsoft's open-source machine learning framework built natively for .NET. It lets .NET developers own the entire machine learning lifecycle — from schema definition and data transformation to training, evaluation, and serving predictions — entirely in C#.

This article covers the three professional roles that make ML projects succeed in production, why data engineering is the hardest part of the job, and how to build a complete ML.NET training pipeline step by step — from raw CSV data to a saved regression model ready for an ASP.NET Core application.

Why we gonna do?

Most conversations about machine learning jump straight to algorithms. Pick the right model, tune the hyperparameters, hit 95% accuracy. The end. That framing obscures where projects actually fail: not in the model, but in the infrastructure that feeds it.

The Three Roles Every ML Project Needs

A data scientist explores raw data, identifies patterns, and builds prototype models. The output is a runnable experiment — often accurate, rarely production-ready. Python is the natural habitat: Jupyter notebooks and pandas make experimentation fast. The downside is that this code is typically written at the quality level of a proof-of-concept, not a maintainable service.

A data engineer takes those findings and builds the infrastructure to make them work at scale. That means automated ETL (Extract-Transform-Load) pipelines, structured data stores, scheduled retraining jobs, and versioned model artefacts. These are engineering problems, not statistics problems — and they demand the same rigour as any other production software component.

An ML engineer sits at the intersection: a software developer with enough ML knowledge to incorporate a trained model into a production application. They own the API endpoint that serves predictions, the monitoring that detects model drift, and the retraining workflow when data distributions shift.

In practice, these roles overlap considerably. A .NET developer who learns ML.NET can grow into all three — without changing languages, runtimes, or deployment strategies.

Why Data Quality Outweighs Algorithm Choice

The algorithm is rarely the bottleneck. Any reasonably chosen regression or classification algorithm converges on a useful result when the data is clean, complete, and representative. The bottleneck is almost always the pipeline that prepares the data — especially in production, where data is live, noisy, and changes over time.

Consider a concrete example: your company processes thousands of ride-sharing transactions every day. The business wants to predict the fare before the trip ends — useful for customer communication and revenue forecasting. The data exists: timestamps, distances, passenger counts, payment types. Training a model is straightforward. Keeping it accurate as data volumes grow and vendor categories change is the hard part.

Why Python Falls Short in Production .NET Systems

Python fails here not because of the algorithms but because of the runtime. Integrating a Python-trained model into a .NET application means one of three things: a REST or gRPC wrapper (adding a network hop and a separate process to manage), language bindings with limited platform support, or exporting to ONNX format with its own serialisation trade-offs. Each path adds accidental complexity that has nothing to do with the actual ML problem.

ML.NET eliminates the impedance mismatch entirely. Your schema classes, transformation logic, and inference code all live in the same .NET solution, built and deployed the same way as every other component in your stack.

How we gonna do?

The Dataset: Predicting Ride Fares

We will build a regression pipeline that predicts the fare for a ride-sharing trip based on historical transaction data. Each row in the CSV represents a completed trip with these columns:


Column              | Type    | Example
--------------------|---------|------------------
VendorId            | string  | "VendorA"
RateCode            | string  | "Standard"
PassengerCount      | float   | 2
TripTimeSeconds     | float   | 1320
TripDistanceMiles   | float   | 5.4
PaymentType         | string  | "CreditCard"
FareAmount          | float   | 18.50  <-- predict this

Three columns are categorical text: VendorId, RateCode, and PaymentType. ML algorithms require numbers — converting these correctly is a core data engineering responsibility, not an afterthought.

Step 1: Define the Data Schema

ML.NET maps CSV columns to C# properties using the LoadColumn attribute. The zero-based index matches the column position in the source file. Define two classes: one for the training input and one for the model's prediction output.


using Microsoft.ML.Data;

// Maps to one row in the training CSV file
public class RideInput
{
    [LoadColumn(0)] public string VendorId          { get; set; } = string.Empty;
    [LoadColumn(1)] public string RateCode          { get; set; } = string.Empty;
    [LoadColumn(2)] public float  PassengerCount    { get; set; }
    [LoadColumn(3)] public float  TripTimeSeconds   { get; set; }
    [LoadColumn(4)] public float  TripDistanceMiles { get; set; }
    [LoadColumn(5)] public string PaymentType       { get; set; } = string.Empty;
    [LoadColumn(6)] public float  FareAmount        { get; set; }
}

// Holds the predicted result returned by the trained model
public class RidePrediction
{
    [ColumnName("Score")]
    public float PredictedFare { get; set; }
}

Place these classes in a separate class library so both the training console app and the client application (e.g., an ASP.NET Core API) reference the same schema — no duplication, no drift between the code that trains and the code that serves.

Step 2: Load Data into IDataView

IDataView is ML.NET's core data abstraction — a lazy, cursor-based interface. Loading a large CSV does not read the entire file into memory. Data flows through row by row on demand, keeping memory pressure low even for datasets with millions of rows.


using Microsoft.ML;

var mlContext = new MLContext(seed: 42);

// No data is read from disk yet — the view is lazy
IDataView rawData = mlContext.Data.LoadFromTextFile<RideInput>(
    path: "data/rides-train.csv",
    hasHeader: true,
    separatorChar: ',');

The seed parameter makes random operations reproducible across runs — essential when you need to compare two pipeline configurations fairly.
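The lazy behaviour described above can be illustrated without ML.NET at all. The sketch below is not how IDataView is implemented; it uses a plain C# iterator as a stand-in, and the ReadFares helper and the inline sample rows are made up for illustration. It shows that nothing is parsed until a consumer asks for a row.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

int rowsParsed = 0;

// Hypothetical stand-in for a lazy loader: a line is parsed only when a consumer pulls it.
IEnumerable<float> ReadFares(IEnumerable<string> lines)
{
    foreach (string line in lines)
    {
        rowsParsed++; // counts how many rows were actually touched
        yield return float.Parse(line.Split(',')[6], CultureInfo.InvariantCulture);
    }
}

string[] csvLines =
{
    "VendorA,Standard,2,1320,5.4,CreditCard,18.50",
    "VendorB,Standard,1,600,2.1,Cash,9.00",
    "VendorA,Airport,3,2400,12.0,CreditCard,42.75",
};

IEnumerable<float> fares = ReadFares(csvLines); // declares the view, parses nothing
int parsedBeforeEnumeration = rowsParsed;       // still 0
float firstFare = fares.First();                // pulls exactly one row

Console.WriteLine($"Parsed before enumeration: {parsedBeforeEnumeration}"); // 0
Console.WriteLine($"Parsed after First(): {rowsParsed}");                   // 1
```

The same idea is why declaring a loader or a transformation in ML.NET is cheap: the cost is paid only when something downstream enumerates the rows.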

Step 3: Remove Outliers Before Transforming

Raw data almost always contains outliers: values so far outside the expected range that they corrupt the model's learning signal. In this dataset, a FareAmount below ₹1, or of ₹150 and above, almost certainly represents a data entry error or a test record. Filter these rows out before any transformation runs.


// FilterRowsByColumn uses [lowerBound, upperBound) — lower inclusive, upper exclusive.
// Keep rows where FareAmount is in [1.0, 150.0): ₹1 retained, ₹150 excluded.
IDataView cleanedData = mlContext.Data.FilterRowsByColumn(
    input: rawData,
    columnName: nameof(RideInput.FareAmount),
    lowerBound: 1.0,
    upperBound: 150.0);

Removing outliers is a data quality decision. No algorithm can compensate for training on corrupt data — getting this step right is more valuable than upgrading to a more sophisticated trainer.
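The half-open range is easy to get wrong at the boundaries, so here is the same rule written by hand. This is a minimal sketch using plain LINQ; the InHalfOpenRange helper and the sample fares are hypothetical and only mirror the bounds used above.

```csharp
using System;
using System.Linq;

// Same rule as the filter above, written by hand: keep x when lower <= x < upper.
bool InHalfOpenRange(double value, double lower, double upper)
    => value >= lower && value < upper;

double[] sampleFares = { 0.5, 1.0, 18.5, 149.99, 150.0, 900.0 };

// Keeps 1.0, 18.5 and 149.99; drops 0.5 (too low) and 150.0, 900.0 (at or above the upper bound).
double[] kept = sampleFares.Where(f => InHalfOpenRange(f, 1.0, 150.0)).ToArray();

Console.WriteLine($"Kept {kept.Length} of {sampleFares.Length} rows");
```

If a fare of exactly ₹150 should count as valid in your data, raise the upper bound slightly rather than relying on it being inclusive.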

Step 4: Encode Categorical Columns

OneHotEncoding converts each categorical column into a set of binary columns — one per distinct category value. This preserves the information without implying a numeric ordering that does not exist. "VendorA" is not greater or lesser than "VendorB" — one-hot encoding respects that distinction.


var categoricalPipeline = mlContext.Transforms.Categorical
    .OneHotEncoding(
        outputColumnName: "VendorIdEncoded",
        inputColumnName:  nameof(RideInput.VendorId))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding(
        outputColumnName: "RateCodeEncoded",
        inputColumnName:  nameof(RideInput.RateCode)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding(
        outputColumnName: "PaymentTypeEncoded",
        inputColumnName:  nameof(RideInput.PaymentType)));

Each Append call chains another transformation. Nothing executes yet — you are declaring the recipe, not running it. That recipe will replay identically on every future batch of incoming data.
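To make the encoding concrete, here is a hand-rolled sketch of what one-hot encoding does to a single column. It is not ML.NET's implementation: BuildVocabulary and OneHot are hypothetical helpers, and mapping an unseen category to an all-zero vector is a choice made for this sketch.

```csharp
using System;
using System.Linq;

// Fit step: learn the vocabulary (distinct categories in order of first appearance).
string[] BuildVocabulary(string[] values) => values.Distinct().ToArray();

// Transform step: one slot per known category, 1 in the matching slot, 0 elsewhere.
// A category never seen during fitting yields an all-zero vector in this sketch.
float[] OneHot(string value, string[] vocabulary)
{
    var vector = new float[vocabulary.Length];
    int index = Array.IndexOf(vocabulary, value);
    if (index >= 0) vector[index] = 1f;
    return vector;
}

string[] vendors = { "VendorA", "VendorB", "VendorA", "VendorC" };
string[] vendorVocabulary = BuildVocabulary(vendors); // VendorA, VendorB, VendorC

Console.WriteLine(string.Join(" ", OneHot("VendorB", vendorVocabulary))); // 0 1 0
Console.WriteLine(string.Join(" ", OneHot("VendorZ", vendorVocabulary))); // 0 0 0
```

The split into a fit step and a transform step matters: the vocabulary is learned once from training data and then reused unchanged at prediction time.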

Step 5: Normalise Numeric Columns

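Numeric columns can have very different scales: PassengerCount ranges from 1 to 6 while TripTimeSeconds ranges from 60 to 7200. NormalizeMeanVariance rescales each column using the mean and variance observed in the training data, so that values land on a comparable scale (ideally a mean of 0 and a standard deviation of 1). This prevents columns with larger absolute values from dominating the trainer's learning signal. One caveat: the fixZero parameter defaults to true, which keeps zero mapped to zero to preserve sparsity, so pass fixZero: false if you need strictly zero-centred output.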


var numericPipeline = categoricalPipeline
    .Append(mlContext.Transforms.NormalizeMeanVariance(
        outputColumnName: nameof(RideInput.PassengerCount)))
    .Append(mlContext.Transforms.NormalizeMeanVariance(
        outputColumnName: nameof(RideInput.TripTimeSeconds)))
    .Append(mlContext.Transforms.NormalizeMeanVariance(
        outputColumnName: nameof(RideInput.TripDistanceMiles)));
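The arithmetic behind mean-variance scaling is short enough to write out. The sketch below is the textbook z-score, not a reproduction of ML.NET's internals; FitMeanVariance, Standardize, and the sample trip times are hypothetical.

```csharp
using System;
using System.Linq;

// Fit step: learn the mean and (population) standard deviation from the training column.
(double Mean, double StdDev) FitMeanVariance(double[] values)
{
    double mean = values.Average();
    double variance = values.Select(v => (v - mean) * (v - mean)).Average();
    return (mean, Math.Sqrt(variance));
}

// Transform step: z = (x - mean) / stdDev, reusing the statistics learned above.
double Standardize(double value, double columnMean, double columnStdDev)
    => columnStdDev == 0 ? 0 : (value - columnMean) / columnStdDev;

double[] tripTimes = { 600, 900, 1200, 1500, 1800 };
var stats = FitMeanVariance(tripTimes); // Mean = 1200, StdDev ≈ 424.26

double[] scaled = tripTimes
    .Select(t => Standardize(t, stats.Mean, stats.StdDev))
    .ToArray(); // ≈ -1.414, -0.707, 0, 0.707, 1.414

Console.WriteLine($"Scaled values average to {scaled.Average():F3}");
```

As with one-hot encoding, the statistics come from the training data only and are replayed unchanged on new rows.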

Step 6: Set Up the Label and Features Columns

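ML.NET trainers follow a convention: by default they read the prediction target from a column named Label and the inputs from a single vector column named Features. You can point a trainer at other column names through its labelColumnName and featureColumnName parameters, but sticking to the defaults keeps pipelines uniform. Use CopyColumns to copy the target into Label and Concatenate to merge all prepared feature columns into one numeric vector.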


var featurePipeline = numericPipeline
    // Rename the prediction target to the required "Label" column
    .Append(mlContext.Transforms.CopyColumns(
        outputColumnName: "Label",
        inputColumnName:  nameof(RideInput.FareAmount)))
    // Merge all prepared columns into a single "Features" vector
    .Append(mlContext.Transforms.Concatenate(
        outputColumnName: "Features",
        "VendorIdEncoded",
        "RateCodeEncoded",
        "PaymentTypeEncoded",
        nameof(RideInput.PassengerCount),
        nameof(RideInput.TripTimeSeconds),
        nameof(RideInput.TripDistanceMiles)));

This Label / Features convention is the contract between your transformation pipeline and the trainer. Once you recognise this pattern, the shape of every ML.NET regression or classification pipeline becomes predictable.
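Conceptually, the Features column is just the per-column vectors laid end to end. The following sketch shows that flattening with plain arrays; the Concatenate helper, the category counts, and the numeric values are all assumptions made for illustration.

```csharp
using System;
using System.Linq;

// Flatten several per-column vectors into one feature vector, preserving argument order.
float[] Concatenate(params float[][] parts) => parts.SelectMany(p => p).ToArray();

float[] vendorEncoded  = { 1f, 0f };            // assumes 2 vendor categories
float[] rateEncoded    = { 0f, 1f, 0f };        // assumes 3 rate codes
float[] paymentEncoded = { 1f, 0f };            // assumes 2 payment types
float[] numericColumns = { -0.4f, 0.2f, 1.1f }; // illustrative normalised values

float[] features = Concatenate(vendorEncoded, rateEncoded, paymentEncoded, numericColumns);

Console.WriteLine(features.Length); // 10
```

The width of the final vector therefore depends on how many distinct categories the encoders saw during training, which is one reason the fitted pipeline has to travel with the model.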

Step 7: Add a Regression Trainer

Predicting fare amount is a regression problem: you are forecasting a continuous numeric value, not a category. OnlineGradientDescent, a linear trainer that ships with the core Microsoft.ML package, is a reliable baseline. For higher accuracy on tabular data, the gradient-boosted tree trainer LightGbm (in the separate Microsoft.ML.LightGbm package, installed with dotnet add package Microsoft.ML.LightGbm) often does better because it can capture non-linear feature interactions that a linear model cannot.


// Solid baseline — no extra packages required
var trainer = mlContext.Regression.Trainers.OnlineGradientDescent(
    labelColumnName:   "Label",
    featureColumnName: "Features");

// Complete pipeline: transformations + trainer in one composable object
var trainingPipeline = featurePipeline.Append(trainer);
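If you are curious what an online gradient descent trainer does in principle, the sketch below fits a one-feature linear model by updating the weights after every example. It is a bare-bones illustration with squared loss and a fixed learning rate, not ML.NET's implementation, which offers further options (learning-rate decay, for example) that are omitted here. Predict, TrainEpoch, and the toy data are hypothetical.

```csharp
using System;

// Prediction of a linear model: bias + sum of weight * feature.
float Predict(float[] x, float[] weights, float bias)
{
    float sum = bias;
    for (int i = 0; i < x.Length; i++) sum += weights[i] * x[i];
    return sum;
}

// One pass over the data, updating after every single example (the "online" part).
// Weights are updated in place; the updated bias is returned.
float TrainEpoch(float[][] xs, float[] ys, float[] weights, float bias, float learningRate)
{
    for (int n = 0; n < xs.Length; n++)
    {
        float error = Predict(xs[n], weights, bias) - ys[n]; // gradient of 0.5 * error^2
        for (int i = 0; i < weights.Length; i++)
            weights[i] -= learningRate * error * xs[n][i];
        bias -= learningRate * error;
    }
    return bias;
}

// Toy data generated from y = 2x + 1
float[][] inputs = { new[] { 0f }, new[] { 1f }, new[] { 2f }, new[] { 3f } };
float[] targets = { 1f, 3f, 5f, 7f };

float[] learnedWeights = { 0f };
float learnedBias = 0f;
for (int epoch = 0; epoch < 500; epoch++)
    learnedBias = TrainEpoch(inputs, targets, learnedWeights, learnedBias, learningRate: 0.05f);

// The learned weight approaches 2 and the bias approaches 1.
Console.WriteLine($"w = {learnedWeights[0]:F3}, b = {learnedBias:F3}");
```

Because each update scales with the feature value, a column measured in thousands would swamp one measured in single digits, which is why the normalisation step matters for this kind of trainer.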

Step 8: Train and Evaluate

Calling Fit executes the entire pipeline for the first time: it applies all transformations, feeds the prepared data to the trainer, and returns a trained ITransformer. Evaluation runs on a held-out test split to measure how well the model generalises to data it has never seen during training.


// Reserve 20% of the cleaned data for evaluation
var split = mlContext.Data.TrainTestSplit(cleanedData, testFraction: 0.2);

// Fit executes all transformations and trains the model
ITransformer trainedModel = trainingPipeline.Fit(split.TrainSet);

// Apply the trained model to the test split and compute metrics
IDataView testPredictions = trainedModel.Transform(split.TestSet);

var metrics = mlContext.Regression.Evaluate(
    testPredictions,
    labelColumnName: "Label",
    scoreColumnName: "Score");

Console.WriteLine($"R²   (closer to 1.0 is better): {metrics.RSquared:F4}");
Console.WriteLine($"MAE  (lower is better):          {metrics.MeanAbsoluteError:F4}");
Console.WriteLine($"RMSE (lower is better):          {metrics.RootMeanSquaredError:F4}");

// Example output:
// R²   (closer to 1.0 is better): 0.8721
// MAE  (lower is better):          1.4230
// RMSE (lower is better):          2.1850

An R² of 0.87 means the model explains 87% of the variance in fare amounts — a strong result for a first-pass baseline. If accuracy is insufficient, the most productive next step is improving the data by adding missing features (such as pick-up zone or traffic conditions) rather than swapping algorithms.
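The three metrics are simple to compute by hand, which helps when you need to sanity-check a report. This is a minimal sketch using the standard definitions; the four actual and predicted fares are made-up numbers, not output from the model above.

```csharp
using System;
using System.Linq;

double[] actual    = { 10.0, 20.0, 30.0, 40.0 };
double[] predicted = { 12.0, 18.0, 33.0, 37.0 };

// Per-row error: actual minus predicted gives -2, 2, -3, 3.
double[] errors = actual.Zip(predicted, (a, p) => a - p).ToArray();

double mae  = errors.Average(e => Math.Abs(e));      // 2.5
double rmse = Math.Sqrt(errors.Average(e => e * e)); // sqrt(6.5) ≈ 2.5495

// R² = 1 - (residual sum of squares / total sum of squares around the mean).
double meanActual = actual.Average();                                   // 25
double ssResidual = errors.Sum(e => e * e);                             // 26
double ssTotal    = actual.Sum(a => (a - meanActual) * (a - meanActual)); // 500
double rSquared   = 1 - ssResidual / ssTotal;                           // 0.948

Console.WriteLine($"MAE {mae:F4}, RMSE {rmse:F4}, R² {rSquared:F4}");
```

RMSE is larger than MAE here because squaring penalises the two larger errors more heavily; a wide gap between the two is a hint that a few rows are predicted very badly.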

Step 9: Save and Serve Predictions

A trained model is useless until the application that needs predictions can load it. ML.NET serialises the entire pipeline, transformations and model weights together, into a single .zip file.


// ---- Trainer project: save the trained pipeline to disk ----
mlContext.Model.Save(trainedModel, cleanedData.Schema, "ride-fare-model.zip");

// ---- Client project (e.g. ASP.NET Core API): load and serve predictions ----
var loadedModel = mlContext.Model.Load(
    "ride-fare-model.zip",
    out DataViewSchema inputSchema);

// For a console app or single-threaded batch job, PredictionEngine is the simplest option.
// PredictionEngine is not thread-safe, so in an ASP.NET Core service use PredictionEnginePool
// (Microsoft.Extensions.ML), which pools engines so concurrent requests can share them safely.
var predictor = mlContext.Model
    .CreatePredictionEngine<RideInput, RidePrediction>(loadedModel);

var sampleRide = new RideInput
{
    VendorId          = "VendorA",
    RateCode          = "Standard",
    PassengerCount    = 1,
    TripTimeSeconds   = 900,
    TripDistanceMiles = 3.8f,
    PaymentType       = "CreditCard",
    FareAmount        = 0  // unknown — this is what we are predicting
};

RidePrediction result = predictor.Predict(sampleRide);
Console.WriteLine($"Predicted fare: ₹{result.PredictedFare:F2}");
// Example output: Predicted fare: ₹13.45

The Predict call applies all stored transformations from the saved pipeline automatically — your input does not need to be pre-processed manually. The zip file contains everything: the OneHotEncoding vocabulary, the normalisation statistics, and the trained weights.
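For the ASP.NET Core case, registration with PredictionEnginePool might look like the sketch below. It assumes the Microsoft.Extensions.ML package is installed, that the RideInput and RidePrediction classes from Step 1 are referenced, and that the model file sits next to the application. The model name "RideFareModel" and the /predict-fare route are arbitrary choices for this example.

```csharp
using Microsoft.Extensions.ML;

var builder = WebApplication.CreateBuilder(args);

// Register a pool of prediction engines backed by the saved pipeline.
// "RideFareModel" is an arbitrary name chosen for this sketch.
builder.Services
    .AddPredictionEnginePool<RideInput, RidePrediction>()
    .FromFile(modelName: "RideFareModel", filePath: "ride-fare-model.zip", watchForChanges: true);

var app = builder.Build();

// Hypothetical endpoint: the pool is injected and lends out an engine for each call.
app.MapPost("/predict-fare",
    (PredictionEnginePool<RideInput, RidePrediction> pool, RideInput ride) =>
        pool.Predict("RideFareModel", ride));

app.Run();
```

With watchForChanges set to true, replacing the .zip file on disk lets the service pick up a retrained model without a redeploy, which fits the scheduled retraining workflow described earlier.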

The Complete Pipeline at a Glance


Raw CSV data
    │
    ▼
FilterRowsByColumn          ← Keep rows where 1 ≤ FareAmount < 150
    │
    ▼
OneHotEncoding              ← VendorId, RateCode, PaymentType → binary columns
    │
    ▼
NormalizeMeanVariance       ← PassengerCount, TripTimeSeconds, TripDistanceMiles
    │
    ▼
CopyColumns                 ← FareAmount → "Label"
    │
    ▼
Concatenate                 ← All encoded/normalised columns → "Features"
    │
    ▼
OnlineGradientDescent       ← Regression trainer
    │
    ▼
ITransformer (trained model)
    ├── Evaluate → R², MAE, RMSE
    └── Save     → ride-fare-model.zip

Summary

Data engineering is the bridge between a prototype model and a production ML system. Here is what we covered:

Machine learning projects involve three overlapping roles — data scientist, data engineer, and ML engineer — and .NET developers can grow into all three without leaving the C# ecosystem.
IDataView is lazy and cursor-based, keeping memory usage low even for large datasets.
Use OneHotEncoding for categorical text columns and NormalizeMeanVariance for numeric columns. Trainers cannot consume raw text, and scale-sensitive linear trainers such as OnlineGradientDescent learn poorly from unnormalised inputs.
FilterRowsByColumn removes outliers before training — data quality consistently outweighs algorithm choice.
The Label / Features column convention is the contract between your transformation pipeline and the trainer.
Save the trained pipeline as a .zip file so your client application can load and serve predictions without retraining. Use PredictionEnginePool from Microsoft.Extensions.ML when serving in ASP.NET Core.

Want to explore further? Read "An Architectural View of ML.NET" for a deeper look at MLContext, supported ML tasks, and evaluation strategies. For a hands-on computer vision example, see "AI Powered Image Recognition in .NET with ML.NET and ONNX Runtime".
