Azure Machine Learning
Service 53

Azure Machine Learning

MLPlatform

Azure Machine Learning is the platform for building, training, deploying, and operating custom models — notebooks, training compute, pipelines, a model registry, and managed inference endpoints. It is for teams that build models, as opposed to those who only call ready-made ones. If your problem needs a model trained on your data, this is where that happens.

The first AI decision on Azure is build versus call. Most applications do not need a custom model — a pre-trained AI API or an Azure OpenAI deployment solves the problem with no training. Reaching for Azure Machine Learning when a pre-built API would do is the most expensive mistake in this chapter; it is justified only when no ready-made model fits.

Workspace and Assets

The workspace is the top-level container for everything — data assets (the v2 name for datasets), environments, models, endpoints, and compute — with its own access control and lineage tracking. Assets are versioned, so a model is traceable to the data and code that produced it. This reproducibility is the difference between a model you can govern and a notebook that happened to work once.

Build vs Call — Choose the Least Effort That Fits
Azure ML
Train a custom model on your data. Most control, most effort.
Azure OpenAI
Call foundation models for generative and reasoning tasks.
Pre-trained APIs
Call ready-made vision, language, speech, document models. Least effort.
More control · buildLess effort · call
You own training, data, and the model lifecycle.Azure owns the model; you call an API.

Compute

Compute instances are managed dev boxes for notebooks; compute clusters are autoscaling pools for training and batch scoring that scale to zero when idle. Separating interactive development from elastic training compute is what keeps cost down — a cluster that scales to zero between jobs does not bill while nobody is training.

Training and Pipelines

Training runs execute on compute targets, tracked with metrics and artifacts for comparison. ML pipelines chain reusable, independently scaled steps — data prep, training, evaluation — into a repeatable workflow. Automated ML and a low-code designer lower the entry bar, while full code control remains for teams that need it.

Model Registry and Endpoints

Trained models are versioned in the registry, then deployed to managed online endpoints (real-time, autoscaling REST inference) or batch endpoints (scoring large datasets on a schedule). The endpoint abstracts the serving infrastructure, so deploying a new model version is a registry-and-endpoint operation, not a rebuild of hosting.

MLOps

Azure Machine Learning integrates with pipelines and Git for MLOps: versioned data and models, automated retraining and deployment, and monitoring for data drift in production. Treating models as governed, monitored, continuously deployed assets — rather than one-off artifacts — is what keeps a model accurate after it ships.

Azure Machine Learning vs pre-trained AI APIs vs Azure OpenAI

Azure Machine Learning — Build and train custom models on your data. Choose it only when no ready-made model fits the problem.

Pre-trained AI APIs — Call a managed model for vision, language, speech, or documents — no training. Choose it when a ready-made capability solves the problem.

Azure OpenAI — Call GPT and other foundation models for generative and reasoning tasks. Choose it for language generation, summarization, and RAG over your data.

Common Mistakes
  • Reaching for Azure Machine Learning to build a custom model when a pre-trained AI API or Azure OpenAI deployment would solve the problem with no training.
  • Leaving training compute clusters running between jobs instead of letting them scale to zero, paying for idle GPUs.
  • Skipping the model registry and asset versioning, so a production model cannot be traced to the data and code that made it.
  • Deploying a model with no drift monitoring, so accuracy quietly degrades as real-world data shifts.
  • Doing development on the training cluster instead of a compute instance, conflating interactive and elastic compute.
  • Treating a model as a one-off artifact with no MLOps pipeline, making retraining and redeployment a manual scramble.
Best Practices
  • Decide build-versus-call first; choose Azure Machine Learning only when no pre-built model fits.
  • Use compute instances for dev and autoscaling clusters (scaling to zero) for training and batch scoring.
  • Version data assets, models, and environments in the workspace for reproducibility and lineage.
  • Deploy through the model registry to managed online or batch endpoints rather than hand-built hosting.
  • Monitor deployed models for data drift and retrain through an MLOps pipeline.
  • Use automated ML or the designer to lower the entry bar where full code control is not needed.
Comparable servicesAWS SageMakerGCP Vertex AI

Knowledge Check

When is Azure Machine Learning the right choice over a pre-trained AI API?

  • When the problem needs a model trained on your own data and no ready-made model fits
  • Whenever any AI capability of any kind is needed at any point anywhere in the application stack
  • Only for text-generation tasks that produce natural language output
  • When you want to avoid managing any serving endpoints yourself

How do you keep training compute cost down in Azure Machine Learning?

  • Use autoscaling compute clusters that scale to zero between jobs, separate from dev compute instances
  • Run every single training job on one always-on compute instance that is left running and never shuts down
  • Disable the model registry to avoid its storage charges
  • Use batch endpoints to run your interactive notebook development

Why monitor a deployed model for data drift?

  • Real-world data shifts over time, quietly degrading accuracy unless drift is detected and the model retrained
  • Drift monitoring directly lowers the inference latency of the serving endpoint and speeds up every prediction it returns
  • It is a required step before you can register the model
  • It lowers what the serving endpoint costs to run

You got correct