Setting Up MLflow Tracking on AWS

Galaxy Glossary

How do I set up MLflow tracking on AWS?

Configuring MLflow Tracking on AWS involves deploying an MLflow Tracking Server, selecting AWS services for backend storage (RDS/Aurora) and artifact storage (S3), securing access with IAM, and wiring your ML code to log experiments and models.

Description

MLflow is the de facto open-source platform for managing the end-to-end machine-learning (ML) lifecycle. One of its most powerful components is MLflow Tracking, which stores experiment metadata – parameters, metrics, artifacts, and models. While you can run MLflow locally, production teams eventually need a scalable, persistent, multi-user setup. That’s where AWS shines.

This guide walks through deploying a highly available MLflow Tracking stack on AWS using S3 for artifacts, Amazon RDS (or Aurora) for the metadata backend, and an EC2 instance, AWS Batch job, or Amazon Elastic Container Service (ECS) for the server process. You’ll learn the architecture, step-by-step deployment, security hardening, automation tips, common pitfalls, and a fully working Python example.

Why Run MLflow on AWS?

  • Unified Infrastructure – Keep data, compute, and ML metadata in the same cloud to minimize latency and egress costs.
  • Scalability & Durability – S3 provides “11 nines” durability for artifacts while Aurora/RDS handles millions of experiment rows.
  • Security & Compliance – AWS IAM, VPC, and KMS integrations help you meet SOC2, HIPAA, or GDPR requirements.
  • Cost-Effectiveness – Serverless Aurora and S3 Intelligent-Tiering keep the bill low for sporadic workloads.
  • Managed Ops – Offload backups, patching, and scaling to AWS services instead of maintaining on-prem databases.

End-to-End Architecture

At minimum, MLflow Tracking requires a backend store (a SQL database) and an artifact store (object storage). The recommended AWS reference architecture looks like this (a minimal wiring sketch follows the list):

  1. Artifact Store – Amazon S3 bucket (e.g., mlflow-artifacts-prod) with versioning and optional SSE-KMS encryption.
  2. Backend Store – Amazon Aurora Serverless v2 (PostgreSQL) or Amazon RDS PostgreSQL.
  3. Tracking Server – Containerized MLflow Tracking Server exposed through an AWS Application Load Balancer (ALB) in a private VPC subnet. Choices for compute:
    • EC2 instance (easiest to reason about).
    • AWS Fargate + ECS (fully serverless, pay-per-use).
    • EKS (Kubernetes) if you already run k8s in production.
  4. Authentication – AWS IAM Roles for Service Accounts (IRSA) or EC2 Instance Profile granting the server read/write to S3 and the database.
  5. Network Security – VPC with private subnets and security groups that restrict inbound traffic to the server to port 5000 (or 443 when TLS is used).
  6. TLS Termination – Configure ACM certificates on the ALB or run Nginx sidecar for HTTPS.
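
Before provisioning anything, it can help to see how these pieces connect. The sketch below is a local smoke test, not the production deployment: it assumes MLflow, boto3, and psycopg2-binary are installed on your machine and that the placeholder password and cluster endpoint are replaced with real values. It runs the same server wiring that the ECS task will run later.

# Local smoke test of the tracking-server wiring (runs in the foreground; stop with Ctrl+C).
# <password> and <cluster-endpoint> are placeholders for your own values.
import subprocess

subprocess.run(
    [
        "mlflow", "server",
        "--backend-store-uri",
        "postgresql+psycopg2://mlflow:<password>@<cluster-endpoint>:5432/mlflow",
        "--default-artifact-root", "s3://mlflow-artifacts-prod",
        "--host", "0.0.0.0",
        "--port", "5000",
    ],
    check=True,
)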

Step-by-Step Deployment Guide

1. Create the S3 Artifact Bucket

# Note: omit --create-bucket-configuration in us-east-1; for any other region,
# pass --create-bucket-configuration LocationConstraint=<region>
aws s3api create-bucket \
--bucket mlflow-artifacts-prod \
--region us-east-1

# Enable versioning
aws s3api put-bucket-versioning \
--bucket mlflow-artifacts-prod \
--versioning-configuration Status=Enabled

# (Optional) Default encryption with KMS
aws s3api put-bucket-encryption \
--bucket mlflow-artifacts-prod \
--server-side-encryption-configuration '{
"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
}'
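
Because the artifact bucket will hold model binaries and run artifacts, it is also worth blocking every form of public access. A short boto3 sketch, assuming the bucket name created above and default credentials/region configuration:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Block all public ACLs and bucket policies on the artifact bucket.
s3.put_public_access_block(
    Bucket="mlflow-artifacts-prod",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)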

2. Provision the Backend Store (Aurora PostgreSQL)

You can click through the AWS Console, CloudFormation, or Terraform. Example Terraform snippet:

resource "aws_rds_cluster" "mlflow" {
engine = "aurora-postgresql"
engine_mode = "provisioned" # use serverless to auto-scale
master_username = var.db_user
master_password = var.db_password
database_name = "mlflow"
backup_retention_period = 7
vpc_security_group_ids = [aws_security_group.mlflow_db.id]
db_subnet_group_name = aws_db_subnet_group.private.name
storage_encrypted = true
# ...other prod settings omitted
}
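
If you keep the master credentials in AWS Secrets Manager (a common pattern with RDS/Aurora), a bootstrap script can assemble the backend-store URI at startup instead of baking the password into the task definition. A sketch, assuming a hypothetical secret named mlflow/db whose JSON payload contains username and password keys:

import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

# Hypothetical secret "mlflow/db" storing {"username": "...", "password": "..."}.
secret = json.loads(secrets.get_secret_value(SecretId="mlflow/db")["SecretString"])

# <cluster-endpoint> is a placeholder for the Aurora/RDS writer endpoint.
backend_store_uri = (
    f"postgresql+psycopg2://{secret['username']}:{secret['password']}"
    "@<cluster-endpoint>:5432/mlflow"
)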

3. Build & Push the MLflow Tracking Docker Image

# Dockerfile
FROM python:3.11-slim
# Pin the MLflow version in production (see Best Practices below); boto3 and
# psycopg2-binary are needed for the S3 artifact store and the PostgreSQL backend.
RUN pip install --no-cache-dir mlflow boto3 psycopg2-binary
EXPOSE 5000
ENTRYPOINT ["mlflow", "server"]

# Build & push to ECR
aws ecr create-repository --repository-name mlflow-server
aws ecr get-login-password | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker build -t mlflow-server:latest .
docker tag mlflow-server:latest <account-id>.dkr.ecr.<region>.amazonaws.com/mlflow-server:latest
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/mlflow-server:latest

4. Deploy the Service on ECS Fargate

{
  "family": "mlflow-server",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "taskRoleArn": "arn:aws:iam::<account-id>:role/mlflow-task-role",
  "executionRoleArn": "arn:aws:iam::<account-id>:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "mlflow-server",
      "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/mlflow-server:latest",
      "essential": true,
      "portMappings": [{"containerPort": 5000, "protocol": "tcp"}],
      "environment": [
        {"name": "AWS_REGION", "value": "us-east-1"},
        {"name": "MLFLOW_S3_ENDPOINT_URL", "value": "https://s3.amazonaws.com"}
      ],
      "command": [
        "--backend-store-uri", "postgresql+psycopg2://<user>:<password>@<cluster-endpoint>:5432/mlflow",
        "--default-artifact-root", "s3://mlflow-artifacts-prod",
        "--host", "0.0.0.0",
        "--port", "5000"
      ]
    }
  ]
}

Register the task definition, then create an ECS service that places the task behind an ALB target group. Attach an ACM TLS certificate to the ALB listener so users reach https://mlflow.<your-domain>.com. (The executionRoleArn above is the standard ECS execution role that lets Fargate pull the image from ECR and ship container logs.)
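
Once the service is healthy behind the ALB, a quick way to verify the deployment from any machine is to hit the tracking server's health endpoint over HTTPS. A minimal sketch, assuming the hypothetical domain used throughout this guide:

import requests

# The MLflow tracking server answers GET /health when it is up.
resp = requests.get("https://mlflow.your-domain.com/health", timeout=10)
resp.raise_for_status()
print(resp.text)  # expect "OK"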

5. Wire IAM Permissions

The task role needs to:

  • s3:PutObject, s3:GetObject, s3:ListBucket on mlflow-artifacts-prod.
  • secretsmanager:GetSecretValue if you use AWS Secrets Manager for DB credentials (a least-privilege policy sketch follows).
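
A least-privilege inline policy along these lines can be attached to the task role. This boto3 sketch reuses the role, bucket, and secret names from earlier steps; all of them are placeholders for your own resource names.

import json
import boto3

iam = boto3.client("iam")

# Scope S3 access to the artifact bucket and secret access to the DB secret only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": "arn:aws:s3:::mlflow-artifacts-prod/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::mlflow-artifacts-prod",
        },
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:us-east-1:<account-id>:secret:mlflow/db-*",
        },
    ],
}

iam.put_role_policy(
    RoleName="mlflow-task-role",
    PolicyName="mlflow-least-privilege",
    PolicyDocument=json.dumps(policy),
)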

6. Log Experiments from Your Notebook or Pipeline

import mlflow

mlflow.set_tracking_uri("https://mlflow.your-domain.com")
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name="xgboost-baseline"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    # ...train model...
    mlflow.log_metric("auc", 0.889)
    mlflow.sklearn.log_model(model, "model")

print("Run logged in AWS MLflow Tracking Server!")

Best Practices

Use Serverless or Multi-AZ Databases

Aurora Serverless v2 or Multi-AZ RDS provides high availability with minimal administrative overhead.

Encrypt Everything

  • At Rest – KMS on S3 bucket and RDS encryption.
  • In Transit – TLS on both the ALB and the DB connection string (sslmode=require; see the URI sketch below).
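
For the database leg, TLS is enforced directly in the SQLAlchemy URI passed to --backend-store-uri. A sketch of what that looks like, with the same placeholders as earlier:

# Backend-store URI that requires TLS on the PostgreSQL connection.
backend_store_uri = (
    "postgresql+psycopg2://mlflow:<password>@<cluster-endpoint>:5432/mlflow"
    "?sslmode=require"
)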

Automate with Terraform or CDK

Codify infrastructure to avoid drift and enable repeatability across stages (dev, staging, prod).

Version Your Docker Images

Pin MLflow versions in the Dockerfile so upgrades don’t break metadata/schema.

Enable Access Control

MLflow 2.x ships experimental basic authentication that you enable with mlflow server --app-name basic-auth; for enterprise needs, front the server with an OAuth2 reverse proxy (e.g., oauth2-proxy) integrated with your IdP.
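
On the client side, MLflow reads HTTP basic-auth credentials from environment variables. A sketch, assuming the server has basic auth enabled; the username and password values are placeholders and should come from Secrets Manager or your IdP in practice:

import os
import mlflow

# MLflow's HTTP client sends these as basic-auth credentials on every request.
os.environ["MLFLOW_TRACKING_USERNAME"] = "ml-engineer"             # placeholder
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<from-secrets-manager>"  # placeholder

mlflow.set_tracking_uri("https://mlflow.your-domain.com")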

Common Mistakes & How to Fix Them

1. Mixing Local & Remote URIs

Problem: Developers log models locally (file:///) while the CI pipeline logs to S3, causing split experiment data.
Fix: Define MLFLOW_TRACKING_URI in a shared .env file or in CI environment variables so every environment points to the same server; the server's --default-artifact-root then determines where artifacts land.
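
One lightweight guard is to assert the tracking URI at the top of every training or CI job, so a misconfigured environment fails loudly instead of silently logging to a local file store. A sketch:

import os

import mlflow

# Fail fast if this job would log to a local file:// store instead of the shared server.
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "")
assert tracking_uri.startswith("https://"), (
    f"MLFLOW_TRACKING_URI must point at the shared tracking server, got: {tracking_uri!r}"
)
mlflow.set_tracking_uri(tracking_uri)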

2. Under-provisioned Database

Problem: The default db.t3.micro RDS chokes when hundreds of runs write metrics at high cadence.
Fix: Start with at least db.t3.small (with provisioned or gp3 IOPS if metric writes are heavy), or use Aurora Serverless v2 with an autoscaling capacity range of 0.5–4 ACUs.

3. Overly Broad IAM Policies

Problem: Granting s3:* on * to the task role increases blast radius.
Fix: Restrict actions to mlflow-artifacts-prod bucket and use a separate bucket for other workloads.

When Does Galaxy Matter?

MLflow setup does not directly involve a SQL editor. However, data engineers frequently analyze experiment tables in the backend store using SQL. If you use PostgreSQL as the MLflow backend, Galaxy’s lightning-fast SQL editor and AI copilot can help you:

  • Explore the experiments, runs, and metrics tables quickly.
  • Write parameterized queries to generate ad-hoc experiment reports.
  • Collaborate with team-mates by sharing endorsed SQL for model governance dashboards.

Putting It All Together

By combining S3, Aurora (or RDS), and containerized compute, AWS offers a robust foundation for MLflow Tracking. Add Terraform for IaC, IAM for granular permissions, and TLS everywhere for security, and you have a production-grade ML experiment management layer ready for any team size.

Why Setting Up MLflow Tracking on AWS is important

Without a centralized MLflow Tracking Server, experiments live on individual laptops, hurting reproducibility, collaboration and governance. Hosting MLflow on AWS provides durable, secure, and scalable storage for artifacts and metadata so teams can confidently reproduce results, compare models, audit changes, and promote models to production.

Setting Up MLflow Tracking on AWS Example Usage


mlflow.set_tracking_uri("https://mlflow.your-domain.com")

Frequently Asked Questions (FAQs)

Does MLflow Tracking require Kubernetes on AWS?

No. You can run the MLflow Tracking server on a single EC2 instance, ECS Fargate, AWS Batch, or EKS. ECS Fargate is often the sweet spot—fully managed without forcing your team onto Kubernetes.

How do I scale the MLflow Tracking server?

The server process is stateless, so you can run multiple replicas behind an ALB. Store state in RDS/Aurora and S3. Use ECS service autoscaling or a Kubernetes Horizontal Pod Autoscaler.

Can I use AWS Glue or Athena to query MLflow data?

Yes. For artifact analysis, enable S3 data cataloging in Glue and then query with Athena. For backend metadata stored in PostgreSQL, you can connect with Galaxy or other SQL tools to analyze experiments.

Is Galaxy required for MLflow?

Galaxy is not required, but if you analyze MLflow’s PostgreSQL backend with SQL, Galaxy provides a modern editor and AI copilot to accelerate query writing and collaboration.
