Configuring MLflow Tracking on AWS involves deploying an MLflow Tracking Server, selecting AWS services for backend storage (RDS/Aurora) and artifact storage (S3), securing access with IAM, and wiring your ML code to log experiments and models.
MLflow is the de-facto open-source platform for managing the end-to-end machine-learning (ML) lifecycle. One of its most powerful components is MLflow Tracking, which stores experiment metadata – parameters, metrics, artifacts and models. While you can run MLflow locally, production teams eventually need a scalable, persistent, and multi-user setup. That’s where AWS shines.
This guide walks through deploying a highly available MLflow Tracking stack on AWS using S3 for artifacts, Amazon RDS (or Aurora) for the metadata backend, and either an EC2 instance, an AWS Batch job, or Amazon Elastic Container Service (ECS) for the server process. You’ll learn the architecture, step-by-step deployment, security hardening, automation tips, common pitfalls, and a fully working Python example.
At minimum, MLflow Tracking requires a backend store (SQL database) and an artifact store (object storage). The recommended AWS reference architecture pairs Amazon S3 for artifacts with RDS/Aurora PostgreSQL for metadata, fronted by a containerized MLflow server behind a load balancer.
Create an S3 bucket (e.g. mlflow-artifacts-prod) with versioning and optional SSE-KMS encryption:

aws s3api create-bucket \
  --bucket mlflow-artifacts-prod \
  --region us-east-1
# Note: in us-east-1 the LocationConstraint must be omitted; in any other region add
#   --create-bucket-configuration LocationConstraint=<region>
# Enable versioning
aws s3api put-bucket-versioning \
  --bucket mlflow-artifacts-prod \
  --versioning-configuration Status=Enabled

# (Optional) Default encryption with KMS
aws s3api put-bucket-encryption \
  --bucket mlflow-artifacts-prod \
  --server-side-encryption-configuration '{
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
  }'
Next, provision the metadata backend: an Amazon RDS or Aurora PostgreSQL database. You can click through the AWS Console, or use CloudFormation or Terraform. Example Terraform snippet:
resource "aws_rds_cluster" "mlflow" {
engine = "aurora-postgresql"
engine_mode = "provisioned" # use serverless to auto-scale
master_username = var.db_user
master_password = var.db_password
database_name = "mlflow"
backup_retention_period = 7
vpc_security_group_ids = [aws_security_group.mlflow_db.id]
db_subnet_group_name = aws_db_subnet_group.private.name
storage_encrypted = true
# ...other prod settings omitted
}
# Dockerfile
FROM python:3.11-slim
# Pin exact versions in production so upgrades don't unexpectedly change the backend schema.
RUN pip install mlflow boto3 sqlalchemy psycopg2-binary
ENV MLFLOW_SERVER_PORT=5000
ENTRYPOINT ["mlflow", "server"]

# Build & push to ECR
aws ecr create-repository --repository-name mlflow-server
aws ecr get-login-password | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker build -t mlflow-server:latest .
docker tag mlflow-server:latest <account-id>.dkr.ecr.<region>.amazonaws.com/mlflow-server:latest
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/mlflow-server:latest
Register a Fargate task definition that launches the container with the backend store and artifact root configured:

{
  "family": "mlflow-server",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "mlflow-server",
      "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/mlflow-server:latest",
      "essential": true,
      "portMappings": [{"containerPort": 5000, "protocol": "tcp"}],
      "environment": [
        {"name": "AWS_REGION", "value": "us-east-1"},
        {"name": "MLFLOW_S3_ENDPOINT_URL", "value": "https://s3.amazonaws.com"}
      ],
      "command": [
        "--backend-store-uri", "postgresql+psycopg2://<user>:<password>@<cluster-endpoint>:5432/mlflow",
        "--default-artifact-root", "s3://mlflow-artifacts-prod",
        "--host", "0.0.0.0",
        "--port", "5000"
      ]
    }
  ],
  "taskRoleArn": "arn:aws:iam::<account-id>:role/mlflow-task-role",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048"
}
Create an ECS Service pointing to an ALB target group. Attach an ACM TLS certificate to the ALB so users hit https://mlflow.<your-domain>.com.
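If you prefer scripting the service creation, a minimal boto3 sketch might look like this (the cluster name, subnets, security group, and target-group ARN are placeholders to substitute with your own):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Placeholder identifiers: cluster, subnets, security group, and ALB target group.
ecs.create_service(
    cluster="ml-platform",
    serviceName="mlflow-server",
    taskDefinition="mlflow-server",  # family[:revision] registered above
    desiredCount=1,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-mlflow"],
            "assignPublicIp": "DISABLED",
        }
    },
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:<region>:<account-id>:targetgroup/mlflow/abc123",
        "containerName": "mlflow-server",
        "containerPort": 5000,
    }],
)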
The task role needs s3:PutObject, s3:GetObject, and s3:ListBucket on the mlflow-artifacts-prod bucket.
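For illustration, the corresponding least-privilege inline policy could be attached with boto3 roughly like this (the policy name is hypothetical; the role name matches the taskRoleArn above):

import json
import boto3

iam = boto3.client("iam")

# Least-privilege policy scoped to the artifact bucket only.
artifact_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": "arn:aws:s3:::mlflow-artifacts-prod/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::mlflow-artifacts-prod",
        },
    ],
}

iam.put_role_policy(
    RoleName="mlflow-task-role",          # role referenced by the task definition
    PolicyName="mlflow-artifact-access",  # illustrative policy name
    PolicyDocument=json.dumps(artifact_policy),
)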
With the server deployed, point your ML code at it and log a run:

import mlflow

mlflow.set_tracking_uri("https://mlflow.your-domain.com")
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name="xgboost-baseline"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    # ...train model...
    mlflow.log_metric("auc", 0.889)
    mlflow.sklearn.log_model(model, "model")

print("Run logged in AWS MLflow Tracking Server!")
Aurora Serverless v2 or Multi-AZ RDS provides high availability with minimal admin overhead.
Require TLS on the database connection (append sslmode=require to the backend-store URI).
Codify infrastructure to avoid drift and enable repeatability across stages (dev, staging, prod).
Pin MLflow versions in the Dockerfile so upgrades don’t break the metadata schema.
MLflow 2.x ships a basic-auth app (mlflow server --app-name basic-auth); for enterprise use, front the server with an OAuth2 reverse proxy (e.g., oauth2-proxy) integrated with your IdP.
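On the client side, if basic auth is enabled, credentials can be supplied through MLflow's MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment variables. A minimal sketch with placeholder credentials:

import os
import mlflow

# Placeholder credentials for a basic-auth-protected tracking server.
os.environ["MLFLOW_TRACKING_USERNAME"] = "ml-engineer"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "change-me"

mlflow.set_tracking_uri("https://mlflow.your-domain.com")

with mlflow.start_run(run_name="auth-smoke-test"):
    mlflow.log_param("authenticated", True)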
Problem: Developers log models locally (file:///) while the CI pipeline logs to S3, causing split experiment data.
Fix: Define MLFLOW_TRACKING_URI and MLFLOW_ARTIFACT_URI in a shared .env file or in CI environment variables so every environment points to the same server.
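A minimal sketch of that pattern, assuming MLFLOW_TRACKING_URI is exported in both developer shells and CI:

import os
import mlflow

# Fail fast if the shared tracking URI is missing instead of silently logging to ./mlruns.
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI")
if not tracking_uri:
    raise RuntimeError("MLFLOW_TRACKING_URI is not set; refusing to log locally.")

mlflow.set_tracking_uri(tracking_uri)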
Problem: The default db.t3.micro RDS instance chokes when hundreds of runs write metrics at high cadence.
Fix: Start with db.t3.small plus provisioned IOPS, or use Aurora Serverless v2 with an autoscaling capacity of 0.5–4 ACUs.
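For reference, an Aurora Serverless v2 capacity range of 0.5–4 ACUs could be applied with boto3 along these lines (the cluster identifier is a placeholder):

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Scale the Aurora Serverless v2 cluster between 0.5 and 4 ACUs.
rds.modify_db_cluster(
    DBClusterIdentifier="mlflow",  # placeholder cluster identifier
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,
        "MaxCapacity": 4.0,
    },
    ApplyImmediately=True,
)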
Problem: Granting s3:* on * to the task role increases the blast radius.
Fix: Restrict actions to the mlflow-artifacts-prod bucket and use a separate bucket for other workloads.
MLflow setup does not directly involve a SQL editor. However, data engineers frequently analyze experiment tables in the backend store using SQL. If you use PostgreSQL as the MLflow backend, Galaxy’s lightning-fast SQL editor and AI copilot can help you explore the experiments, runs, and metrics tables quickly.
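For example, a quick look at recent runs straight from the backend store could look like the following sketch (connection details are placeholders; experiments and runs are tables in MLflow's standard schema):

import psycopg2

# Placeholder connection details for the MLflow backend store.
conn = psycopg2.connect(
    host="<cluster-endpoint>",
    dbname="mlflow",
    user="<user>",
    password="<password>",
)

with conn, conn.cursor() as cur:
    # Most recent runs with their experiment names.
    cur.execute(
        """
        SELECT e.name AS experiment, r.run_uuid, r.status, r.start_time
        FROM runs r
        JOIN experiments e ON e.experiment_id = r.experiment_id
        ORDER BY r.start_time DESC
        LIMIT 20;
        """
    )
    for row in cur.fetchall():
        print(row)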
By combining S3, Aurora (or RDS), and containerized compute, AWS offers a robust foundation for MLflow Tracking. Add Terraform for IaC, IAM for granular permissions, and TLS everywhere for security, and you have a production-grade ML experiment management layer ready for any team size.
Without a centralized MLflow Tracking Server, experiments live on individual laptops, hurting reproducibility, collaboration and governance. Hosting MLflow on AWS provides durable, secure, and scalable storage for artifacts and metadata so teams can confidently reproduce results, compare models, audit changes, and promote models to production.
Do I have to run MLflow on Kubernetes? No. You can run the MLflow Tracking server on a single EC2 instance, ECS Fargate, AWS Batch, or EKS. ECS Fargate is often the sweet spot: fully managed, without forcing your team onto Kubernetes.
The server process is stateless, so you can run multiple replicas behind an ALB. Store state in RDS/Aurora and S3. Use ECS service autoscaling or a Kubernetes Horizontal Pod Autoscaler.
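As a sketch of the ECS route, service auto scaling can be configured with boto3 roughly like this (the cluster and service names are placeholders):

import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

resource_id = "service/ml-platform/mlflow-server"  # placeholder: service/<cluster>/<service>

# Allow the MLflow service to run between 1 and 4 replicas.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Add replicas when average CPU utilization rises above ~60%.
autoscaling.put_scaling_policy(
    PolicyName="mlflow-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)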
Can I analyze MLflow data with SQL? Yes. For artifact analysis, enable S3 data cataloging in Glue and then query with Athena. For backend metadata stored in PostgreSQL, you can connect with Galaxy or other SQL tools to analyze experiments.
Galaxy is not required, but if you analyze MLflow’s PostgreSQL backend with SQL, Galaxy provides a modern editor and AI copilot to accelerate query writing and collaboration.