Automating the recurring extraction of MySQL data and delivering it to an Amazon S3 bucket on a daily cadence.
A daily export pipeline copies data from a MySQL database—either full tables or incremental deltas—packages it (often as CSV, Parquet, or Avro), and uploads the files to Amazon S3. The workflow usually involves three parts: extraction (SQL query), transformation/packaging, and transfer to S3. Orchestration is handled by a scheduler such as cron, AWS Data Pipeline, AWS Glue, Airflow, or Dagster.
Use mysqldump to export full tables or databases as SQL or CSV, compress with gzip, and push via the AWS CLI. Simple but heavy; restore times can be long.
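A minimal sketch of this approach, assuming an InnoDB database; the host, credentials, database name, and bucket are placeholders:
# Nightly full dump of one database, compressed and streamed to S3.
# --single-transaction takes a consistent snapshot without locking InnoDB tables.
mysqldump --single-transaction -h db.internal -u export_user -p"$MYSQL_PW" shop \
  | gzip \
  | aws s3 cp - "s3://company-lake/backups/shop_$(date +%F).sql.gz"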
MySQL can also write query results directly to the local filesystem with SELECT ... INTO OUTFILE. Combine it with aws s3 cp or aws s3 sync to move the file. Requires file-system access on the database host.
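A sketch of that pattern, run on the database host itself; the schema, paths, and bucket are placeholders, and the output directory must be permitted by secure_file_priv:
# INTO OUTFILE writes on the MySQL server's filesystem (commonly /var/lib/mysql-files/).
OUT="/var/lib/mysql-files/orders_$(date +%F).csv"
mysql -u export_user -p"$MYSQL_PW" -e "
  SELECT * FROM shop.orders
  INTO OUTFILE '$OUT'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
  LINES TERMINATED BY '\n';"

# Ship the file to S3 and remove the local copy only if the upload succeeded.
aws s3 cp "$OUT" "s3://company-lake/orders/dt=$(date +%F)/" && rm "$OUT"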
AWS Database Migration Service (DMS) continually replicates changes from MySQL to S3, emitting partitioned Parquet or CSV. It’s fully managed, supports CDC, and scales automatically.
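A hedged sketch of the S3 side of a DMS setup (the identifiers, role ARN, and bucket below are placeholders); a source endpoint, a replication instance, and a replication task with table mappings are still required and not shown:
# Target endpoint telling DMS to land changes in S3 as Parquet.
aws dms create-endpoint \
  --endpoint-identifier mysql-to-s3-target \
  --endpoint-type target \
  --engine-name s3 \
  --s3-settings "ServiceAccessRoleArn=arn:aws:iam::123456789012:role/dms-s3-access,BucketName=company-lake,BucketFolder=orders,DataFormat=parquet"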
Modern data stacks favor a dedicated orchestrator. Airflow DAGs can run a MySQL extraction task, store results in temporary staging (e.g., local disk or EBS), then upload to S3. Glue Jobs/Workflows perform the same steps using PySpark or SQL.
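When the extraction logic lives in a Glue job, the daily schedule can reduce to a trigger plus a status check; the job name below is hypothetical:
# Kick off the Glue job that pulls from MySQL and writes Parquet to S3.
RUN_ID=$(aws glue start-job-run --job-name daily_mysql_orders_to_s3 --query JobRunId --output text)

# Poll the run state (Airflow's GlueJobOperator or a Glue trigger handles this for you).
aws glue get-job-run --job-name daily_mysql_orders_to_s3 --run-id "$RUN_ID" --query 'JobRun.JobRunState'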
Full loads are simpler but slow. Incremental loads rely on watermarks (timestamp, auto-increment id) or binary log replication. Change Data Capture (CDC) tools such as Debezium or DMS reduce operational toil.
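A sketch of a watermark-based incremental pull, assuming an indexed updated_at column and a small state file that records the last successful cutoff (all names and paths are placeholders):
# Assumes a composite index on (updated_at, id) so this filter avoids full scans.
# Run with set -euo pipefail so a failed extraction aborts before the watermark moves.
WATERMARK_FILE=/opt/exports/orders.watermark
LAST=$(cat "$WATERMARK_FILE" 2>/dev/null || echo '1970-01-01 00:00:00')
NOW=$(date -u +'%F %T')   # fixed upper bound; rows written during the run are picked up next time

mysql -u export_user -p"$MYSQL_PW" -h db.internal --batch --quick \
  -e "SELECT * FROM shop.orders WHERE updated_at > '$LAST' AND updated_at <= '$NOW'" \
  | gzip \
  | aws s3 cp - "s3://company-lake/orders/dt=$(date -u +%F)/orders_incremental.csv.gz" \
  && echo "$NOW" > "$WATERMARK_FILE"   # advance the watermark only after a successful upload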
CSV is ubiquitous; Parquet/ORC yield columnar compression and schema evolution. If you plan to query with Athena, Redshift Spectrum, or Spark, Parquet should be the default.
An S3 data lake should be organized by date partitions (/dt=2023-10-17/) to prune scans. For incremental loads, add higher-cardinality keys (customer_id, region) only if they improve performance.
Create an IAM user or instance role with s3:PutObject and s3:ListBucket on your bucket. Install the AWS CLI on the export host (for example, sudo yum install awscli -y on Amazon Linux). Then create the export script (export_mysql_s3.sh):
#!/usr/bin/env bash
set -euo pipefail

DATE=$(date +%F)
TMP_FILE="/tmp/orders_${DATE}.csv.gz"

# --batch prints tab-separated rows; --quick streams results instead of buffering them.
# MYSQL_PW comes from the environment; MYSQL_DB (the source schema) is assumed here as well.
mysql -u export_user -p"$MYSQL_PW" -h db.internal -D "$MYSQL_DB" \
  --batch --quick \
  -e "SELECT * FROM orders WHERE updated_at >= CURDATE()" \
  | gzip > "$TMP_FILE"

# Upload into a daily partition, then remove the local staging file.
aws s3 cp "$TMP_FILE" "s3://company-lake/orders/dt=${DATE}/"
rm "$TMP_FILE"
Schedule it with a crontab entry that runs at 02:00 each day and appends output to a log:
0 2 * * * /opt/exports/export_mysql_s3.sh >> /var/log/mysql_exports.log 2>&1
A few operational safeguards help: upload to a temporary key and aws s3 mv to the final key only after success, so downstream readers never see partial files; use --single-transaction with mysqldump on InnoDB to avoid locking; pull large tables in LIMIT/OFFSET batches to limit production impact; and enable S3 bucket versioning for rollback.
A common pitfall is filtering on updated_at without an index, which forces full table scans. Add a composite index on (updated_at, id).
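A sketch of the staging-then-promote upload mentioned above, reusing the bucket and variables from the example script (the _staging prefix is an assumption):
# Land the object under a staging prefix first...
aws s3 cp "$TMP_FILE" "s3://company-lake/_staging/orders_${DATE}.csv.gz"

# ...then promote it to the partitioned key only after the upload succeeded,
# so readers of the dt= partition never see a half-finished export.
aws s3 mv "s3://company-lake/_staging/orders_${DATE}.csv.gz" \
          "s3://company-lake/orders/dt=${DATE}/orders_${DATE}.csv.gz"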
If servers have mismatched time zones, CURDATE()-based filters yield inconsistent slices. Force SET time_zone = '+00:00' at session start.
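One way to pin the session, reusing the names from the example script, is to prepend the SET to the export query:
# Pin the session to UTC so CURDATE() and TIMESTAMP comparisons are stable
# no matter what the server's default time zone is.
mysql -u export_user -p"$MYSQL_PW" -h db.internal -D "$MYSQL_DB" --batch --quick \
  -e "SET time_zone = '+00:00'; SELECT * FROM orders WHERE updated_at >= CURDATE()" \
  | gzip > "$TMP_FILE"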
The AWS CLI uses multipart uploads for large objects, but when data is streamed to aws s3 cp over stdin it cannot know the total size up front, and very large streams can fail with too many parts. Pass --expected-size or split the files.
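A sketch of a streamed upload with the size hint set explicitly (the hosts, names, and size value are assumptions to adjust to your dump):
# Stream straight from mysqldump into S3. --expected-size (bytes) lets the CLI
# choose multipart part sizes that stay under S3's 10,000-part limit.
mysqldump --single-transaction -h db.internal -u export_user -p"$MYSQL_PW" shop orders \
  | gzip \
  | aws s3 cp - "s3://company-lake/orders/dt=$(date +%F)/orders_full.sql.gz" \
      --expected-size 10737418240   # ~10 GiB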
While Galaxy is not an orchestrator, its lightning-fast SQL editor is ideal for iterating on the SELECT statement that powers your export. Write, test, and optimize queries in Galaxy, then paste the trusted, endorsed SQL into your Airflow or Bash pipeline. Galaxy’s AI copilot can even suggest watermark predicates and detect missing indexes.
Many engineering teams need fresh production data in S3 for analytics, backups, and ML workflows. Automating a daily export removes manual toil, prevents ad-hoc dump scripts from impacting production, and unlocks cheap, scalable storage that downstream tools can query directly. A robust pipeline ensures data quality, security, and cost-efficiency while freeing up developer time for higher-value work.
Cron works for simple pipelines on a single host, but it lacks monitoring, retries, and dependency management. Use Airflow, AWS DMS, or Glue for mission-critical exports.
Can Galaxy help with this workflow? Yes. Galaxy’s SQL editor lets you write and test the SELECT statement against MySQL. Once validated, you can copy the query into your ETL script or Airflow task, knowing it’s endorsed by teammates.
Maintain an updated_at or modified_ts column, index it, and filter on the last successful run timestamp. Alternatively, enable MySQL binlog replication and use AWS DMS or Debezium for CDC.
Parquet with Snappy compression is recommended. It’s columnar, splittable, and supports embedded schema, leading to faster scans and lower costs versus CSV.