A CloudWatch metric filter converts log patterns that indicate ETL failures into CloudWatch metrics you can alarm on.
Building a CloudWatch metric filter for failed ETL jobs means scanning the logs your pipeline already writes, extracting patterns that signal a failure, and turning those patterns into metrics. Those metrics can then trigger alarms that page your on-call engineer before data quality degrades or an SLA is missed.
ETL (Extract-Transform-Load) pipelines push data from source systems into analytics stores that power dashboards, machine-learning models, or even user-facing product features. A silent failure can leave dashboards serving stale data, skew KPIs, or breach an SLA before anyone notices.
Proactive detection via CloudWatch avoids these outcomes by combining logs you already have with the alerting features of Amazon CloudWatch Alarms.
You need IAM permissions to create the filter and alarm (for example, logs:PutMetricFilter and cloudwatch:PutMetricAlarm).
Scan historical logs and note the exact text that appears when a job fails. Glue, for example, writes entries that contain ERROR Glue Job failed. For Apache Spark, failures often include Exception or ExitCode = 1. The more specific and deterministic the pattern, the fewer false positives.
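A quick way to see how your failures are actually worded is to pull recent ERROR lines with the CLI before committing to a pattern. A minimal sketch, reusing the Glue log group from the examples below (adjust the group name and time window for your pipeline):

# List ERROR messages from the past week so you can copy the exact failure wording.
aws logs filter-log-events \
  --log-group-name "/aws-glue/jobs/output" \
  --filter-pattern '"ERROR"' \
  --start-time $(( ( $(date +%s) - 7*24*3600 ) * 1000 )) \
  --limit 50 \
  --query 'events[].message'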
CloudWatch Logs supports a small domain-specific language for filter patterns:
"ERROR Glue Job failed"
[status="FAILED*"]
?*
for variable textExample pattern for Glue failures:
"ERROR" "Glue Job" "failed"
This pattern matches any log line that contains all three quoted substrings, regardless of order.
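You can sanity-check a pattern before creating anything with test-metric-filter. A small sketch; the sample messages here are invented:

# Dry-run the pattern against made-up sample messages and inspect which ones match.
aws logs test-metric-filter \
  --filter-pattern '"ERROR" "Glue Job" "failed"' \
  --log-event-messages \
    "2024-05-01 12:00:01 ERROR Glue Job failed: exit code 1" \
    "2024-05-01 12:00:02 INFO Glue Job succeeded"

Once the pattern matches as expected, create the metric filter: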
aws logs put-metric-filter \
--log-group-name "/aws-glue/jobs/output" \
--filter-name "GlueFailedJobs" \
--filter-pattern '"ERROR" "Glue Job" "failed"' \
--metric-transformations metricName=FailedETLJob,metricNamespace=ETL,metricValue=1
The filter generates a metric named ETL/FailedETLJob and emits a value of 1 for every matching log event.
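To confirm the filter was registered, you can list the filters attached to the log group:

# Verify the metric filter exists and shows the expected pattern and transformation.
aws logs describe-metric-filters \
  --log-group-name "/aws-glue/jobs/output" \
  --filter-name-prefix "GlueFailedJobs"

With the filter in place, create an alarm on the new metric: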
aws cloudwatch put-metric-alarm \
--alarm-name "Failed ETL Jobs >= 1" \
--metric-name FailedETLJob \
--namespace ETL \
--statistic Sum \
--period 300 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:etl-oncall
The alarm fires if one or more failures occur within any five-minute window.
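You can check the alarm's configuration and current state from the CLI as well:

# Inspect the alarm's state (OK, ALARM, or INSUFFICIENT_DATA).
aws cloudwatch describe-alarms \
  --alarm-names "Failed ETL Jobs >= 1" \
  --query 'MetricAlarms[].{Name:AlarmName,State:StateValue}'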
To verify the setup, trigger a test failure and confirm that the ETL/FailedETLJob metric increments. After the evaluation period, the alarm enters the ALARM state and notifies your channel (email, PagerDuty, Slack, etc.).
Choose a pattern that can never appear in success logs. Prefix your own error strings with a marker like [FATAL_ETL] if your framework allows custom log lines.
Structured JSON logs let you filter on a field value rather than brittle free-text. Example pattern:
{ $.status = "FAILED" }
Add a dimension (metric label) per job so you can alarm on specific pipelines. With a JSON filter pattern, the dimension value references a field from the log event (here an assumed $.jobName field):
metricName=Failed,metricNamespace=ETL,metricValue=1,defaultValue=0,dimensions={JobName=$.jobName}
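As a sketch, assuming your jobs emit structured JSON lines with status and jobName fields (hypothetical field names and filter name), a per-job filter could look like this:

# Hypothetical structured log line: {"jobName": "orders_daily", "status": "FAILED"}
aws logs put-metric-filter \
  --log-group-name "/aws-glue/jobs/output" \
  --filter-name "FailedETLJobsByName" \
  --filter-pattern '{ $.status = "FAILED" }' \
  --metric-transformations 'metricName=Failed,metricNamespace=ETL,metricValue=1,dimensions={JobName=$.jobName}'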
Use --evaluation-periods greater than 1 or --datapoints-to-alarm to ignore transient or test failures, as in the sketch below.
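For example, the following alarm (the alarm name is a placeholder; the SNS topic is the one used earlier) only fires when failures appear in at least two of three consecutive five-minute windows:

aws cloudwatch put-metric-alarm \
  --alarm-name "Failed ETL Jobs - sustained" \
  --metric-name FailedETLJob \
  --namespace ETL \
  --statistic Sum \
  --period 300 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:etl-oncall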
Document how to simulate a failure and observe the alarm so new team members can verify monitoring after each code change.
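One way to simulate a failure without touching the pipeline is to write a matching log line by hand. A sketch using a throwaway log stream (the stream name is arbitrary):

# Write a fake failure line into the log group to exercise the filter and alarm.
STREAM="monitoring-test-$(date +%s)"
aws logs create-log-stream \
  --log-group-name "/aws-glue/jobs/output" \
  --log-stream-name "$STREAM"
aws logs put-log-events \
  --log-group-name "/aws-glue/jobs/output" \
  --log-stream-name "$STREAM" \
  --log-events "timestamp=$(( $(date +%s) * 1000 )),message=ERROR Glue Job failed (monitoring test)"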
Problem: The filter matches benign log lines containing the word ERROR, creating alert fatigue.
Solution: Narrow the pattern—add context words, use structured fields, or apply multiple conditions.
Problem: Slightly different error messages bypass the filter.
Solution: Use wildcards (*) or omit variable substrings that change between runs, such as IDs or timestamps.
Problem: Using the Average statistic instead of Sum dilutes single failures to below the threshold when they are averaged across minutes.
Solution: For discrete events, choose the Sum statistic so each failure counts.
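If you want to confirm what the alarm actually evaluates, query the metric with the Sum statistic. A sketch; the time window here uses GNU date syntax:

# Sum of failures per 5-minute bucket over the past hour.
aws cloudwatch get-metric-statistics \
  --namespace ETL \
  --metric-name FailedETLJob \
  --statistics Sum \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"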
The snippet below provisions the log group, metric filter, and alarm in CloudFormation. Deploy it in test before production.
Resources:
GlueFailedLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: "/aws-glue/jobs/output"
GlueFailedMetricFilter:
Type: AWS::Logs::MetricFilter
Properties:
FilterPattern: '"ERROR" "Glue Job" "failed"'
LogGroupName: !Ref GlueFailedLogGroup
MetricTransformations:
- MetricName: FailedETLJob
MetricNamespace: ETL
MetricValue: '1'
GlueFailedAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: "Failed ETL Jobs >= 1"
MetricName: FailedETLJob
Namespace: ETL
Statistic: Sum
Period: 300
EvaluationPeriods: 1
Threshold: 0
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- arn:aws:sns:us-east-1:123456789012:etl-oncall
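Assuming the template is saved as etl-monitoring.yaml (a placeholder filename, as is the stack name), one way to deploy it:

# Create or update the stack containing the log group, metric filter, and alarm.
aws cloudformation deploy \
  --template-file etl-monitoring.yaml \
  --stack-name etl-failure-monitoring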
Transforming log patterns into metrics is a low-latency, low-maintenance way to monitor ETL pipelines. It requires only IAM permissions and a one-time setup, but it pays dividends every time a dataset stays healthy because you caught a failed run early.
ETL jobs sit at the heart of analytics and operational reporting. A single unnoticed failure can skew KPIs, break data contracts, and erode stakeholder trust. Metric filters let you turn your existing logs into near real-time signals without modifying job code or paying for additional observability tooling, making them a cost-effective first line of defense.
What exactly is a metric filter?
It’s a rule that scans log events for a specific pattern and, when matched, emits a custom CloudWatch metric. You can then graph or alarm on that metric.
Can I track failures for individual jobs?
Yes. You can add a { $.jobName = * } condition in a JSON log or use multiple patterns. Alternatively, create one filter per log group if each job writes to its own group.
How much does this cost?
Metric filters incur standard CloudWatch Logs data-processing charges and custom metric charges (currently the first 10 metrics per account are free). For most teams, the total is only a few dollars per month.
Do I need to change my ETL job code?
No. If your tool already writes failure messages to CloudWatch Logs, you only add a metric filter and alarm. Code changes are optional but recommended if you want structured logs for cleaner patterns.