Creating CloudWatch Metric Filters for Failed ETL Jobs

Galaxy Glossary

How do I create a CloudWatch metric filter that alerts on failed ETL job runs?

A CloudWatch metric filter converts log patterns that indicate ETL failures into CloudWatch metrics you can alarm on.


Description

Overview

Building a CloudWatch metric filter for failed ETL jobs means scanning the logs your pipeline already writes, extracting patterns that signal a failure, and turning those patterns into metrics. Those metrics can then trigger alarms that page your on-call engineer before data quality degrades or an SLA is missed.

Why Detecting Failed ETL Jobs Matters

ETL (Extract-Transform-Load) pipelines push data from source systems into analytics stores that power dashboards, machine-learning models, or even user-facing product features. A silent failure can:

  • Break downstream reports and KPIs.
  • Corrupt dimensional models.
  • Delay data availability for customers.
  • Violate regulatory or contractual SLAs.

Proactive detection via CloudWatch avoids these outcomes by combining logs you already have with the alerting features of Amazon CloudWatch Alarms.

Prerequisites

  • Your ETL tool—AWS Glue, EMR, Airflow, dbt, or custom code—must write logs to Amazon CloudWatch Logs.
  • You have IAM permissions to create metric filters and alarms (logs:PutMetricFilter, cloudwatch:PutMetricAlarm).

Step-by-Step Implementation

1. Identify Failure Signatures

Scan historical logs and note the exact text that appears when a job fails. Glue, for example, writes entries that contain ERROR Glue Job failed. For Apache Spark, failures often include Exception or ExitCode = 1. The more specific and deterministic the pattern, the fewer false positives you will see.
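
One quick way to scan history is to filter the log group for a generic term and review the matches by hand. The log group name below is the Glue output group used later in this guide; substitute your own, and adjust the limit as needed.

aws logs filter-log-events \
--log-group-name "/aws-glue/jobs/output" \
--filter-pattern "ERROR" \
--limit 50

From the returned messages, pick the most specific, stable substring to use as your failure signature.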

2. Craft the Filter Pattern

CloudWatch Logs supports a small domain-specific language for filter patterns:

  • Literal terms and phrases: "ERROR Glue Job failed"
  • Space-delimited patterns: [status=FAILED*], which match fields by position
  • JSON patterns for structured logs: { $.status = "FAILED" }
  • Wildcards: a trailing * matches variable text, as in FAILED*

Example pattern for Glue failures:

"ERROR" "Glue Job" "failed"

This pattern matches any log line that contains all three quoted substrings, regardless of order.
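
Before wiring the pattern into a filter, you can dry-run it with test-metric-filter. The two sample messages below are made up for illustration; paste real lines from your own logs instead.

aws logs test-metric-filter \
--filter-pattern '"ERROR" "Glue Job" "failed"' \
--log-event-messages \
"2024-05-01T02:10:00 ERROR Glue Job orders_daily failed with exit code 1" \
"2024-05-01T02:10:00 INFO Glue Job orders_daily succeeded"

Only the first message should appear in the matches array of the response.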

3. Create the Metric Filter

aws logs put-metric-filter \
--log-group-name "/aws-glue/jobs/output" \
--filter-name "GlueFailedJobs" \
--filter-pattern '"ERROR" "Glue Job" "failed"' \
--metric-transformations metricName=FailedETLJob,metricNamespace=ETL,metricValue=1

The filter generates a metric named ETL/FailedETLJob and emits a value of 1 for every matching log event.
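
To confirm the filter was registered against the right log group, list the metric filters on that group:

aws logs describe-metric-filters \
--log-group-name "/aws-glue/jobs/output" \
--filter-name-prefix "GlueFailedJobs"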

4. Attach an Alarm

aws cloudwatch put-metric-alarm \
--alarm-name "Failed ETL Jobs >= 1" \
--metric-name FailedETLJob \
--namespace ETL \
--statistic Sum \
--period 300 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:etl-oncall

The alarm fires if one or more failures occur within any five-minute window.
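
You can confirm the alarm exists and inspect its state with describe-alarms. Because the metric filter only emits datapoints when a failure matches, the alarm may sit in INSUFFICIENT_DATA during quiet periods; adding --treat-missing-data notBreaching to the put-metric-alarm call above is one common way to keep it in OK instead.

aws cloudwatch describe-alarms \
--alarm-names "Failed ETL Jobs >= 1"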

5. Validate End-to-End

  1. Manually trigger a failing run in a dev or staging environment.
  2. Confirm the log entry appears in CloudWatch Logs.
  3. Check that the ETL/FailedETLJob metric increments.
  4. Verify the alarm moves to ALARM state and notifies your channel (email, PagerDuty, Slack, etc.); the command shown below can force this transition for a dry run.
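
If you want to exercise the notification path without waiting for a real failure, you can temporarily force the alarm into the ALARM state; it returns to its computed state at the next evaluation.

aws cloudwatch set-alarm-state \
--alarm-name "Failed ETL Jobs >= 1" \
--state-value ALARM \
--state-reason "Testing ETL failure notification path"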

Best Practices

Use Unique Error Tokens

Choose a pattern that cannot appear in success logs. If your framework allows custom log lines, prefix your own error strings with a unique token such as [FATAL_ETL].
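
A minimal sketch of a filter keyed on such a custom token, assuming your jobs emit a line containing [FATAL_ETL] on failure (the filter and metric names here are illustrative):

aws logs put-metric-filter \
--log-group-name "/aws-glue/jobs/output" \
--filter-name "FatalEtlToken" \
--filter-pattern '"[FATAL_ETL]"' \
--metric-transformations metricName=FatalEtlEvents,metricNamespace=ETL,metricValue=1

The quotes around the token keep the brackets treated as literal text rather than as a space-delimited pattern.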

Emit Structured Logs

Structured JSON logs let you filter on a field value rather than brittle free-text. Example pattern:

{ $.status = "FAILED" }
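
For example, if a job writes an event like {"jobName": "orders_daily", "status": "FAILED", "error": "S3 access denied"} (the field names and log group below are assumptions, not tied to any specific tool), the filter could be:

aws logs put-metric-filter \
--log-group-name "/etl/jobs" \
--filter-name "EtlFailedJson" \
--filter-pattern '{ $.status = "FAILED" }' \
--metric-transformations metricName=FailedETLJob,metricNamespace=ETL,metricValue=1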

Aggregate Failures by Job Name

Add a dimension (metric label) per job so you can alarm on specific pipelines. The dimension value must reference a field your filter pattern extracts, so this requires a JSON or space-delimited pattern:

metricName=Failed,metricNamespace=ETL,metricValue=1,defaultValue=0,dimensions={JobName=$.jobName}
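
A sketch of the full command, assuming JSON logs that carry a jobName field (the log group and names are illustrative):

aws logs put-metric-filter \
--log-group-name "/etl/jobs" \
--filter-name "EtlFailedByJob" \
--filter-pattern '{ $.status = "FAILED" }' \
--metric-transformations 'metricName=Failed,metricNamespace=ETL,metricValue=1,dimensions={JobName=$.jobName}'

Each distinct JobName value becomes a metric you can alarm on individually. Keep dimension cardinality low, since every unique value is billed as a separate custom metric.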

Suppress Flapping

Set --evaluation-periods greater than 1, or use --datapoints-to-alarm, so a single transient or test failure does not page anyone.
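
For example, a variant of the alarm above that only fires when at least 2 of the last 3 five-minute periods contain a failure (the alarm name is illustrative):

aws cloudwatch put-metric-alarm \
--alarm-name "Failed ETL Jobs - sustained" \
--metric-name FailedETLJob \
--namespace ETL \
--statistic Sum \
--period 300 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:etl-oncall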

Create a Testing Playbook

Document how to simulate a failure and observe the alarm so new team members can verify monitoring after each code change.

Common Mistakes and How to Fix Them

Pattern Too Broad

Problem: The filter matches benign log lines containing the word ERROR, creating alert fatigue.
Solution: Narrow the pattern—add context words, use structured fields, or apply multiple conditions.

Pattern Too Narrow

Problem: Slightly different error messages bypass the filter.
Solution: Use wildcards (*) or omit variable substrings that can change between runs, such as IDs or timestamps.

Alarm on Average Instead of Sum

Problem: Averaging across minutes dilutes single failures to below threshold.
Solution: For discrete events, choose the Sum statistic so each failure counts.

Full Working Example

The snippet below provisions the log group, metric filter, and alarm in CloudFormation. Deploy it in test before production.

Resources:
  GlueFailedLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: "/aws-glue/jobs/output"

  GlueFailedMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterPattern: '"ERROR" "Glue Job" "failed"'
      LogGroupName: !Ref GlueFailedLogGroup
      MetricTransformations:
        - MetricName: FailedETLJob
          MetricNamespace: ETL
          MetricValue: '1'

  GlueFailedAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "Failed ETL Jobs >= 1"
      MetricName: FailedETLJob
      Namespace: ETL
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - arn:aws:sns:us-east-1:123456789012:etl-oncall

Troubleshooting Guide

  • No Metric Data: Confirm the log group name in the filter exactly matches the actual log group.
  • Alarm Never Fires: Inspect the metric in CloudWatch Metrics; if there are no datapoints, your pattern is likely wrong (the commands after this list can help).
  • False Alarms: Click into the alarm’s History tab and correlate timestamps with logs; refine pattern.
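
Two quick checks that cover the first two cases:

aws logs describe-metric-filters \
--log-group-name "/aws-glue/jobs/output"

aws cloudwatch list-metrics \
--namespace ETL

The first confirms the filter is attached to the log group you expect; the second shows whether the ETL namespace has received any metric data at all.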

Conclusion

Transforming log patterns into metrics is a low-latency, low-maintenance way to monitor ETL pipelines. It requires only IAM permissions and a one-time setup, but it pays dividends every time a dataset stays healthy because you caught a failed run early.

Why Creating CloudWatch Metric Filters for Failed ETL Jobs is important

ETL jobs sit at the heart of analytics and operational reporting. A single unnoticed failure can skew KPIs, break data contracts, and erode stakeholder trust. Metric filters let you turn your existing logs into near real-time signals without modifying job code or paying for additional observability tooling, making them a cost-effective first line of defense.

Creating CloudWatch Metric Filters for Failed ETL Jobs Example Usage


aws logs filter-log-events --log-group-name "/aws-glue/jobs/output" --filter-pattern '"ERROR" "Glue Job" "failed"'

Frequently Asked Questions (FAQs)

What is a CloudWatch metric filter?

It’s a rule that scans log events for a specific pattern and, when matched, emits a custom CloudWatch metric. You can then graph or alarm on that metric.

Can I monitor multiple ETL jobs with one filter?

Yes. You can add a { $.jobName = * } condition in a JSON log or use multiple patterns. Alternatively, create one filter per log group if each job writes to its own group.

How much does it cost?

Metric filters incur standard CloudWatch Logs data-processing charges and custom metric charges (currently the first 10 metrics per account are free). For most teams, the total is only a few dollars per month.

Do I need to modify my ETL code?

No. If your tool already writes failure messages to CloudWatch Logs, you only add a metric filter and alarm. Code changes are optional but recommended if you want structured logs for cleaner patterns.
