BigQuery integration with Airflow lets you orchestrate, schedule, and monitor BigQuery SQL jobs directly from Apache Airflow DAGs.
Running BigQuery tasks in Airflow centralizes scheduling, retries, and alerting while keeping SQL logic version-controlled alongside other data pipelines.
Create a service-account key with BigQuery permissions, add it as a Google Cloud connection in Airflow (Conn Id: google_cloud_default), and set the JSON key path in the connection's extra field.
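If you would rather manage the connection in code than in the UI, a minimal bootstrap sketch could look like the following; the key path and project are placeholders, and the exact extra key names depend on your apache-airflow-providers-google version.

```python
import json

from airflow import settings
from airflow.models import Connection

# One-off bootstrap script (hypothetical key path and project) that registers
# the connection in the Airflow metadata database instead of using the UI.
conn = Connection(
    conn_id="google_cloud_default",
    conn_type="google_cloud_platform",
    extra=json.dumps(
        {
            # Recent provider releases accept these short keys; older ones
            # expect the extra__google_cloud_platform__* prefixed variants.
            "key_path": "/opt/airflow/keys/bigquery-sa.json",
            "project": "my-gcp-project",
        }
    ),
)

session = settings.Session()
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
```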
Use BigQueryInsertJobOperator to submit a query or load job. Key arguments: task_id, configuration (the job JSON), location, and the optional gcp_conn_id.
Embed your SQL in the query field of the operator's job configuration JSON. Airflow handles retries, backfills, and templating for you.
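Assuming Airflow 2.4+ with the Google provider installed, a minimal sketch of a query task (project, dataset, and table names are placeholders) might look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="bq_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    aggregate_events = BigQueryInsertJobOperator(
        task_id="aggregate_events",
        gcp_conn_id="google_cloud_default",
        location="US",
        configuration={
            "query": {
                # Placeholder project, dataset, and table names.
                "query": (
                    "SELECT event_type, COUNT(*) AS n "
                    "FROM `my-gcp-project.analytics.events` "
                    "GROUP BY event_type"
                ),
                "useLegacySql": False,
            }
        },
    )
```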
Use BigQueryInsertJobOperator with a load configuration that points to the GCS URIs, schema fields, and destination table.
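A sketch of a GCS-to-BigQuery load job, with placeholder bucket, schema, and table names:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Placeholder bucket, dataset, and table names.
load_orders = BigQueryInsertJobOperator(
    task_id="load_orders",
    gcp_conn_id="google_cloud_default",
    location="US",
    configuration={
        "load": {
            "sourceUris": ["gs://my-bucket/exports/orders_*.csv"],
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,
            "destinationTable": {
                "projectId": "my-gcp-project",
                "datasetId": "sales",
                "tableId": "orders",
            },
            "schema": {
                "fields": [
                    {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
                    {"name": "amount", "type": "NUMERIC", "mode": "NULLABLE"},
                    {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
                ]
            },
            "writeDisposition": "WRITE_APPEND",
        }
    },
)
```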
Parameterize dataset and table names, apply write_disposition="WRITE_TRUNCATE" with care (it overwrites the destination table), store SQL in external files, and monitor job cost via the Cloud Billing export.
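For example, a hedged sketch that parameterizes the destination and sets the write disposition in the job JSON; the project, dataset, and table names are placeholders that could come from Airflow Variables or DAG params:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical placeholders, e.g. resolved from Airflow Variables or DAG params.
PROJECT = "my-gcp-project"
DATASET = "analytics"
TABLE = "daily_summary"

rebuild_summary = BigQueryInsertJobOperator(
    task_id="rebuild_summary",
    configuration={
        "query": {
            # Placeholder SQL.
            "query": f"SELECT * FROM `{PROJECT}.{DATASET}.events`",
            "useLegacySql": False,
            "destinationTable": {
                "projectId": PROJECT,
                "datasetId": DATASET,
                "tableId": TABLE,
            },
            # WRITE_TRUNCATE replaces the destination table's contents;
            # set it only when you really mean to overwrite.
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
)
```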
Missing credentials cause google.auth.exceptions.DefaultCredentialsError; verify the service-account key path. Long-running queries may hit your deployment's task timeout (commonly configured to one hour); extend it with execution_timeout.
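To give a slow query more headroom, raise the task's execution_timeout; the three-hour value and retry count below are arbitrary example settings:

```python
from datetime import timedelta

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

long_running_query = BigQueryInsertJobOperator(
    task_id="long_running_query",
    execution_timeout=timedelta(hours=3),  # raise the per-task timeout
    retries=2,  # hypothetical retry policy
    configuration={
        "query": {
            "query": "SELECT COUNT(*) FROM `my-gcp-project.analytics.events`",  # placeholder SQL
            "useLegacySql": False,
        }
    },
)
```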
For most use cases, BigQueryInsertJobOperator is sufficient. The hook is only required for advanced interactions such as polling job status outside an operator.
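If you do need the hook, a sketch of checking a job's state from a TaskFlow task might look like this; the project, location, and job ID are placeholders, and it assumes a recent apache-airflow-providers-google release where BigQueryHook.get_client is available:

```python
from airflow.decorators import task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

@task
def report_job_state(job_id: str) -> str:
    # get_client() returns a google.cloud.bigquery.Client authenticated
    # through the Airflow connection.
    hook = BigQueryHook(gcp_conn_id="google_cloud_default")
    client = hook.get_client(project_id="my-gcp-project", location="US")
    job = client.get_job(job_id)  # job_id supplied at runtime (placeholder)
    return job.state  # e.g. "PENDING", "RUNNING", or "DONE"
```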
Airflow can run multiple BigQuery tasks concurrently, as long as you respect your GCP project's job quota and Airflow's max_active_tasks limit.
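For instance, two independent query tasks in one DAG can run side by side up to the max_active_tasks cap; this sketch uses Airflow 2.4+ style DAG arguments, and the table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="bq_parallel_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_tasks=4,  # cap on concurrently running tasks for this DAG
) as dag:
    # Two independent query jobs with no dependency between them,
    # so the scheduler is free to run them at the same time.
    for name in ("orders", "sessions"):
        BigQueryInsertJobOperator(
            task_id=f"refresh_{name}",
            configuration={
                "query": {
                    "query": f"SELECT COUNT(*) FROM `my-gcp-project.analytics.{name}`",
                    "useLegacySql": False,
                }
            },
        )
```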
Embed Jinja templates like {{ ds }} inside the query string. Airflow renders them at runtime, allowing date partitions and dynamic datasets.
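A brief sketch of a templated query (the table name is a placeholder); because configuration is a templated field of BigQueryInsertJobOperator, {{ ds }} is rendered before the job is submitted:

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

daily_partition_load = BigQueryInsertJobOperator(
    task_id="daily_partition_load",
    configuration={
        "query": {
            # {{ ds }} renders to the run's logical date, e.g. 2024-01-01.
            "query": (
                "SELECT * FROM `my-gcp-project.analytics.events` "
                "WHERE DATE(event_ts) = '{{ ds }}'"
            ),
            "useLegacySql": False,
        }
    },
)
```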