Load and save BigQuery tables directly from Apache Spark jobs using the Spark-BigQuery connector.
Move data between Google BigQuery and Spark to enrich analytics, offload processing, or build ML pipelines without complex ETL.
Use the open-source spark-bigquery connector (com.google.cloud.spark:spark-bigquery). It supports batch and streaming, push-down predicates, and write-disposition options.
Add --packages com.google.cloud.spark:spark-bigquery_2.12:0.36.1 (Scala 2.12) to spark-submit, or declare it as a dependency in your build file (e.g. build.sbt).
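If you manage the dependency with sbt, a minimal build.sbt sketch matching the coordinate above (build.sbt is itself Scala; pin whatever version you actually test against):

// Sketch of a build.sbt entry; %% resolves to spark-bigquery_2.12
// when scalaVersion is set to 2.12.x.
libraryDependencies += "com.google.cloud.spark" %% "spark-bigquery" % "0.36.1"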
Service Account JSON is simplest. Set GOOGLE_APPLICATION_CREDENTIALS or pass credentialsFile in the Spark options.
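A rough sketch of passing the key per read (the key path and table name are placeholders; alternatively, exporting GOOGLE_APPLICATION_CREDENTIALS on driver and executors needs no option at all):

// Sketch: point the connector at a service-account key file.
// "/path/to/key.json" and the table name are placeholder values.
val authedDF = spark.read.format("bigquery")
  .option("credentialsFile", "/path/to/key.json")
  .option("table", "shop.analytics.Orders")
  .load()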
Use spark.read.format("bigquery") with the fully-qualified table name project.dataset.table. Optionally filter columns or rows to minimize data scanned.
val ordersDF = spark.read.format("bigquery")
  .option("table", "shop.analytics.Orders")
  .load()
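To cut down what BigQuery scans, a hedged variant of the read above (the column names order_id, amount and order_date are illustrative, not taken from the real table definition):

// Sketch: read only the needed columns and push a row filter to BigQuery.
val recentOrdersDF = spark.read.format("bigquery")
  .option("table", "shop.analytics.Orders")
  .option("filter", "order_date >= '2024-01-01'") // applied server-side
  .load()
  .select("order_id", "amount")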
Call dataframe.write.format("bigquery") and set writeMethod (direct or indirect) plus writeDisposition (WRITE_TRUNCATE, WRITE_APPEND, WRITE_EMPTY).
dailyRevenueDF.write.format("bigquery")
  .option("table", "shop.analytics.daily_revenue")
  .option("writeDisposition", "WRITE_TRUNCATE")
  .save()
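For comparison, a sketch of the direct write path (direct uses the BigQuery Storage Write API and needs no staging bucket; append mode is assumed here):

// Sketch: direct write, appending rows via the Storage Write API.
dailyRevenueDF.write.format("bigquery")
  .option("table", "shop.analytics.daily_revenue")
  .option("writeMethod", "direct")
  .mode("append")
  .save()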
Yes. Use spark.conf.set("viewsEnabled", "true") and spark.conf.set("materializationDataset", "temp_ds"). Spark translates filters into BigQuery SQL, reducing data transfer.
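A minimal sketch, assuming a view shop.analytics.orders_view exists and the job may create temporary tables in temp_ds:

// Sketch: enable view reads; the view and column names are placeholders.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "temp_ds")

val viewDF = spark.read.format("bigquery")
  .option("table", "shop.analytics.orders_view")
  .load()
  .filter("amount > 100") // simple filters like this can be pushed down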
• Partition & cluster BigQuery tables to speed reads.
• Select only needed columns.
• Prefer writeMethod=direct for small-to-medium writes; use indirect for huge datasets (see the sketch after this list).
• Monitor Dataproc/Spark job metrics and BigQuery slot usage.
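A sketch of the indirect path for large writes (the staging bucket name is a placeholder you must own; data is staged in GCS, then loaded into BigQuery):

// Sketch: indirect write for very large outputs.
dailyRevenueDF.write.format("bigquery")
  .option("table", "shop.analytics.daily_revenue")
  .option("writeMethod", "indirect")
  .option("temporaryGcsBucket", "my-staging-bucket") // placeholder bucket
  .mode("append")
  .save()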
The connector itself is open-source and free. You pay for BigQuery storage/compute and any Dataproc or Spark cluster costs.
Yes. Pass the same --packages flag in your job configuration; Serverless automatically scales the Spark executors.
Push-down works for simple filters, projections, and limits. Complex UDFs or non-deterministic expressions disable push-down.
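As an illustration (the column names and the UDF are invented), a plain comparison pushes down, while an opaque Scala UDF forces Spark to filter after reading:

import org.apache.spark.sql.functions.{col, udf}

// Pushed down: a simple comparison becomes a BigQuery row filter.
val cheapOrders = ordersDF.filter(col("amount") < 10)

// Not pushed down: Spark cannot translate the UDF to SQL,
// so rows are read first and filtered in the executors.
val isFlaggedId = udf((id: String) => id.startsWith("X"))
val flaggedOrders = ordersDF.filter(isFlaggedId(col("order_id")))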