Ingesting Streaming Data into BigQuery with Pub/Sub

Galaxy Glossary

How do I ingest streaming data into BigQuery using Pub/Sub?

Using Google Cloud Pub/Sub to deliver real-time messages that are transformed and written into BigQuery for low-latency analytics.


Description

Ingesting Streaming Data into BigQuery with Pub/Sub

Learn how to build a fault-tolerant, low-latency pipeline that streams events from Google Cloud Pub/Sub into BigQuery, using Dataflow, Storage Write API, and best practices for schema design, partitioning, and monitoring.

What Is Pub/Sub → BigQuery Streaming?

Google Cloud Pub/Sub is a fully managed messaging service that lets you capture millions of events per second from applications, IoT devices, or micro-services. BigQuery is Google’s serverless data warehouse that supports sub-second queries over petabytes of data. Connecting the two allows you to analyze data seconds after it is generated—unlocking use cases like real-time dashboards, alerting, and ML feature serving.

Why Use Pub/Sub Instead of Batch Loads?

  • Latency. Pub/Sub delivers events in < 100 ms, and BigQuery’s Storage Write API makes records queryable within seconds.
  • Simplicity. You avoid managing Kafka clusters or cron-driven batch jobs.
  • Exactly-once semantics. With Dataflow (Apache Beam) as the pipeline runner, you can achieve end-to-end exactly-once writes.
  • Elastic scale. Both Pub/Sub and BigQuery autoscale to millions of messages per second.

High-Level Architecture

  1. Producers publish JSON/Avro/Proto messages to a Pub/Sub topic.
  2. Subscriptions expose the topic’s backlog to consumers.
  3. Dataflow streaming job reads from the subscription, parses, enriches, and transforms records.
  4. BigQuery Storage Write API (recommended) or the legacy Streaming Insert API writes rows into a partitioned, clustered table.
  5. Downstream consumers query the table through SQL editors and BI tools such as Galaxy, Looker, or the BigQuery console.

Step-by-Step Implementation

1 Design Your BigQuery Table

Create a schema that mirrors the message payload. Partition the table on an event-timestamp column rather than on ingestion time (the _PARTITIONTIME/_PARTITIONDATE pseudo-columns) so that late-arriving data lands in the correct partition and can be rewritten cleanly.

CREATE TABLE `prod.events_clicks` (
  event_timestamp TIMESTAMP NOT NULL,
  user_id         STRING,
  url             STRING,
  referrer        STRING,
  user_agent      STRING,
  geo             STRUCT<country STRING, city STRING>
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id;

2 Publish Events to Pub/Sub

// Node.js example (assumes the @google-cloud/pubsub client and an Express-style req object)
const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub();

await pubsub.topic('events-clicks').publishMessage({
  json: {
    event_timestamp: new Date().toISOString(), // ISO 8601 parses cleanly into a TIMESTAMP column
    user_id: '123',
    url: '/pricing',
    referrer: '/home',
    user_agent: req.headers['user-agent'],
    geo: { country: 'US', city: 'Austin' }
  }
});

3 Deploy a Dataflow Streaming Job

Use either the Pub/Sub → BigQuery Storage Write API template or write custom Beam code for advanced transforms.

// Java (Beam) snippet. Assumes an Event POJO annotated with
// @DefaultSchema(JavaFieldSchema.class) so useBeamSchema() can infer the table schema.
Pipeline p = Pipeline.create(options);

p.apply("ReadPubSub", PubsubIO.readStrings().fromSubscription(options.getInputSub()))
 .apply("ParseJSON", MapElements.into(TypeDescriptor.of(Event.class))
     .via(json -> gson.fromJson(json, Event.class)))
 .apply("WriteBQ", BigQueryIO.<Event>write()
     .to("prod.events_clicks")
     .useBeamSchema()
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
     // Prefer the Storage Write API over legacy streaming inserts
     .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));

p.run();

For most teams the Pub/Sub to BigQuery Storage Write API template is sufficient and delivers higher throughput (up to 100 MB/s per table) with built-in exactly-once guarantees.

4 Secure the Pipeline

  • IAM. Grant roles/pubsub.subscriber to the Dataflow service account and roles/bigquery.dataEditor or roles/bigquery.dataOwner scoped to the target dataset (see the GRANT example after this list).
  • VPC-SC. If you have regulated data, wrap Pub/Sub, Dataflow, and BigQuery in the same security perimeter.
  • DLP. Use Dataflow’s Cloud DLP integration to de-identify PII on the fly.
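For the IAM bullet above, the BigQuery side of the grant can also be expressed in SQL using BigQuery’s DCL statements. The snippet below is a minimal sketch: the project name my-project and the service account dataflow-pipeline@my-project.iam.gserviceaccount.com are placeholders, and the roles/pubsub.subscriber grant still has to be made through IAM (Cloud Console, gcloud, or Terraform) because it sits outside BigQuery’s SQL surface.

-- Grant the Dataflow worker service account write access to the target dataset.
-- Project and service-account names are placeholders; substitute your own.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.prod`
TO "serviceAccount:dataflow-pipeline@my-project.iam.gserviceaccount.com";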

5 Monitor & Troubleshoot

  • Enable Dataflow job metrics such as system lag and backlog bytes.
  • Set up Cloud Monitoring alert policies when backlog exceeds thresholds.
  • Use BigQuery’s INFORMATION_SCHEMA streaming timeline views (for example, STREAMING_TIMELINE_BY_PROJECT) to audit streaming volume and errors, as in the query below.
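As a sketch of the last bullet, the query below summarizes streaming requests, rows, bytes, and errors per table over the past hour. It assumes the US multi-region and the column names of the STREAMING_TIMELINE_BY_PROJECT view; adjust the region qualifier and view variant to match your project.

-- Streaming requests, rows, bytes, and errors per table over the last hour
SELECT
  table_id,
  error_code,
  SUM(total_requests)    AS requests,
  SUM(total_rows)        AS rows_streamed,
  SUM(total_input_bytes) AS bytes_streamed
FROM `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
WHERE start_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY table_id, error_code
ORDER BY bytes_streamed DESC;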

Best Practices

  • Prefer Storage Write API. It removes per-row costs, supports higher throughput, and offers exactly-once semantics.
  • Batch rows. Publish messages in batches (e.g., 100–1000 records) to reduce Pub/Sub overhead.
  • Compress payloads. Gzip or use Protobuf/Avro to cut bandwidth and processing costs.
  • Use Pub/Sub ordering keys if event order matters; with the Storage Write API, order is preserved per stream via write offsets.
  • Schema evolution. Add new nullable fields instead of changing existing types, and enable schema update options in your Dataflow job (see the ALTER TABLE example after this list).
  • Dead-letter topics. Route malformed events to a separate topic using Beam’s TupleTags.
  • Cost control. Use capacity reservations for predictable BigQuery compute costs, and consider Pub/Sub Lite for steady, high-volume traffic.
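To make the schema-evolution bullet concrete, adding a nullable column is a metadata-only change that streaming writers can pick up without downtime. A minimal sketch against the table created earlier; experiment_id is a hypothetical new field.

-- Add a new nullable column; existing rows read as NULL and producers can
-- start populating the field once the pipeline's schema is refreshed.
ALTER TABLE `prod.events_clicks`
ADD COLUMN IF NOT EXISTS experiment_id STRING;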

Common Misconceptions

  1. “Streaming inserts are always expensive.” The legacy insertAll API charges by the volume of rows inserted, while the Storage Write API bills ingested bytes at a lower rate with a monthly free tier, bringing costs much closer to batch-load economics.
  2. “Data arrives instantly in BigQuery.” Expect 1–2 seconds typical, but up to minutes during schema updates or high contention.
  3. “I must write custom Beam code.” Google-provided Dataflow templates cover most use cases, including DLQ handling and retries.

Galaxy & Real-Time BigQuery

Once data is streaming into BigQuery, developers can point the Galaxy desktop SQL editor at the dataset. Galaxy’s AI Copilot autocompletes against the fresh schema, suggests partition filters to keep queries cheap, and lets teammates endorse canonical real-time queries in Collections. While Galaxy is not part of the ingestion path, it accelerates downstream analysis and collaboration on live streaming data.

Putting It All Together

The combination of Pub/Sub, Dataflow, and BigQuery Storage Write API offers a managed, scalable, and cost-effective pattern for real-time analytics. By following the best practices above—partitioning by event time, securing IAM roles, and monitoring lag—you can deliver production-grade pipelines with minutes of setup and minimal operations overhead.

Why Ingesting Streaming Data into BigQuery with Pub/Sub is important

Modern applications generate event data continuously—user clicks, IoT sensor readings, application logs—and stakeholders demand insights seconds later. Traditional batch ETL introduces minutes to hours of latency and operational toil. A Pub/Sub→BigQuery pipeline provides a serverless, autoscaling, and exactly-once path from event producers to an analytics-ready warehouse, enabling real-time dashboards, anomaly detection, and rapid iteration without managing infrastructure.

Ingesting Streaming Data into BigQuery with Pub/Sub Example Usage


SELECT user_id, COUNT(*) AS clicks_last_hour
FROM `prod.events_clicks`
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY user_id
ORDER BY clicks_last_hour DESC;

Frequently Asked Questions (FAQs)

Is Pub/Sub reliable enough for financial data?

Yes. Pub/Sub stores messages on disk in at least two zones, offers at-least-once delivery, and Dataflow with Storage Write API achieves exactly-once writes to BigQuery.

How much does BigQuery streaming cost?

With the Storage Write API you pay a low per-GiB ingestion rate (with a monthly free tier) plus normal storage and query processing, rather than per-row charges. Legacy Streaming Inserts cost $0.01 per 200 MB, so migrating typically cuts ingestion costs.

Can I replay historical data?

Pub/Sub retains unacknowledged messages for 7 days by default, and topic retention can be extended up to 31 days. To replay, seek the subscription back to an earlier timestamp or snapshot, or reload archived data from Cloud Storage (see the LOAD DATA example below).
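If raw events are also archived to Cloud Storage, a backfill into the same table can be written as a LOAD DATA statement. This is a sketch only: the bucket path and file format are placeholders for wherever your archive lives.

-- Backfill historical events from an archive bucket (URI is a placeholder).
LOAD DATA INTO `prod.events_clicks`
FROM FILES (
  format = 'AVRO',
  uris = ['gs://events-archive/clicks/2024-01-*.avro']
);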

How does Galaxy help after data lands in BigQuery?

Galaxy’s modern SQL editor lets teams query the live dataset, use AI Copilot to optimize partition filters, and share vetted real-time queries via Collections—no copy-pasting SQL in Slack.
