Ingesting Streaming Data into BigQuery with Pub/Sub

Galaxy Glossary

How do I ingest streaming data into BigQuery using Pub/Sub?

Using Google Cloud Pub/Sub to deliver real-time messages that are transformed and written into BigQuery for low-latency analytics.


Description

Ingesting Streaming Data into BigQuery with Pub/Sub

Learn how to build a fault-tolerant, low-latency pipeline that streams events from Google Cloud Pub/Sub into BigQuery, using Dataflow, Storage Write API, and best practices for schema design, partitioning, and monitoring.

What Is Pub/Sub → BigQuery Streaming?

Google Cloud Pub/Sub is a fully managed messaging service that lets you capture millions of events per second from applications, IoT devices, or micro-services. BigQuery is Google’s serverless data warehouse that supports sub-second queries over petabytes of data. Connecting the two allows you to analyze data seconds after it is generated—unlocking use cases like real-time dashboards, alerting, and ML feature serving.

Why Use Pub/Sub Instead of Batch Loads?

  • Latency. Pub/Sub delivers events in < 100 ms, and BigQuery’s Storage Write API makes records queryable within seconds.
  • Simplicity. You avoid managing Kafka clusters or cron-driven batch jobs.
  • Exactly-once semantics. With Dataflow (Apache Beam) as the pipeline runner, you can achieve end-to-end exactly-once writes.
  • Elastic scale. Both Pub/Sub and BigQuery autoscale to millions of messages per second.

High-Level Architecture

  1. Producers publish JSON/Avro/Proto messages to a Pub/Sub topic.
  2. Subscriptions expose the topic’s backlog to consumers.
  3. Dataflow streaming job reads from the subscription, parses, enriches, and transforms records.
  4. BigQuery Storage Write API (recommended) or the legacy Streaming Insert API writes rows into a partitioned, clustered table.
  5. Downstream consumers query the table through SQL editors and BI tools such as Galaxy, Looker, or the BigQuery console.

Step-by-Step Implementation

1 Design Your BigQuery Table

Create a schema that mirrors the message payload. Partition the table on an event-timestamp column rather than on ingestion time (the _PARTITIONTIME/_PARTITIONDATE pseudo-columns) so that late-arriving data lands in the correct partition and can be rewritten cleanly.

CREATE TABLE `prod.events_clicks` (
  event_timestamp TIMESTAMP NOT NULL,
  user_id         STRING,
  url             STRING,
  referrer        STRING,
  user_agent      STRING,
  geo             STRUCT<country STRING, city STRING>
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id;

2 Publish Events to Pub/Sub

// Node.js example (assumes the @google-cloud/pubsub client and an Express-style req object)
const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub();

await pubsub.topic('events-clicks').publishMessage({
  json: {
    event_timestamp: new Date().toISOString(), // ISO 8601 parses cleanly into a TIMESTAMP column
    user_id: '123',
    url: '/pricing',
    referrer: '/home',
    user_agent: req.headers['user-agent'],
    geo: { country: 'US', city: 'Austin' }
  }
});

3 Deploy a Dataflow Streaming Job

Use either the Pub/Sub → BigQuery Storage Write API template or write custom Beam code for advanced transforms.

// Java (Beam) snippet. Assumes an Event POJO annotated with
// @DefaultSchema(JavaFieldSchema.class) so useBeamSchema() can infer the table schema.
Pipeline p = Pipeline.create(options);

p.apply("ReadPubSub", PubsubIO.readStrings().fromSubscription(options.getInputSub()))
 .apply("ParseJSON", MapElements.into(TypeDescriptor.of(Event.class))
     .via(json -> gson.fromJson(json, Event.class)))
 .apply("WriteBQ", BigQueryIO.<Event>write()
     .to("prod.events_clicks")
     .useBeamSchema()
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
     // Prefer the Storage Write API over legacy streaming inserts
     .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));

p.run();

For most teams the Pub/Sub to BigQuery Storage Write API template is sufficient and delivers higher throughput (up to 100 MB/s per table) with built-in exactly-once guarantees.

4 Secure the Pipeline

  • IAM. Grant roles/pubsub.subscriber to the Dataflow service account and roles/bigquery.dataEditor or roles/bigquery.dataOwner scoped to the target dataset (see the GRANT example after this list).
  • VPC-SC. If you have regulated data, wrap Pub/Sub, Dataflow, and BigQuery in the same security perimeter.
  • DLP. Use Dataflow’s Cloud DLP integration to de-identify PII on the fly.
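For the IAM bullet above, the BigQuery side of the grant can also be expressed in SQL using BigQuery’s DCL statements. The snippet below is a minimal sketch: the project name my-project and the service account dataflow-pipeline@my-project.iam.gserviceaccount.com are placeholders, and the roles/pubsub.subscriber grant still has to be made through IAM (Cloud Console, gcloud, or Terraform) because it sits outside BigQuery’s SQL surface.

-- Grant the Dataflow worker service account write access to the target dataset.
-- Project and service-account names are placeholders; substitute your own.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `my-project.prod`
TO "serviceAccount:dataflow-pipeline@my-project.iam.gserviceaccount.com";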

5 Monitor & Troubleshoot

  • Enable Dataflow job metrics such as system lag and backlog bytes.
  • Set up Cloud Monitoring alert policies when backlog exceeds thresholds.
  • Use BigQuery’s INFORMATION_SCHEMA streaming timeline views (for example, STREAMING_TIMELINE_BY_PROJECT) to audit streaming volume and errors, as in the query below.
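As a sketch of the last bullet, the query below summarizes streaming requests, rows, bytes, and errors per table over the past hour. It assumes the US multi-region and the column names of the STREAMING_TIMELINE_BY_PROJECT view; adjust the region qualifier and view variant to match your project.

-- Streaming requests, rows, bytes, and errors per table over the last hour
SELECT
  table_id,
  error_code,
  SUM(total_requests)    AS requests,
  SUM(total_rows)        AS rows_streamed,
  SUM(total_input_bytes) AS bytes_streamed
FROM `region-us`.INFORMATION_SCHEMA.STREAMING_TIMELINE_BY_PROJECT
WHERE start_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY table_id, error_code
ORDER BY bytes_streamed DESC;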

Best Practices

  • Prefer Storage Write API. It removes per-row costs, supports higher throughput, and offers exactly-once semantics.
  • Batch rows. Publish messages in batches (e.g., 100–1000 records) to reduce Pub/Sub overhead.
  • Compress payloads. Gzip or use Protobuf/Avro to cut bandwidth and processing costs.
  • Use Pub/Sub ordering keys if event order matters; with the Storage Write API, order is preserved per stream via write offsets.
  • Schema evolution. Add new nullable fields instead of changing existing types, and enable schema update options in your Dataflow job (see the ALTER TABLE example after this list).
  • Dead-letter topics. Route malformed events to a separate topic using Beam’s TupleTags.
  • Cost control. Use capacity reservations for predictable BigQuery compute costs, and consider Pub/Sub Lite for steady, high-volume traffic.
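To make the schema-evolution bullet concrete, adding a nullable column is a metadata-only change that streaming writers can pick up without downtime. A minimal sketch against the table created earlier; experiment_id is a hypothetical new field.

-- Add a new nullable column; existing rows read as NULL and producers can
-- start populating the field once the pipeline's schema is refreshed.
ALTER TABLE `prod.events_clicks`
ADD COLUMN IF NOT EXISTS experiment_id STRING;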

Common Misconceptions

  1. “Streaming inserts are always expensive.” The legacy insertAll API charges by the volume of rows inserted, while the Storage Write API bills ingested bytes at a lower rate with a monthly free tier, bringing costs much closer to batch-load economics.
  2. “Data arrives instantly in BigQuery.” Expect 1–2 seconds typical, but up to minutes during schema updates or high contention.
  3. “I must write custom Beam code.” Google-provided Dataflow templates cover most use cases, including DLQ handling and retries.

Galaxy & Real-Time BigQuery

Once data is streaming into BigQuery, developers can point the Galaxy desktop SQL editor at the dataset. Galaxy’s AI Copilot autocompletes against the fresh schema, suggests partition filters to keep queries cheap, and lets teammates endorse canonical real-time queries in Collections. While Galaxy is not part of the ingestion path, it accelerates downstream analysis and collaboration on live streaming data.

Putting It All Together

The combination of Pub/Sub, Dataflow, and BigQuery Storage Write API offers a managed, scalable, and cost-effective pattern for real-time analytics. By following the best practices above—partitioning by event time, securing IAM roles, and monitoring lag—you can deliver production-grade pipelines with minutes of setup and minimal operations overhead.

Why Ingesting Streaming Data into BigQuery with Pub/Sub is important

Modern applications generate event data continuously—user clicks, IoT sensor readings, application logs—and stakeholders demand insights seconds later. Traditional batch ETL introduces minutes to hours of latency and operational toil. A Pub/Sub→BigQuery pipeline provides a serverless, autoscaling, and exactly-once path from event producers to an analytics-ready warehouse, enabling real-time dashboards, anomaly detection, and rapid iteration without managing infrastructure.

Ingesting Streaming Data into BigQuery with Pub/Sub Example Usage


SELECT user_id, COUNT(*) AS clicks_last_hour
FROM `prod.events_clicks`
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY user_id
ORDER BY clicks_last_hour DESC;

Frequently Asked Questions (FAQs)

Is Pub/Sub reliable enough for financial data?

Yes. Pub/Sub stores messages on disk in at least two zones, offers at-least-once delivery, and Dataflow with Storage Write API achieves exactly-once writes to BigQuery.

How much does BigQuery streaming cost?

With the Storage Write API you pay a low per-GiB ingestion rate (with a monthly free tier) plus normal storage and query processing, rather than per-row charges. Legacy Streaming Inserts cost $0.01 per 200 MB, so migrating typically cuts ingestion costs.

Can I replay historical data?

Pub/Sub retains unacknowledged messages for 7 days by default, and topic retention can be extended up to 31 days. To replay, seek the subscription back to an earlier timestamp or snapshot, or reload archived data from Cloud Storage (see the LOAD DATA example below).
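If raw events are also archived to Cloud Storage, a backfill into the same table can be written as a LOAD DATA statement. This is a sketch only: the bucket path and file format are placeholders for wherever your archive lives.

-- Backfill historical events from an archive bucket (URI is a placeholder).
LOAD DATA INTO `prod.events_clicks`
FROM FILES (
  format = 'AVRO',
  uris = ['gs://events-archive/clicks/2024-01-*.avro']
);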

How does Galaxy help after data lands in BigQuery?

Galaxy’s modern SQL editor lets teams query the live dataset, use AI Copilot to optimize partition filters, and share vetted real-time queries via Collections—no copy-pasting SQL in Slack.
