Using Google Cloud Pub/Sub to deliver real-time messages that are transformed and written into BigQuery for low-latency analytics.
Ingesting Streaming Data into BigQuery with Pub/Sub
Learn how to build a fault-tolerant, low-latency pipeline that streams events from Google Cloud Pub/Sub into BigQuery, using Dataflow, Storage Write API, and best practices for schema design, partitioning, and monitoring.
Google Cloud Pub/Sub is a fully managed messaging service that lets you capture millions of events per second from applications, IoT devices, or micro-services. BigQuery is Google’s serverless data warehouse that supports sub-second queries over petabytes of data. Connecting the two allows you to analyze data seconds after it is generated—unlocking use cases like real-time dashboards, alerting, and ML feature serving.
Create a schema that mirrors the message payload. Partition on an event-timestamp column rather than on the ingestion-time pseudo-columns _PARTITIONTIME or _PARTITIONDATE, so that late-arriving events land in the correct partition.
CREATE TABLE `prod.events_clicks` (
  event_timestamp TIMESTAMP NOT NULL,
  user_id STRING,
  url STRING,
  referrer STRING,
  user_agent STRING,
  geo STRUCT<country STRING, city STRING>
) PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id;
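If you would rather manage the table from code than run DDL by hand, the following is a rough equivalent using the com.google.cloud.bigquery Java client library; the client-library approach is only an illustration, and the identifiers simply mirror the DDL above.

// Java example (sketch): create the same partitioned, clustered table programmatically
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

Schema schema = Schema.of(
    Field.newBuilder("event_timestamp", StandardSQLTypeName.TIMESTAMP)
        .setMode(Field.Mode.REQUIRED).build(),
    Field.of("user_id", StandardSQLTypeName.STRING),
    Field.of("url", StandardSQLTypeName.STRING),
    Field.of("referrer", StandardSQLTypeName.STRING),
    Field.of("user_agent", StandardSQLTypeName.STRING),
    Field.newBuilder("geo", StandardSQLTypeName.STRUCT,
        Field.of("country", StandardSQLTypeName.STRING),
        Field.of("city", StandardSQLTypeName.STRING)).build());

StandardTableDefinition definition = StandardTableDefinition.newBuilder()
    .setSchema(schema)
    .setTimePartitioning(TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
        .setField("event_timestamp").build())                    // partition on event time
    .setClustering(Clustering.newBuilder()
        .setFields(java.util.Arrays.asList("user_id")).build())  // cluster by user_id
    .build();

bigquery.create(TableInfo.of(TableId.of("prod", "events_clicks"), definition));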
// Node.js example: publish a click event to the events-clicks topic
const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub();

await pubsub.topic('events-clicks').publishMessage({
  json: {
    event_timestamp: new Date().toISOString(), // ISO 8601 parses cleanly into TIMESTAMP
    user_id: '123',
    url: '/pricing',
    referrer: '/home',
    user_agent: req.headers['user-agent'], // req comes from the surrounding HTTP handler
    geo: { country: 'US', city: 'Austin' }
  }
});
Use either the Pub/Sub → BigQuery Storage Write API template or write custom Beam code for advanced transforms.
// Java (Beam) snippet
// Event must be a Beam schema type, e.g. annotated with @DefaultSchema(JavaFieldSchema.class);
// gson is a shared com.google.gson.Gson instance
Pipeline p = Pipeline.create(options);
p.apply("ReadPubSub", PubsubIO.readStrings().fromSubscription(options.getInputSub()))
 .apply("ParseJSON", MapElements.into(TypeDescriptor.of(Event.class))
     .via(json -> gson.fromJson(json, Event.class)))
 .apply("WriteBQ", BigQueryIO.<Event>write()
     .to("prod.events_clicks")
     .useBeamSchema()
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
     .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));
p.run();
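If you take the custom Beam route, a common hardening step is to send messages that fail JSON parsing to a dead-letter topic instead of failing the pipeline. The sketch below uses Beam’s multi-output ParDo with TupleTags; the dead-letter topic name and the inline DoFn are illustrative assumptions, not part of the Google-provided template.

// Java (Beam) sketch: route unparseable payloads to a dead-letter topic
final TupleTag<Event> parsedTag = new TupleTag<Event>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = p
    .apply("ReadPubSub", PubsubIO.readStrings().fromSubscription(options.getInputSub()))
    .apply("ParseJSON", ParDo.of(new DoFn<String, Event>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          try {
            c.output(gson.fromJson(c.element(), Event.class));   // main output: parsed events
          } catch (JsonSyntaxException e) {
            c.output(deadLetterTag, c.element());                // side output: raw bad payloads
          }
        }
      }).withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));

results.get(parsedTag)
    .apply("WriteBQ", BigQueryIO.<Event>write()
        .to("prod.events_clicks")
        .useBeamSchema()
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API));

results.get(deadLetterTag)
    .apply("WriteDeadLetter", PubsubIO.writeStrings().to("projects/my-project/topics/events-clicks-dlq"));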
For most teams the Pub/Sub to BigQuery Storage Write API template is sufficient and delivers higher throughput (up to 100 MB/s per table) with built-in exactly-once guarantees.
Grant roles/pubsub.subscriber to the Dataflow service account, and roles/bigquery.dataEditor or roles/bigquery.dataOwner scoped to the target dataset.
Once data is streaming into BigQuery, developers can point the Galaxy desktop SQL editor at the dataset. Galaxy’s AI Copilot autocompletes against the fresh schema, suggests partition filters to keep queries cheap, and lets teammates endorse canonical real-time queries in Collections. While Galaxy is not part of the ingestion path, it accelerates downstream analysis and collaboration on live streaming data.
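To make the partition-filter point concrete, restricting queries to recent partitions of event_timestamp keeps scanned bytes, and therefore cost, low. A minimal sketch using the BigQuery Java client library, with an illustrative query that is not tied to Galaxy:

// Java example (sketch): aggregate only today's partition of the clicks table
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

String sql =
    "SELECT url, COUNT(*) AS clicks "
        + "FROM `prod.events_clicks` "
        + "WHERE event_timestamp >= TIMESTAMP(CURRENT_DATE()) "  // prunes all older partitions
        + "GROUP BY url ORDER BY clicks DESC LIMIT 10";

TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
result.iterateAll().forEach(row ->
    System.out.println(row.get("url").getStringValue() + ": " + row.get("clicks").getLongValue()));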
The combination of Pub/Sub, Dataflow, and BigQuery Storage Write API offers a managed, scalable, and cost-effective pattern for real-time analytics. By following the best practices above—partitioning by event time, securing IAM roles, and monitoring lag—you can deliver production-grade pipelines with minutes of setup and minimal operations overhead.
Modern applications generate event data continuously—user clicks, IoT sensor readings, application logs—and stakeholders demand insights seconds later. Traditional batch ETL introduces minutes to hours of latency and operational toil. A Pub/Sub→BigQuery pipeline provides a serverless, autoscaling, and exactly-once path from event producers to an analytics-ready warehouse, enabling real-time dashboards, anomaly detection, and rapid iteration without managing infrastructure.
Yes. Pub/Sub stores messages on disk in at least two zones, offers at-least-once delivery, and Dataflow with Storage Write API achieves exactly-once writes to BigQuery.
The Storage Write API bills ingested data at a lower per-GiB rate than legacy Streaming Inserts, which cost $0.01 per 200 MB (about $50 per TB), so migrating can cut ingestion costs; storage and query processing are billed as usual.
Pub/Sub retains unacknowledged messages for up to 7 days; enabling topic message retention lets you replay acknowledged messages for up to 31 days. Use seek to rewind a subscription to a snapshot or timestamp, or reload the data from a Cloud Storage backup.
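Seeks can also be issued programmatically. A rough Java sketch with the Pub/Sub admin client follows; the project and subscription names are placeholders, and replaying acknowledged messages requires message retention to be enabled on the topic or subscription.

// Java example (sketch): rewind a subscription by two hours to replay events
try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
  long replayFromMillis = System.currentTimeMillis() - Duration.ofHours(2).toMillis();
  SeekRequest request = SeekRequest.newBuilder()
      .setSubscription(ProjectSubscriptionName.of("my-project", "events-clicks-sub").toString())
      .setTime(Timestamp.newBuilder()
          .setSeconds(replayFromMillis / 1000)
          .setNanos((int) ((replayFromMillis % 1000) * 1_000_000))
          .build())
      .build();
  admin.seek(request);  // messages published after this time become deliverable again
}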
Galaxy’s modern SQL editor lets teams query the live dataset, use AI Copilot to optimize partition filters, and share vetted real-time queries via Collections—no copy-pasting SQL in Slack.