Streaming Snowplow events to BigQuery means continuously delivering the behavioral data captured by Snowplow into Google BigQuery using Pub/Sub and Dataflow, so analysts can query fresh event data in seconds.
Snowplow’s real-time pipeline makes it possible to collect rich, first-party behavioral data and land it inside Google BigQuery within seconds. By combining Pub/Sub for transport and a Dataflow job for transformation/loading, teams can replace brittle batch uploads with a scalable streaming architecture that keeps dashboards and models current.
Product managers, growth teams, and data scientists increasingly expect dashboards to reflect today’s traffic—not yesterday’s. Streaming events directly into BigQuery eliminates the 12–24 hour lag common with daily batch loads, enabling near-real-time A/B testing, alerting, and personalization.
Pub/Sub and Dataflow are fully managed. You never provision brokers or workers. When event volume spikes—Black Friday, viral campaigns—Google automatically scales the underlying infrastructure, so you stay ahead of user demand without engineering toil.
BigQuery’s ANSI-SQL interface lets analysts join Snowplow data with marketing spend, CRM snapshots, or application logs in seconds. With a modern SQL editor such as Galaxy, teams can share and endorse trusted queries and iterate even faster.
1. JavaScript, mobile, or server-side trackers send POST /i or /tp2 requests to the Snowplow Collector (often running on Cloud Run or Kubernetes).
2. The collector writes each raw event to a Google Pub/Sub topic (snowplow-raw). Payloads remain as base-64 encoded Thrift records.
3. A Dataflow Streaming Engine job subscribes to snowplow-raw, performs validation and enrichment (user-agent parsing, geo-IP lookup, PII pseudonymization), and outputs self-describing JSONs to a second topic (snowplow-enriched-good).
4. The same job (or a separate template) flattens the self-describing JSONs into BigQuery-compatible rows and inserts them into a sharded table such as events_YYYYMMDD, or better, a partitioned table clustered by event_name and device_platform.
gcloud pubsub topics create snowplow-raw
gcloud pubsub topics create snowplow-enriched-good
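The Dataflow command later in this guide also references a bad-rows topic (badRowsTopic). Assuming you want that dead-letter path in place from the start, create it now; the name below simply matches the later example:
gcloud pubsub topics create snowplow-enriched-bad   # dead-letter topic for failed/invalid events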
On Cloud Run:
gcloud run deploy sp-collector \
--image ghcr.io/snowplow/scala-stream-collector-pubsub:2.10.0 \
--set-env-vars GOOGLE_CLOUD_PROJECT=$PROJECT,STREAM=good,PORT=8080 \
--allow-unauthenticated
Point your trackers at the resulting HTTPS endpoint.
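Before wiring up real trackers, you can sanity-check the collector from a shell. This is only a sketch: the URL is a placeholder for whatever Cloud Run prints after the deploy, and the /health endpoint and the e=pv/p=web query parameters are assumptions based on the Scala Stream Collector and the Snowplow tracker protocol.
# Placeholder URL: substitute the endpoint printed by `gcloud run deploy`.
COLLECTOR=https://sp-collector-xxxxx-uc.a.run.app

# Liveness check against the collector's health endpoint.
curl -s -o /dev/null -w "health: %{http_code}\n" "$COLLECTOR/health"

# Minimal pixel hit (page view) using tracker-protocol query parameters.
curl -s -o /dev/null -w "pixel:  %{http_code}\n" "$COLLECTOR/i?e=pv&p=web&tv=curl-smoke-test"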
gcloud dataflow flex-template run sp-enrich-load-$(date +%s) \
--template-file-gcs-location gs://dataflow-templates/latest/flex/Snowplow_Streaming_BigQuery \
--region us-central1 \
--parameters \
inputTopic=projects/$PROJECT/topics/snowplow-raw,\
outputTopic=projects/$PROJECT/topics/snowplow-enriched-good,\
badRowsTopic=projects/$PROJECT/topics/snowplow-enriched-bad,\
bigQueryDataset=snowplow,\
bigQueryTable=events,\
useLegacySql=false
The template auto-creates the dataset and a partitioned table. Clustering can be added via the clusteringFields parameter.
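For example, to match the clustering suggested earlier you could pass the parameter like this (the exact value syntax is an assumption; because the value itself contains a comma, check the template's documentation for how it expects lists to be delimited or escaped):
# Added to the --parameters list of the flex-template command above.
# Note: list delimiting/escaping is template-specific; verify before relying on it.
clusteringFields=event_name,device_platform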
Open BigQuery and run:
SELECT COUNT(*)
FROM `my_project.snowplow.events`
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
You should see an ever-increasing row count.
Streaming inserts land in the correct partition automatically when you define the table as PARTITION BY _PARTITIONTIME. Clustering by high-cardinality columns like user_id or domain_userid keeps scans inexpensive.
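A minimal DDL sketch of such a destination table follows. The column list is a small illustrative subset, not Snowplow's full enriched-event schema, and in practice the loader normally manages the real schema for you:
-- Ingestion-time partitioned, clustered destination (illustrative columns only).
CREATE TABLE IF NOT EXISTS `my_project.snowplow.events` (
  event_name      STRING,
  user_id         STRING,
  domain_userid   STRING,
  device_platform STRING,
  page_urlpath    STRING,
  event_timestamp TIMESTAMP
)
PARTITION BY _PARTITIONDATE            -- DDL spelling of ingestion-time partitioning (_PARTITIONTIME at query time)
CLUSTER BY user_id, domain_userid;     -- keeps per-user scans cheap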
Snowplow’s self-describing event and context schemas live in an Iglu Registry. When you add a new context or bump a schema version, the loader can dynamically create the required nested column in BigQuery—no manual DDL.
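Once a new context lands, its repeated RECORD column can be queried with UNNEST. The column and field names below are hypothetical (Snowplow derives the column name from the schema's vendor, name, and version), so adjust them to whatever the loader actually created:
-- Hypothetical context column: contexts_com_acme_product_1
SELECT
  e.event_name,
  p.sku,     -- field assumed to exist in the com.acme/product schema
  p.price
FROM `my_project.snowplow.events` AS e,
     UNNEST(e.contexts_com_acme_product_1) AS p
WHERE e._PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);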
- Use EXPORT DATA to archive cold partitions to GCS.
- Route failed events to a dead_letter topic; reprocess them in batches rather than re-streaming to save Dataflow hours.
- Cloud Monitoring uptime checks plus a simple SQL probe (row counts by minute) catch ingestion stalls quickly (a sketch follows below). Combine with Pub/Sub subscription backlog alerts.
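A minimal version of that SQL probe, reusing the event_timestamp column that the funnel example later in this guide relies on:
-- Rows ingested per minute over the last hour; an empty or shrinking tail suggests a stall.
SELECT
  TIMESTAMP_TRUNC(event_timestamp, MINUTE) AS minute,
  COUNT(*) AS events
FROM `my_project.snowplow.events`
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND event_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY minute
ORDER BY minute DESC;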
Suppose you want to measure funnel completion in the last 30 minutes:
WITH page_views AS (
SELECT
user_id,
MIN(CASE WHEN page_urlpath = '/checkout' THEN event_timestamp END) AS started,
MIN(CASE WHEN page_urlpath = '/order-confirmation' THEN event_timestamp END) AS completed
FROM `my_project.snowplow.events`
WHERE event_name = 'page_view'
AND _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY user_id
)
SELECT
COUNTIF(completed IS NOT NULL) AS orders,
COUNT(*) AS checkouts,
SAFE_DIVIDE(COUNTIF(completed IS NOT NULL), COUNT(*)) AS conversion_rate
FROM page_views
WHERE started > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE);
With Galaxy’s AI copilot, you could draft this query with a single prompt like “give me conversion rate from checkout to order confirmation in the past 30 minutes,” then share it with your team via a Collection.
Why it’s wrong: Streaming into a non-partitioned table forces every query to scan the entire dataset, ballooning costs.
Fix: Pre-create the table as PARTITION BY _PARTITIONTIME, or let the template create a partitioned destination.
Why it’s wrong: Malformed events (bad rows) clog Pub/Sub subscriptions and silently drop data.
Fix: Supply badRowsTopic; periodically drain it to GCS and re-run the Snowplow Bad Rows Transformer.
Why it’s wrong: Without versioned schemas you risk breaking downstream queries when fields appear/disappear.
Fix: Host a private Iglu Registry, enforce Pull Request reviews for schema additions, and pin tracker versions.
Once events land in BigQuery, Galaxy acts as the IDE-grade SQL editor where engineers and analysts iterate on models, save endorsed queries, and rely on context-aware autocompletion drawn from the Snowplow dataset’s metadata. The AI copilot can even suggest partition filters automatically to keep queries bill-friendly.
Batch-loaded behavioral data often arrives too late for conversion optimization and alerting. Streaming Snowplow events into BigQuery provides sub-minute visibility, letting teams iterate on product features, detect incidents, and feed machine-learning models continuously—all without managing servers.
BigQuery streaming inserts cost $0.01 per 200 MB; Pub/Sub and Dataflow prices scale with throughput. Partitioning and clustering keep query costs low, and you can export cold partitions to storage.
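As a rough back-of-envelope, assuming an average enriched event of about 1 KB (an assumption, not a Snowplow figure): 100 million events per day is roughly 100 GB, or about 500 blocks of 200 MB, which works out to roughly $5 per day in streaming-insert charges before Pub/Sub and Dataflow costs.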
Yes. Galaxy connects to BigQuery and offers IDE-style autocomplete for Snowplow’s nested schema. The AI copilot can suggest filters, aggregations, and even convert legacy SQL to standard SQL.
Malformed events are routed to a dedicated bad-rows Pub/Sub topic (snowplow-enriched-bad in the example above). You can run Snowplow's Bad Rows Transformer to fix and replay them, or archive them for compliance audits.
Publish each new event or context schema to an Iglu Registry, bump the version, and update the tracker. The loader adds new columns automatically while preserving historical data.