Streaming Snowplow Events into BigQuery

Galaxy Glossary

How do I stream Snowplow events into BigQuery?

The continuous delivery of behavioral data captured by Snowplow into Google BigQuery using Pub/Sub and Dataflow so analysts can query fresh event data in seconds.

Description

Overview

Snowplow’s real-time pipeline makes it possible to collect rich, first-party behavioral data and land it inside Google BigQuery within seconds. By combining Pub/Sub for transport and a Dataflow job for transformation/loading, teams can replace brittle batch uploads with a scalable streaming architecture that keeps dashboards and models current.

Why Stream Snowplow into BigQuery?

Sub-second Insights

Product managers, growth teams, and data scientists increasingly expect dashboards to reflect today’s traffic—not yesterday’s. Streaming events directly into BigQuery eliminates the 12–24 hour lag common with daily batch loads, enabling near-real-time A/B testing, alerting, and personalization.

Serverless Scalability

Pub/Sub and Dataflow are fully managed. You never provision brokers or workers. When event volume spikes—Black Friday, viral campaigns—Google automatically scales the underlying infrastructure, so you stay ahead of user demand without engineering toil.

Unified Query Surface

BigQuery’s ANSI-SQL interface lets analysts join Snowplow data with marketing spend, CRM snapshots, or application logs in seconds. With a modern SQL editor such as Galaxy, teams can share and endorse trusted queries and iterate even faster.
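
For instance, a minimal sketch of such a join, assuming a hypothetical my_project.marketing.spend table keyed by date and campaign, plus the Snowplow column names used elsewhere on this page:

-- Daily visitors vs. marketing spend per campaign (sketch; the spend table is assumed)
SELECT
  DATE(e.event_timestamp) AS event_date,
  e.mkt_campaign AS campaign,
  COUNT(DISTINCT e.domain_userid) AS visitors,
  ANY_VALUE(s.daily_spend) AS daily_spend
FROM `my_project.snowplow.events` AS e
JOIN `my_project.marketing.spend` AS s
  ON DATE(e.event_timestamp) = s.spend_date
  AND e.mkt_campaign = s.campaign_name
WHERE e.event_name = 'page_view'
  AND e._PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY event_date, campaign;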

Architecture

1. Trackers → Collector

JavaScript, mobile, or server-side trackers send events to the Snowplow Collector (often running on Cloud Run or Kubernetes), typically as GET /i pixel requests or POST requests to /com.snowplowanalytics.snowplow/tp2.

2. Collector → Pub/Sub Topic

The collector writes each raw event to a Google Pub/Sub topic (snowplow-raw). Payloads remain as base-64 encoded Thrift records.

3. Enrichment & Loading

A Dataflow Streaming Engine job subscribes to snowplow-raw, performs validation and enrichment (user-agent parsing, geo-IP lookup, PII pseudonymization), and outputs self-describing JSONs to a second topic (snowplow-enriched-good).

The same job (or a separate template) flattens the self-describing JSONs into BigQuery-compatible rows and inserts them into a sharded table such as events_YYYYMMDD, or better, a partitioned table clustered by event_name and device_platform.
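
A minimal DDL sketch of such a destination table (the column list is illustrative; in practice the loader derives the schema from your tracker and Iglu schemas):

-- Ingestion-time partitioned, clustered by event_name and device_platform
CREATE TABLE IF NOT EXISTS `my_project.snowplow.events` (
  event_id STRING,
  event_name STRING,
  device_platform STRING,
  user_id STRING,
  domain_userid STRING,
  page_urlpath STRING,
  event_timestamp TIMESTAMP
)
PARTITION BY DATE(_PARTITIONTIME)
CLUSTER BY event_name, device_platform;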

Step-by-Step Implementation

1. Create the Pub/Sub Topics

gcloud pubsub topics create snowplow-raw
gcloud pubsub topics create snowplow-enriched-good
gcloud pubsub topics create snowplow-enriched-bad

2. Deploy the Scala Stream Collector

On Cloud Run:

gcloud run deploy sp-collector \
--image ghcr.io/snowplow/scala-stream-collector-pubsub:2.10.0 \
--set-env-vars GOOGLE_CLOUD_PROJECT=$PROJECT,STREAM=good,PORT=8080 \
--allow-unauthenticated

Point your trackers at the resulting HTTPS endpoint.

3. Launch the Dataflow Template

gcloud dataflow flex-template run sp-enrich-load-$(date +%s) \
--template-file-gcs-location gs://dataflow-templates/latest/flex/Snowplow_Streaming_BigQuery \
--region us-central1 \
--parameters \
inputTopic=projects/$PROJECT/topics/snowplow-raw,\
outputTopic=projects/$PROJECT/topics/snowplow-enriched-good,\
badRowsTopic=projects/$PROJECT/topics/snowplow-enriched-bad,\
bigQueryDataset=snowplow,\
bigQueryTable=events,\
useLegacySql=false

The template auto-creates the dataset and a partitioned table. Clustering can be added via the clusteringFields parameter.

4. Verify Inserts

Open BigQuery and run:

SELECT COUNT(*)
FROM `my_project.snowplow.events`
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

You should see an ever-increasing row count.

Best Practices

Leverage Partitioning & Clustering

Streaming inserts land in the correct partition automatically when you define the table as PARTITION BY _PARTITIONTIME. Clustering by high-cardinality columns like user_id or domain_userid keeps scans inexpensive.
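
Queries then benefit from filtering on both the partition and a clustered column. A sketch (the user identifier literal is, of course, hypothetical):

-- Partition pruning via _PARTITIONTIME plus cluster pruning via domain_userid
SELECT event_name, event_timestamp, page_urlpath
FROM `my_project.snowplow.events`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND domain_userid = 'example-domain-userid';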

Schema Evolution via Iglu

Snowplow’s self-describing event and context schemas live in an Iglu Registry. When you add a new context or bump a schema version, the loader can dynamically create the required nested column in BigQuery—no manual DDL.
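
As an illustration, assuming a custom context published as iglu:com.acme/product/jsonschema/1-0-0, the loader would typically expose it as a repeated column you can UNNEST. The exact column name depends on your loader version, so treat this as a sketch:

-- Querying a hypothetical self-describing context column
SELECT
  e.event_id,
  p.sku,
  p.price
FROM `my_project.snowplow.events` AS e,
  UNNEST(e.contexts_com_acme_product_1_0_0) AS p
WHERE e._PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);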

Cost Controls

  • Use EXPORT DATA to archive cold partitions to GCS (see the SQL sketch after this list).
  • Set partition expiration (e.g., 180 days) on non-critical datasets.
  • Route malformed events to a dead_letter topic; reprocess in batches rather than re-streaming to save Dataflow hours.
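
The first two bullets might look like this in SQL (a sketch; the bucket name, dataset, and date are assumptions):

-- Archive one cold day of events to GCS as Parquet
EXPORT DATA OPTIONS (
  uri = 'gs://my-archive-bucket/snowplow/2024-01-01/*.parquet',
  format = 'PARQUET'
) AS
SELECT *
FROM `my_project.snowplow.events`
WHERE DATE(_PARTITIONTIME) = '2024-01-01';

-- Expire partitions older than 180 days
ALTER TABLE `my_project.snowplow.events`
SET OPTIONS (partition_expiration_days = 180);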

Monitoring

Cloud Monitoring uptime checks plus a simple SQL probe (row counts by minute) catch ingestion stalls quickly. Combine with Pub/Sub subscription backlog alerts.
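
A sketch of that probe, using the column names assumed throughout this page:

-- Events loaded per minute over the last hour; a sudden drop to zero signals a stall
SELECT
  TIMESTAMP_TRUNC(event_timestamp, MINUTE) AS minute,
  COUNT(*) AS events
FROM `my_project.snowplow.events`
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY minute
ORDER BY minute DESC;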

Practical Analytics Example

Suppose you want to measure funnel completion in the last 30 minutes:

WITH page_views AS (
  SELECT
    user_id,
    MIN(CASE WHEN page_urlpath = '/checkout' THEN event_timestamp END) AS started,
    MIN(CASE WHEN page_urlpath = '/order-confirmation' THEN event_timestamp END) AS completed
  FROM `my_project.snowplow.events`
  WHERE event_name = 'page_view'
    AND _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  GROUP BY user_id
)
SELECT
  COUNTIF(completed IS NOT NULL) AS orders,
  COUNT(*) AS checkouts,
  SAFE_DIVIDE(COUNTIF(completed IS NOT NULL), COUNT(*)) AS conversion_rate
FROM page_views
WHERE started > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE);

With Galaxy’s AI copilot, you could draft this query with a single prompt like “give me conversion rate from checkout to order confirmation in the past 30 minutes,” then share it with your team via a Collection.

Common Mistakes & How to Fix Them

1. Ignoring Partitioning

Why it’s wrong: Streaming into a non-partitioned table forces every query to scan the entire dataset, ballooning costs.
Fix: Pre-create the table as PARTITION BY _PARTITIONTIME or let the template create a partitioned destination.

2. Not Handling Failed Inserts

Why it’s wrong: Malformed events (bad rows) clog Pub/Sub subscriptions and silently drop data.
Fix: Supply badRowsTopic; periodically drain it to GCS and re-run the Snowplow Bad Rows Transformer.

3. Skipping Schema Registry Governance

Why it’s wrong: Without versioned schemas you risk breaking downstream queries when fields appear/disappear.
Fix: Host a private Iglu Registry, enforce Pull Request reviews for schema additions, and pin tracker versions.

Galaxy & BigQuery

Once events land in BigQuery, Galaxy acts as the IDE-grade SQL editor where engineers and analysts iterate on models, save endorsed queries, and rely on context-aware autocompletion drawn from the Snowplow dataset’s metadata. The AI copilot can even suggest partition filters automatically to keep queries bill-friendly.

Next Steps

  • Automate Dataflow deployment with Terraform or Cloud Deploy.
  • Set up row-level security in BigQuery for GDPR compliance (a sketch follows this list).
  • Pipe cleansed events back to operational systems via BigQuery Subscriptions.
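
For the row-level security item, a minimal sketch (the group email and country list are assumptions; geo_country comes from geo-IP enrichment):

-- Limit an analyst group to EU traffic only
CREATE ROW ACCESS POLICY eu_only
ON `my_project.snowplow.events`
GRANT TO ('group:eu-analysts@my-company.com')
FILTER USING (geo_country IN ('DE', 'FR', 'NL'));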

Why Streaming Snowplow Events into BigQuery is important

Batch-loaded behavioral data often arrives too late for conversion optimization and alerting. Streaming Snowplow events into BigQuery provides sub-minute visibility, letting teams iterate on product features, detect incidents, and feed machine-learning models continuously—all without managing servers.

Streaming Snowplow Events into BigQuery Example Usage


SELECT
  user_id,
  event_name,
  event_timestamp
FROM `my_project.snowplow.events`
WHERE DATE(_PARTITIONTIME) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
LIMIT 100;

Frequently Asked Questions (FAQs)

Is streaming Snowplow into BigQuery expensive?

BigQuery streaming inserts cost $0.01 per 200 MB; Pub/Sub and Dataflow prices scale with throughput. Partitioning and clustering keep query costs low, and you can export cold partitions to storage.

Can I use Galaxy to query Snowplow data?

Yes. Galaxy connects to BigQuery and offers IDE-style autocomplete for Snowplow’s nested schema. The AI copilot can suggest filters, aggregations, and even convert legacy SQL to standard SQL.

What happens to bad rows?

Malformed events are routed to a bad-rows Pub/Sub topic (snowplow-enriched-bad in the setup above). You can run Snowplow’s Bad Rows Transformer to fix and replay them, or archive them for compliance audits.

How do I evolve the schema safely?

Publish each new event or context schema to an Iglu Registry, bump the version, and update the tracker. The loader adds new columns automatically while preserving historical data.
