This resource explains what a Customer Data Platform (CDP) is, why it matters, and how it works. You’ll learn core components, implementation steps, common pitfalls, and best practices. Practical SQL examples and Galaxy-specific tips show how engineers and analysts can leverage their existing skills to build or enrich a CDP.
A Customer Data Platform (CDP) is packaged software that ingests, cleans, unifies, and activates first-party customer data from multiple sources. Unlike traditional CRMs (which focus on sales interactions) or DMPs (which rely on third-party cookies), a CDP builds a persistent, unified profile for each customer by merging behavioral, transactional, and demographic data. These profiles power real-time personalization, analytics, and downstream marketing campaigns.
Collects raw events from web, mobile, backend systems, and third-party SaaS. Popular formats include JSON, CSV, and Parquet. Streaming technologies like Kafka or Kinesis often underpin real-time pipelines.
Combines multiple identifiers (cookie IDs, email, device IDs) into one customer ID using deterministic or probabilistic matching.
-- Example deterministic match
CREATE TABLE unified_ids AS
SELECT COALESCE(email_hash, device_id, cookie_id) AS customer_id,
email_hash, device_id, cookie_id
FROM raw_identity_events;
A scalable database (often columnar like Snowflake or BigQuery) holds the latest attributes for each customer. CDPs use SCD2 (slowly changing dimension type 2) patterns to maintain history.
Marketers create rule-based or ML-driven segments. Under the hood, these segments are SQL queries or materialized views.
APIs and pre-built integrations push data to email, ads, or in-app messaging platforms.
Start with concrete goals (e.g., “increase email click-through 15%”). Document the events and attributes you’ll need.
Create a lineage diagram listing each source: website, mobile app, POS, CRM, support system, etc.
-- Star schema example
CREATE TABLE dim_customer (
customer_id STRING PRIMARY KEY,
first_name STRING,
last_name STRING,
created_at TIMESTAMP,
is_subscriber BOOLEAN
);
CREATE TABLE fact_order (
order_id STRING PRIMARY KEY,
customer_id STRING REFERENCES dim_customer(customer_id),
order_total NUMERIC,
order_ts TIMESTAMP
);
Consider denormalized wide tables (customer_360
) for activation speed.
Use match rules such as email_hash
equality or deterministic lookup tables. Pro tip: store match_confidence
scores.
Cloud data warehouses (Snowflake, BigQuery) or lakehouses (Databricks) handle large volumes. Columnar storage improves segment queries.
Stream events into Kafka topics, then sink to the warehouse. Low-latency CDC (change data capture) tools like Debezium capture database changes.
Push segments to marketing tools. Measure performance and refine.
WITH latest_profile AS (
SELECT DISTINCT ON (customer_id) *
FROM dim_customer
ORDER BY customer_id, updated_at DESC
),
order_metrics AS (
SELECT customer_id,
COUNT(*) AS orders,
SUM(order_total) AS lifetime_value,
MAX(order_ts) AS last_order_ts
FROM fact_order
GROUP BY customer_id
)
SELECT p.*, o.orders, o.lifetime_value, o.last_order_ts
FROM latest_profile p
LEFT JOIN order_metrics o USING (customer_id);
CREATE OR REPLACE VIEW seg_churn_risk AS
SELECT customer_id
FROM customer_360
WHERE lifetime_value > 500
AND last_order_ts < NOW() - INTERVAL '90 days';
Most CDPs expose an endpoint like /v1/segments/{id}/export
. Retrieve seg_churn_risk
, then POST to an email service.
customer_360
and seg_churn_risk
in a shared library so marketers always hit the same source of truth.Because Galaxy connects to Snowflake, BigQuery, Postgres, and more, you can prototype and productionize CDP data models without leaving your IDE-grade editor.
PitfallFixMessy event names (e.g., Page View
vs PageView
)Implement a naming convention doc and lint events at ingestion.Over-merging identities (false positives)Use multi-key matching and confidence thresholds.Segment performance lagsAdd materialized views, increase warehouse size, or cache hot segments.GDPR deletion requests missedCreate an automated delete pipeline; log every deletion job.
customer_360
query.