Calculating Retention Cohorts with SQL

How do I calculate retention cohorts with SQL?

Retention cohort analysis groups users by their acquisition period and tracks how many of them return or stay active over subsequent periods, letting teams measure product stickiness.

Welcome to the Galaxy, Guardian!
You'll be receiving a confirmation email

Follow us on twitter :)

Oops! Something went wrong while submitting the form.

Description

Example H2

Example H3

Retention Cohorts Explained

Retention cohort analysis groups users by the moment they first appear in your data—usually a signup date—and then measures how many of those same users perform a follow-up action (log-in, purchase, session) in later time buckets. By comparing the size of a cohort to its later activity counts you can answer questions like “What percentage of February sign-ups are still active three months later?”

Why SQL Is Perfect for Cohorts

Relational databases already store time-stamped user events. SQL’s grouping, date functions, and window functions make it easy to:

Assign each user to an acquisition cohort.
Roll activity events into uniform grain (day, week, month).
Calculate the elapsed period between acquisition and activity.
Compute retention rates with one query that runs inside your warehouse—no exports, no spreadsheets.

Key Concepts

Acquisition Event

The event that establishes cohort membership, e.g., users.created_at.

Activity Event

The follow-up behavior you care about—sessions, orders, messages sent.

Time Grain

The bucket size (day, week, month) applied consistently to both acquisition and activity timestamps.

Period Index

An integer representing how many buckets have passed since acquisition. Period 0 is the signup bucket.

Data Requirements

You need two tables (or one denormalized table):

users – one row per user with a created_at timestamp.
events – many rows per user capturing all activity (event name, event_time).

Ensure timestamps are in the same time zone and indexed for performance.

Step-by-Step Methodology

1. Define Cohort

Bucket created_at into your chosen grain with DATE_TRUNC (Postgres/Snowflake) or TRUNC (BigQuery).

2. Roll Up Activity

Similarly bucket each qualifying activity event.

3. Join Acquisition to Activity

Connect users to their activity rows via JOIN, keeping only events on or after the acquisition date.

4. Calculate Period Index

Use DATE_PART, TIMESTAMP_DIFF, or plain arithmetic on epoch days to get an integer offset.

5. Count Distinct Users

Aggregate by cohort_date and period_index.

6. Compute Retention %

With a window function, divide each period’s active count by period 0’s count for the same cohort.

Canonical SQL Pattern (Postgres)

WITH signups AS ( SELECT user_id, DATE_TRUNC('month', created_at)::date AS cohort_month FROM users ), activity AS ( SELECT user_id, DATE_TRUNC('month', event_time)::date AS activity_month FROM events WHERE event_name = 'session' ), cohorts AS ( SELECT s.cohort_month, a.activity_month, DATE_PART('month', AGE(a.activity_month, s.cohort_month)) AS months_since_signup, COUNT(DISTINCT a.user_id) AS active_users FROM signups s JOIN activity a USING (user_id) WHERE a.activity_month >= s.cohort_month GROUP BY 1,2,3 ) SELECT cohort_month, months_since_signup, active_users, FIRST_VALUE(active_users) OVER (PARTITION BY cohort_month ORDER BY months_since_signup) AS cohort_size, ROUND(active_users * 100.0 / FIRST_VALUE(active_users) OVER (PARTITION BY cohort_month ORDER BY months_since_signup), 2) AS retention_rate FROM cohorts ORDER BY cohort_month, months_since_signup;

The result is a flat table you can pivot in BI or graph as a heatmap.

Best Practices

Use consistent grain. Mixing weekly cohorts with monthly activity skews rates.
Filter bots & test accounts. Garbage in, garbage out.
Snapshot cohort size. Store period 0 counts as a fact table to keep historical accuracy when backfilling late data.
Add a calendar dimension. Speeds joins and makes period math vendor-agnostic.
Automate refreshes. Schedule daily jobs so dashboards stay current.

Common Pitfalls

Counting events instead of users

Retention measures unique users. Summing events inflates rates. Always COUNT(DISTINCT user_id).

Including pre-signup activity

Failing to filter activity_date >= signup_date will yield negative period indices and nonsense retention.

Shifting grains mid-analysis

Analysts sometimes switch from monthly cohorts to weekly activity when zooming in—this produces non-comparable percentages.

End-to-End Example

Imagine an e-commerce app with monthly cohorts. February sign-ups (15,000 users) generate:

Period 0 (Feb): 15,000 active (100 %)
Period 1 (Mar): 7,800 active (52 %)
Period 2 (Apr): 4,050 active (27 %)

The query above produces these numbers, and you can visualize them as a row in a retention matrix.

Visualizing Retention

Warehouses output a tall table; your BI tool (or a quick Python notebook) can pivot on cohort_month vs months_since_signup and color by retention_rate.

Using Galaxy for Cohorts

Galaxy’s modern SQL editor helps you build cohort queries faster:

Context-aware autocomplete suggests DATE_TRUNC, AGE, and the correct timestamp columns as you type.
AI copilot can translate “monthly retention by signup” into ready-to-run SQL, then refactor it when your schema changes.
Collections let your team endorse the final cohort query, so everyone uses the exact same logic.
Shareable links let PMs open the query in read-only mode and tweak parameters without touching production code.

Conclusion

Retention cohorts are one of the fastest ways to gauge product health. With disciplined SQL patterns and modern tooling like Galaxy, you can calculate, share, and iterate on retention metrics in minutes—no spreadsheets required.

Why Calculating Retention Cohorts with SQL is important

Retention reveals whether your product delivers ongoing value or just drives one-time sign-ups. Cohort analysis pinpoints when and why users churn, guides feature prioritization, and surfaces the ROI of marketing campaigns. Engineers and analysts who can build cohort queries directly in the warehouse avoid spreadsheet errors, automate reporting, and empower teams to make data-backed decisions faster.

Calculating Retention Cohorts with SQL Example Usage

Calculating Retention Cohorts with SQL Syntax

Common Mistakes

Mistake: Counting total events instead of unique users. Why it’s wrong: A single user can generate many events, inflating retention. Fix: Use COUNT(DISTINCT user_id) when aggregating activity.
Mistake: Forgetting to filter activity dates that occur before the signup date. Why it’s wrong: Creates negative period indices and misclassifies users. Fix: Add WHERE activity_date >= signup_date in the join condition.
Mistake: Mixing different time grains (weekly cohorts, daily activity). Why it’s wrong: Period 0 counts may not align with later periods, distorting percentages. Fix: Choose one grain for both cohort definition and activity bucketing, and stick to it.

Frequently Asked Questions (FAQs)

What is a retention cohort?

A retention cohort is a group of users who all performed an initial action—usually signing up—within the same time bucket. Analysts then track how many users in that group return or stay active in subsequent buckets, producing a retention curve.

Which time grain should I use—day, week, or month?

Choose the grain that matches your product’s usage cycle. B2C apps often use weekly cohorts; enterprise SaaS commonly uses monthly. Consistency is more important than the specific grain.

How can I build cohort analysis faster in Galaxy?

Galaxy’s AI copilot can autogenerate the cohort SQL from a plain-English prompt, while Collections let your team endorse the final query so everyone reuses the same logic without pasting SQL in Slack.

Does cohort analysis work on event-streaming datasets?

Yes. As long as event timestamps are stored in a warehouse table, you can bucket them with DATE_TRUNC and apply the same SQL patterns described here.