Duplicate rows appear after a JOIN when the join keys are not unique or when one-to-many/many-to-many relationships are not handled with aggregation, filters, or window functions.
Why Is My Join Returning Duplicates?
A deep dive into the causes of duplicate rows after SQL joins—and proven techniques to prevent or eliminate them.
Few things are more frustrating than running a simple JOIN only to discover the result set has ballooned with unexpected duplicates. Understanding why this happens is essential for writing correct analytics queries, optimizing ETL pipelines, and keeping dashboards trustworthy.
If either table contains multiple rows that match the join key in the other table, every matching combination is returned. This Cartesian multiplication is expected behavior—but can look like duplication when you expected a single row per entity.
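A small sketch makes the fan-out concrete (the customers and orders tables and their columns here are hypothetical):

```sql
-- Hypothetical tables: joining customers to orders returns one row per
-- matching pair, so a customer with three orders appears three times.
SELECT c.customer_id, c.name, o.order_id
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;
-- If customer 1 placed orders 101, 102, and 103, the result contains
-- three rows for customer 1 -- correct join semantics, not a bug.
```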
Using columns that are not unique identifiers introduces ambiguity. For instance, joining only on first_name and last_name when multiple people share those names will naturally duplicate the parent row.
Sometimes a query joins on the correct key but forgets an additional filter (e.g., latest record, active flag, date range). Rows outside the desired slice appear as duplicates.
Upstream ETL jobs might have accidentally inserted true duplicates or null keys, amplifying the problem when joins occur downstream.
Before reaching for SELECT DISTINCT, diagnose the root cause:

SELECT join_key, COUNT(*)
FROM table
GROUP BY 1
HAVING COUNT(*) > 1;
Ensure each side of the join is at the same grain. Aggregate child tables first if you only care about one row per parent:
WITH order_totals AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT c.*, ot.total_spent
FROM customers c
LEFT JOIN order_totals ot USING (customer_id);
SELECT DISTINCT removes duplicates but can hide underlying data issues. A safer pattern is ROW_NUMBER() to pick a canonical record while keeping visibility:
WITH ranked_events AS (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
    FROM events e
)
SELECT *
FROM ranked_events
WHERE rn = 1;
If you only need the latest status or active records, explicitly filter:
SELECT c.*, s.status
FROM customers c
JOIN status_log s ON c.id = s.customer_id
WHERE s.is_current = TRUE;
Where business logic dictates uniqueness, add constraints or indexes. This not only prevents future duplicate data but helps query planners optimize joins.
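For example, a uniqueness guarantee can be declared directly on the table (the table and constraint names below are hypothetical; exact DDL syntax varies by database):

```sql
-- Sketch: enforce one row per customer_id so joins against this
-- table can never fan out on that key.
ALTER TABLE customer_dim
    ADD CONSTRAINT uq_customer_dim_customer_id UNIQUE (customer_id);
```

With the constraint in place, attempts to insert a second row for the same customer_id fail loudly at write time instead of silently duplicating rows at read time.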
Galaxy’s SQL editor surfaces row count diffs after each run, making it obvious when a join unexpectedly explodes in size. You can:

- Apply an AI-suggested ROW_NUMBER() fix.
- Run the COUNT(*) uniqueness diagnostic.
- Avoid a blanket DISTINCT for auditability.

Joining on email alone when multiple accounts share the same address leads to duplication. Include the account_id or enforce uniqueness.
Warehouse tables often carry versioned records. Always filter to the latest version to avoid duplicate entity rows.
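One way to do this is to join the versioned table back to its per-entity maximum version (a sketch, assuming hypothetical entity_id and version columns):

```sql
-- Sketch: keep only the newest version of each entity before joining
-- it to anything else, so each entity contributes exactly one row.
SELECT v.*
FROM entity_versions v
JOIN (
    SELECT entity_id, MAX(version) AS max_version
    FROM entity_versions
    GROUP BY entity_id
) latest
  ON v.entity_id = latest.entity_id
 AND v.version   = latest.max_version;
```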
A LEFT JOIN can still duplicate parent rows if the right side has multiple matches. Aggregate or deduplicate the right side first.
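Reusing the status_log example from earlier, a sketch of deduplicating the right side before the LEFT JOIN (updated_at is an assumed timestamp column, not shown in the original schema):

```sql
-- Sketch: collapse status_log to one row per customer first, so the
-- LEFT JOIN preserves exactly one row per parent.
WITH latest_status AS (
    SELECT customer_id, MAX(updated_at) AS updated_at
    FROM status_log
    GROUP BY customer_id
)
SELECT c.*, s.status
FROM customers c
LEFT JOIN latest_status ls ON ls.customer_id = c.id
LEFT JOIN status_log s
  ON s.customer_id = ls.customer_id
 AND s.updated_at  = ls.updated_at;
```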
Duplicate rows after a join are rarely a database bug; they are a signal that the data model or query grain needs attention. By diagnosing the relationship type, validating uniqueness, and applying the right SQL patterns (aggregation, window functions, filters), you can guarantee clean, reliable join results—keeping dashboards credible and pipelines efficient.
Duplicate rows inflate metrics, slow queries, and can lead to incorrect business decisions. Understanding and preventing them ensures data accuracy, optimizes performance, and builds trust in analytics pipelines.
Run SELECT join_key, COUNT(*) FROM table GROUP BY 1 HAVING COUNT(*) > 1. Any returned row indicates a non-unique join key.
It removes visible duplicates but hides underlying data issues. Prefer aggregation, filters, or window functions to ensure one row per entity.
Yes. Galaxy highlights row count changes after each step, and its AI Copilot can suggest deduplication patterns like ROW_NUMBER(). You can also share the query in a Collection for team review.
Indexes improve lookup speed but don’t change join multiplicity. You need unique constraints or data cleaning to prevent duplicate matches.