Duplicate rows after a SQL JOIN usually stem from one-to-many or many-to-many relationships, improper join conditions, or non-unique keys that multiply matching combinations.
Seeing more rows than you expected after a JOIN is almost always a data-model or join-logic issue, not a database bug. Understanding how row multiplication occurs will help you write faster, cleaner, and more reliable SQL.
When most engineers complain about duplicates they really mean extra rows—records that match the join criteria but exceed the desired cardinality. The physical rows may be unique in the table, but they repeat business entities in the result set. Differentiating logical duplicates from true, byte-for-byte duplicates (SELECT DISTINCT
fixes the latter) is the first diagnostic step.
If table orders
has multiple rows for a single customer_id
, joining the two tables without aggregation multiplies each customer record by the number of matching orders.
Developers often assume columns such as email
or sku
are unique when they are not. Without a unique constraint the optimizer will correctly join every matching combination, inflating the output.
Filtering conditions (WHERE
) applied before joining versus after joining can dramatically change row counts. A misplaced date filter is a classic cause of unnecessary multiplication.
LEFT or FULL OUTER JOINs that later filter on nullable columns (WHERE b.id IS NOT NULL
) implicitly convert the outer join to an inner join, producing unexpected pairings that look like duplicates.
Run SELECT key, COUNT(*) FROM table GROUP BY key HAVING COUNT(*) > 1
on each side of the join to verify uniqueness.
Strip the query down to the core FROM a JOIN b ON ...
and compare the count to your expectations.
Select the join key and a synthetic row number (ROW_NUMBER() OVER (PARTITION BY ...)
) to surface multiplicative combinations.
Look for Hash Join or Nested Loop Join operations with high row estimates. Skewed statistics often indicate a missing constraint or index.
Where possible, add PRIMARY KEY
or UNIQUE
constraints. This prevents bad data and lets the optimizer assume one-to-one cardinality.
If you only need the latest order per customer, aggregate in a subquery or CTE first:
WITH latest_order AS (
SELECT customer_id, MAX(order_date) AS last_order_dt
FROM orders
GROUP BY customer_id
)
SELECT c.*, o.last_order_dt
FROM customers c
JOIN latest_order o USING (customer_id);
Sometimes a SEMI JOIN
or using EXISTS
is conceptually correct if you merely want to know whether a related row exists.
When upstream constraints cannot be changed, apply DISTINCT
or window functions (ROW_NUMBER()... WHERE rn = 1
). Be aware this hides data issues rather than solving them.
Keep an ER diagram or data contract that specifies one-to-one versus one-to-many relationships. Treat it as code—review and version control it.
Using consistent key names (customer_id
in all tables) reduces the risk of accidental cross joins.
EXISTS
for Presence TestsThis removes extra columns from consideration and prevents accidental duplication.
A nightly job that checks key uniqueness catches data drift before dashboards break.
Assume you need the total spend per customer. A naive query:
SELECT c.customer_id,
c.name,
o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
If a customer has 42 orders, you get 42 rows. The correct approach aggregates first:
SELECT c.customer_id,
c.name,
SUM(o.total_amount) AS lifetime_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
Galaxy’s modern SQL editor accelerates this workflow:
DISTINCT
or aggregation on the fly.Duplicate rows bloat result sets, distort aggregates, slow dashboards, and can ripple into costly business decisions. Knowing how joins amplify cardinality is foundational for data quality, query performance, and trustworthy analytics. Fixing the root cause—rather than masking with DISTINCT—keeps pipelines lean and maintainable.
Run SELECT key_col, COUNT(*) FROM table GROUP BY key_col HAVING COUNT(*) > 1
. Any returned row indicates non-uniqueness.
Not inherently, but relying on it without understanding why duplicates occur is risky. Use it as a last resort after validating the data model.
EXISTS
returns true on the first match, preventing row multiplication. An INNER JOIN returns all matches, which may create duplicates.
Galaxy’s AI copilot flags non-unique join keys, offers aggregation suggestions, and lets you compare result sets side-by-side, speeding up root-cause analysis.