Removing duplicates in SQL refers to identifying repeated rows and eliminating extras while preserving at least one accurate record.
Learn fast, reliable techniques—DISTINCT, GROUP BY, window functions, and CTEs—to find and delete duplicate rows across popular databases.
Removing duplicates in SQL means writing queries that identify identical rows and eliminate the extras, keeping one authoritative record. Techniques vary by dialect but share a goal: data integrity.
Duplicates stem from missing constraints, batch imports, merge errors, or lack of idempotent ETL logic. Cleaning them protects analytics accuracy and downstream applications.
Use GROUP BY with HAVING COUNT(*) > 1 to list keys that appear multiple times, or leverage ROW_NUMBER() window functions to flag repeats in one pass.

SELECT id, COUNT(*)
FROM sales
GROUP BY id
HAVING COUNT(*) > 1;   -- ids that appear more than once
Wrap ROW_NUMBER() in a CTE, then delete rows where the row number is greater than 1; this keeps the first occurrence based on the chosen sort criteria. Join the delete back on a physical row identifier such as PostgreSQL's ctid rather than on the duplicate key itself, otherwise every occurrence, including the first, matches the join and is deleted.

WITH dup AS (
  SELECT ctid AS row_id,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS rn
  FROM sales
)
DELETE FROM sales
USING dup
WHERE sales.ctid = dup.row_id
  AND dup.rn > 1;   -- every row after the first in each id group
SELECT DISTINCT returns unique rows for reporting or inserts into a new table, but it doesn’t change the original data unless you replace or overwrite the table.
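As a minimal sketch, assuming every column of the duplicated rows matches exactly (sales_unique is an illustrative name, not part of the examples above):

SELECT DISTINCT * FROM sales;              -- de-duplicated view for reporting

CREATE TABLE sales_unique AS
SELECT DISTINCT * FROM sales;              -- copies unique rows; sales itself is unchanged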
Create a temp table with DISTINCT rows, truncate the original table, then re-insert from the temp table. Keeping the staged copy until the results are verified gives you a rollback path.
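A sketch of that workflow, again assuming rows are exact duplicates; run it inside a transaction where your database treats TRUNCATE transactionally (PostgreSQL does, MySQL does not):

CREATE TEMP TABLE sales_clean AS
SELECT DISTINCT * FROM sales;   -- stage the unique rows

TRUNCATE TABLE sales;           -- wipe the original
INSERT INTO sales
SELECT * FROM sales_clean;      -- reload only the de-duplicated rows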
Add primary keys or UNIQUE constraints, stage raw loads, validate keys during ETL, and schedule deduplication jobs as safety nets.
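For example, a sketch that assumes the business key is a single order_id column (an illustrative name) and that existing duplicates have already been removed:

ALTER TABLE sales
ADD CONSTRAINT uq_sales_order_id UNIQUE (order_id);   -- blocks new duplicates at write time

-- PostgreSQL: make the load idempotent so re-runs skip rows that already exist
INSERT INTO sales (order_id, created_at, amount)
VALUES (1001, now(), 49.99)
ON CONFLICT (order_id) DO NOTHING;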
Postgres, SQL Server, Snowflake, Redshift, BigQuery, and MySQL 8+ support window-function deletion. Older MySQL versions rely on self-joins; Oracle typically uses DELETE ... WHERE ROWID NOT IN with the minimum ROWID per key.
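In Oracle, that pattern is a sketch like the following, keeping one physical row per duplicate id:

DELETE FROM sales
WHERE ROWID NOT IN (
  SELECT MIN(ROWID)   -- keep one arbitrary physical row per id
  FROM sales
  GROUP BY id
);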
Galaxy’s AI copilot autowrites deduplication CTEs, highlights duplicate keys in the result grid, and lets teams endorse canonical queries in shared Collections.
Dirty data skews KPIs, inflates counts, and erodes trust. Deduplication ensures accurate analytics, consistent customer views, and reliable machine-learning training sets. Automated, repeatable SQL patterns let data engineers fix issues quickly without manual exports.
No. Use CTEs or subqueries with ROW_NUMBER() to target duplicates, then DELETE.
Wrapping the DELETE in a transaction (BEGIN ... COMMIT) lets you validate row counts before committing and issue a quick ROLLBACK if the numbers look wrong.
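A minimal PostgreSQL-style sketch that reuses the dup CTE from earlier; the validation query and what counts look "right" are assumptions to check against your own data:

BEGIN;

WITH dup AS (
  SELECT ctid AS row_id,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS rn
  FROM sales
)
DELETE FROM sales
USING dup
WHERE sales.ctid = dup.row_id
  AND dup.rn > 1;

-- compare the reported DELETE count and remaining totals against expectations
SELECT COUNT(*) AS remaining_rows, COUNT(DISTINCT id) AS distinct_ids
FROM sales;

COMMIT;   -- or ROLLBACK; if the counts look wrong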
Yes. Galaxy’s AI copilot reads table metadata and proposes optimized CTEs, saving manual coding time.
Use a self-join on MIN(id) to preserve the first row, or upgrade to a version that supports analytic functions.
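For older MySQL, that self-join looks roughly like this, assuming an auto-increment id column and that rows sharing the same order_id (an illustrative column) are duplicates:

DELETE s1
FROM sales AS s1
JOIN sales AS s2
  ON s1.order_id = s2.order_id   -- same business key
 AND s1.id > s2.id;              -- keep only the MIN(id) row per key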