GROUP BY creates explicit groups usually paired with aggregate functions, while DISTINCT returns only unique rows of the selected columns without creating formal groups.
SQL GROUP BY vs DISTINCT
GROUP BY organizes rows into groups so you can aggregate them, whereas DISTINCT filters duplicate rows out of the result set. Picking the right clause affects the correctness, performance, and maintainability of your queries.
At first glance, GROUP BY and DISTINCT appear to do something similar: they collapse duplicate data so you see a shorter result set. The similarity is superficial. DISTINCT is a deduplication operator; it filters the final projection of your query so every row is unique for the columns listed in the SELECT list. GROUP BY partitions the underlying rows into logical buckets called groups. Once groups exist you can perform aggregate calculations (SUM(), COUNT(), AVG(), MIN(), MAX(), or user-defined aggregates) over each bucket.
Knowing when to use each clause directly affects both correctness and cost: choosing DISTINCT when you need an aggregate returns misleading results, and choosing GROUP BY when you merely need unique rows adds unnecessary computation.
The DISTINCT keyword applies to the result set after projection. Conceptually the engine performs these steps:
1. Evaluate FROM and WHERE to build the row set.
2. Project the SELECT list.
3. Apply DISTINCT as a duplicate filter across the selected columns.
The output contains no aggregates unless you explicitly add one to the SELECT list. An aggregate without GROUP BY is computed once over the whole row set, before DISTINCT runs, so it yields a single row rather than a per-value result.
SELECT DISTINCT country
FROM customers;
Because no aggregate is involved, DISTINCT is ideal. The database scans the table or an index on country, deduplicates by sorting or hashing, and stops.
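To try the deduplication yourself, the query can be run against a throwaway database. The following is a minimal sketch, assuming SQLite through Python's standard sqlite3 module; the sample rows are invented for illustration, and only the customers table and country column come from the text above.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (country, city) VALUES (?, ?)",
    [("US", "Austin"), ("US", "Boston"), ("DE", "Berlin"), ("US", "Austin")],
)

# DISTINCT deduplicates the projected column.
# ORDER BY is added only to make the output order stable.
countries = [
    row[0]
    for row in conn.execute("SELECT DISTINCT country FROM customers ORDER BY country")
]
print(countries)  # ['DE', 'US']
```

Four input rows collapse to two output rows, and no per-country information (such as how many customers each country has) survives the deduplication.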
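GROUP BY adds an extra step:
1. Evaluate FROM and WHERE.
2. Partition the remaining rows into groups, one per distinct combination of the GROUP BY expressions.
3. Compute any aggregates once per group and project the SELECT list.
If no aggregate functions appear, most databases still honor the grouping: you get one row per group, just like with DISTINCT. But the intent differs, and the optimizer’s choices may differ with it.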
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country;
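Without GROUP BY, computing COUNT per country takes a workaround such as a correlated subquery or a window function combined with DISTINCT; GROUP BY is the direct, portable way to express it.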
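The per-country count can be checked on a small data set. This is an illustrative sketch, assuming SQLite via Python's sqlite3 module and made-up rows; the query itself is the one shown above with an ORDER BY added so the output is deterministic.

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customers (country) VALUES (?)",
    [("US",), ("US",), ("DE",), ("US",), ("FR",)],
)

# One output row per group; COUNT(*) is evaluated separately inside each group.
counts = conn.execute(
    """
    SELECT country, COUNT(*) AS customer_count
    FROM customers
    GROUP BY country
    ORDER BY country
    """
).fetchall()
print(counts)  # [('DE', 1), ('FR', 1), ('US', 3)]
```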
You can sometimes replace one with the other:
-- Using DISTINCT
SELECT DISTINCT country, city
FROM customers;
-- Using GROUP BY (no aggregates)
SELECT country, city
FROM customers
GROUP BY country, city;
Both return unique (country, city) pairs. Internally the optimizer may even rewrite one form into the other. But there are subtle differences:
- Column projection: with GROUP BY, the SELECT list may contain only grouping columns, aggregates, and columns functionally dependent on the grouping key (some engines are looser than the standard here). With DISTINCT, every column you add to the SELECT list becomes part of the deduplication key.
- Evaluation order: HAVING and the SELECT projection are evaluated after GROUP BY but before DISTINCT.
For DISTINCT, databases usually perform a hash aggregate or sort-deduplicate on the selected columns. Memory consumption scales with the cardinality (number of unique rows).
For GROUP BY, similar physical operators (hash or sort aggregate) are used, but the engine must also compute aggregate functions. Some databases support streaming aggregation (no intermediate hash table) if the input is sorted on the grouping keys.
For pure deduplication, DISTINCT may be slightly faster because the planner can skip computing aggregates and sometimes use index-only scans. But if you need aggregates, GROUP BY is generally faster and clearer.
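One way to see both the equivalence and the planner's view is to run the two forms side by side. The sketch below assumes SQLite through Python's sqlite3 module and invented sample rows. EXPLAIN QUERY PLAN is SQLite's own syntax, and its output text varies by version, so the plans are printed rather than compared against a fixed string; other engines expose plans through their own EXPLAIN variants.

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (country, city) VALUES (?, ?)",
    [("US", "Austin"), ("US", "Austin"), ("US", "Boston"), ("DE", "Berlin")],
)

distinct_sql = "SELECT DISTINCT country, city FROM customers"
group_by_sql = "SELECT country, city FROM customers GROUP BY country, city"

# Compare as sets because neither query guarantees row order without ORDER BY.
distinct_rows = set(conn.execute(distinct_sql))
group_by_rows = set(conn.execute(group_by_sql))
print(distinct_rows == group_by_rows)  # True

# The last column of each EXPLAIN QUERY PLAN row holds the human-readable detail.
for label, sql in (("DISTINCT", distinct_sql), ("GROUP BY", group_by_sql)):
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    print(label, [row[-1] for row in plan])
```

On a table this small the plans say little about speed; the point is the workflow of checking results and plans together before preferring one form.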
- Use DISTINCT for deduplication, GROUP BY for aggregation.
- When you need non-aggregated columns, GROUP BY a key column and select columns functionally dependent on it; verify constraints so results remain deterministic.
- Compare execution plans: sometimes switching from DISTINCT to GROUP BY (or vice versa) yields big wins.
DISTINCT and GROUP BY are not interchangeable. While results can match, DISTINCT operates on the projection, not the raw rows, and lacks built-in aggregation semantics.
GROUP BY is not inherently slow. Slowness usually stems from missing indexes or mis-written queries. When aggregates are required, GROUP BY is often more efficient.
You can use DISTINCT inside an aggregate, but the logic is different. COUNT(DISTINCT col) deduplicates a single column before counting and returns one number, whereas GROUP BY col with COUNT(*) counts the rows for each value of col.
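The difference between counting distinct values and counting rows per group is easy to see on a few rows. This is a small sketch, assuming SQLite via Python's sqlite3 module; the orders table and its contents are hypothetical and exist only for this illustration. It also shows how putting DISTINCT inside SUM can quietly change a total.

```python
import sqlite3

# Hypothetical table and rows, not taken from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("ann", 10), ("bob", 10), ("bob", 25)],
)

# One number: how many different customers appear at all.
unique_customers = conn.execute(
    "SELECT COUNT(DISTINCT customer) FROM orders"
).fetchone()[0]

# One row per customer: how many order rows each one has.
rows_per_customer = conn.execute(
    "SELECT customer, COUNT(*) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# SUM(DISTINCT ...) drops repeated amounts before adding them up.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
distinct_total = conn.execute("SELECT SUM(DISTINCT amount) FROM orders").fetchone()[0]

print(unique_customers)       # 2
print(rows_per_customer)      # [('ann', 2), ('bob', 2)]
print(total, distinct_total)  # 55 35
```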
Suppose you manage a SaaS application with a page_views table:
CREATE TABLE page_views (
    user_id    bigint,
    session_id uuid,
    page_url   text,
    viewed_at  timestamp
);
-- Unique sessions seen yesterday (from yesterday 00:00 up to, but not including, today 00:00)
SELECT COUNT(DISTINCT session_id)
FROM page_views
WHERE viewed_at >= CURRENT_DATE - INTERVAL '1 day'
  AND viewed_at < CURRENT_DATE;
DISTINCT is perfect here: we only need the number of unique session_id values.
-- Page views per user for yesterday
SELECT user_id, COUNT(*) AS views
FROM page_views
WHERE viewed_at >= CURRENT_DATE - INTERVAL '1 day'
  AND viewed_at < CURRENT_DATE
GROUP BY user_id;
Now each group is a user.
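Both page_views queries can be exercised end to end on a handful of rows. The sketch below is an adaptation under stated assumptions: it uses SQLite through Python's sqlite3 module, stores session_id and viewed_at as TEXT because SQLite lacks uuid and timestamp types, and replaces the CURRENT_DATE arithmetic with a fixed, parameterized window so the result is reproducible. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no uuid or timestamp types, so both columns are stored as TEXT here.
conn.execute(
    "CREATE TABLE page_views (user_id INTEGER, session_id TEXT, page_url TEXT, viewed_at TEXT)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?, ?)",
    [
        (1, "s1", "/home", "2024-05-01 09:00:00"),
        (1, "s1", "/pricing", "2024-05-01 09:05:00"),
        (1, "s2", "/home", "2024-05-01 18:30:00"),
        (2, "s3", "/home", "2024-05-01 12:00:00"),
        (2, "s4", "/docs", "2024-05-02 08:00:00"),  # falls outside the window
    ],
)

# Half-open window [start, end); ISO-8601 strings compare correctly as text.
window = ("2024-05-01 00:00:00", "2024-05-02 00:00:00")

# Deduplicate session_id, then count: one number for the whole window.
unique_sessions = conn.execute(
    "SELECT COUNT(DISTINCT session_id) FROM page_views "
    "WHERE viewed_at >= ? AND viewed_at < ?",
    window,
).fetchone()[0]

# Group by user, then count rows in each group: one row per user.
views_per_user = conn.execute(
    """
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    WHERE viewed_at >= ? AND viewed_at < ?
    GROUP BY user_id
    ORDER BY user_id
    """,
    window,
).fetchall()

print(unique_sessions)  # 3
print(views_per_user)   # [(1, 3), (2, 1)]
```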
Because Galaxy is a developer-oriented SQL editor with AI assistance, it can:
- Suggest rewriting DISTINCT queries into GROUP BY when you start adding aggregates.
- Flag non-grouped columns in the SELECT list to prevent runtime errors.
- Surface execution-plan insights so you can compare DISTINCT vs GROUP BY performance side by side.
DISTINCT filters duplicates; GROUP BY forms groups for aggregation. Understanding the difference lets you write correct, performant SQL and leverage modern tools like Galaxy’s copilot effectively.
Choosing between GROUP BY and DISTINCT impacts query correctness, performance, and maintenance. Developers who confuse the two risk returning wrong numbers, feeding misleading dashboards, and burning compute. Mastery matters for anyone aggregating data, especially at scale, where the wrong operator can multiply costs.
DISTINCT and GROUP BY are not the same thing. DISTINCT removes duplicate rows after projection, while GROUP BY partitions rows so you can apply aggregate functions like SUM or COUNT to each group.
Which one is faster depends on the task. For pure deduplication, DISTINCT can be faster. When you need aggregates, GROUP BY usually wins. Always test with EXPLAIN plans.
You can use DISTINCT and GROUP BY in the same query, but the semantics are tricky and the combination is rarely needed. For example, you might GROUP BY a column and use COUNT(DISTINCT other_column) to count unique values inside each group.
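Combining GROUP BY with COUNT(DISTINCT ...) can be illustrated as follows. This is a minimal sketch, assuming SQLite via Python's sqlite3 module, a trimmed-down page_views table, and made-up rows; it contrasts rows per group with unique values per group.

```python
import sqlite3

# Hypothetical rows; the table is reduced to the two columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, session_id TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(1, "s1"), (1, "s1"), (1, "s2"), (2, "s3"), (2, "s3")],
)

# COUNT(*) counts every row in the group.
# COUNT(DISTINCT session_id) counts each session once per group.
per_user = conn.execute(
    """
    SELECT user_id,
           COUNT(*) AS views,
           COUNT(DISTINCT session_id) AS sessions
    FROM page_views
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(per_user)  # [(1, 3, 2), (2, 2, 1)]
```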
Galaxy’s AI copilot reviews your query context and warns when you mix non-grouped columns with aggregates or when a DISTINCT could be rewritten as a faster, clearer GROUP BY. It also surfaces execution-plan insights so you can compare alternatives.