Removing duplicate rows in pandas involves identifying repeated records in a DataFrame and eliminating them with functions like duplicated() and drop_duplicates().
In pandas, a duplicate row is any record in a DataFrame where some or all of the column values exactly match an earlier row. When you ingest raw files, join data sources, or iterate through data transformations, duplicates can appear and skew counts, aggregates, and machine-learning models.
Failing to resolve duplicate rows leads to:
- inflated counts and aggregate metrics
- corrupted machine-learning training sets
- wasted storage and compute in downstream systems
Cleaning duplicates is therefore a critical data-quality step in ETL pipelines, exploratory analysis, and production workloads.
`DataFrame.duplicated()` returns a Boolean Series marking rows as `True` if they are duplicates of earlier rows. It is useful for inspection, filtering, or conditional logic.
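For example, a quick sanity check might look like this (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 2], "customer": ["A", "B", "B"]})

# Conditional logic on the Boolean mask returned by duplicated()
if df.duplicated().any():
    print(f"{df.duplicated().sum()} duplicate row(s) found")
```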
`DataFrame.drop_duplicates()` removes duplicate rows and returns a new DataFrame (or modifies the original in place with `inplace=True`). Core parameters:
- `subset` – list of columns to consider when verifying duplicates
- `keep` – `'first'` (default), `'last'`, or `False`
- `inplace` – whether to modify the original DataFrame

```python
import pandas as pd
data = {
"order_id": [1, 2, 2, 3, 4, 4, 4],
"customer": ["A", "B", "B", "C", "D", "D", "D"],
"paid_usd": [100, 50, 50, 75, 30, 30, 30],
}
df = pd.DataFrame(data)
print("Original\n", df)
# 1. Flag duplicates by all columns
flags = df.duplicated()
print("\nFlags all cols\n", flags)
# 2. Remove duplicates based on all columns, keep first
clean = df.drop_duplicates()
print("\nClean all cols\n", clean)
# 3. Remove duplicates based on order_id only, keep last
clean_last = df.drop_duplicates(subset=["order_id"], keep="last")
print("\nClean on ID keep last\n", clean_last)
If you need the most recent record per key, sort by `updated_at` in descending order first, then run `drop_duplicates(keep='first')` to retain the latest version (or sort ascending and use `keep='last'`).
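For example, a minimal sketch of that pattern (the `status` column here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-15"]),
    "status": ["created", "shipped", "created"],
})

# Newest first, then keep the first occurrence of each order_id
latest = (
    df.sort_values("updated_at", ascending=False)
      .drop_duplicates(subset=["order_id"], keep="first")
)
print(latest)
```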
For summarization rather than row removal, `groupby` across keys and aggregate, or take the last row per group:
```python
# Keep only the most recent row per order_id (rows sorted oldest to newest)
latest_notes = (
    df.sort_values("timestamp")
      .groupby("order_id")
      .tail(1)
)
```
When dealing with tens of millions of rows, selectively project the key columns to minimize memory usage:
```python
# Build the duplicate mask from just the key columns, then apply it to the full frame
keys = df[["id", "updated_at"]]
mask = keys.duplicated()
subset = df.loc[~mask]
```
A few best practices and pitfalls to keep in mind:

- Pass an explicit `subset` for clarity; implicit all-column checks can hide issues.
- Avoid `inplace=True` unless you intend to overwrite the original DataFrame.
- The pandas index itself can contain duplicates; always verify with `index.is_unique`.
- `keep=False` removes all copies, which can erase desired data. Back up or review the flags before committing.
- Strings like `'1'` and numbers like `1` may look identical but won't deduplicate; cast types consistently first (see the sketch after this list).
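A small sketch tying the last two points together (the column names and index values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"order_id": ["1", 1, 2, "2"], "customer": ["A", "A", "B", "B"]},
    index=[0, 0, 1, 2],                  # deliberately non-unique index
)

print(df.index.is_unique)                # False: the index itself has duplicates

print(len(df.drop_duplicates()))         # 4 -- the string '1' and the number 1 differ

df["order_id"] = df["order_id"].astype(int)   # cast to one consistent dtype
print(len(df.drop_duplicates()))         # 2 -- duplicates now collapse
```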
Although pandas runs in Python, deduplication is often the staging step before loading cleaned data into a relational data warehouse. When that data lands in a database, engineers may query it with a SQL editor such as Galaxy. By resolving duplicates before loading, you prevent downstream analysts from writing unwieldy `SELECT DISTINCT` statements and keep Galaxy queries lean and performant.
Removing duplicate rows is indispensable to trustworthy analytics. With `duplicated()`, `drop_duplicates()`, and a few best practices, pandas makes the task straightforward, scalable, and reproducible.
Duplicate records inflate metrics, corrupt machine-learning training sets, and waste storage. In production ETL pipelines, removing them in pandas ensures that only clean, unique data is analyzed or loaded to downstream systems.
Use `df[df.duplicated(subset=cols, keep=False)]` to view all duplicates while leaving the DataFrame unchanged.
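For example (a small sketch; `cols` is whichever key columns matter for your data):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["A", "B", "B", "C"],
})

cols = ["order_id"]
# keep=False flags every copy, so you can review all duplicates side by side
dupes = df[df.duplicated(subset=cols, keep=False)]
print(dupes)   # both rows with order_id == 2
```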
`duplicated()` creates a Boolean mask that flags duplicates; `drop_duplicates()` actually removes them (optionally in place).
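A tiny equivalence sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2]})

mask = df.duplicated()            # Boolean Series: [False, True, False]
removed = df.drop_duplicates()    # drops the flagged row
filtered = df[~mask]              # same result, built from the mask

assert removed.equals(filtered)
```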
Yes, you can deduplicate on multiple columns. Pass a list to `subset`, e.g., `df.drop_duplicates(subset=["first_name", "last_name"])`.
Galaxy focuses on SQL editing. While it doesn’t deduplicate pandas DataFrames directly, cleaning data in pandas before loading to a database ensures Galaxy users write simpler, faster SQL queries.