Removing duplicate rows in pandas refers to the process of eliminating repeated observations in a DataFrame, typically by using the drop_duplicates() method.
Removing Duplicate Rows in Pandas
Learn the theory and practice of eliminating duplicate rows in pandas—from the basics of drop_duplicates() to advanced, production-grade techniques.
Duplicate records are a silent threat to data quality. Whether they arise from system glitches, bad joins, or user error, duplicates can distort aggregates, inflate counts, and generally derail your analytics. In the pandas ecosystem, the DataFrame.drop_duplicates() method is the primary tool for excising these anomalies. Mastering this function—and the related best practices—will help ensure your analyses are both accurate and performant.
Data engineers and analysts use pandas as a staging ground for exploratory analysis, feature engineering, and even production pipelines. A single duplicate row can inflate counts and aggregates, bias machine-learning models, and violate data-integrity constraints downstream. Therefore, duplicate removal is a cornerstone of any data-quality checklist.
DataFrame.drop_duplicates() scans the DataFrame for identical rows—or identical subsets of columns—and returns a DataFrame in which duplicates have been removed. Key parameters include:
- subset: Columns to consider when identifying duplicates.
- keep: Which duplicate to keep—"first" (default), "last", or False (drop all duplicates).
- inplace: If True, mutate the DataFrame instead of returning a copy.
- ignore_index: If True, reindex the result from 0.

Under the hood, pandas hashes each row—or the specified subset—and uses a hash table to flag repeated values as duplicates. This makes the operation O(n) on average, enabling fast execution even on large data sets.
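As a quick illustration of keep=False and ignore_index=True, here is a minimal sketch; the data and the "email" column are made up for the example:

import pandas as pd

# Hypothetical example data; the "email" column is invented for illustration.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "plan":  ["free",    "pro",     "free",    "pro"],
})

# keep=False drops *all* rows whose "email" value repeats;
# ignore_index=True renumbers the surviving rows from 0.
unique_emails = df.drop_duplicates(subset=["email"], keep=False, ignore_index=True)
print(unique_emails)  # keeps only b@x.com and c@x.com, reindexed 0 and 1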
Not all duplicates are equal. Sometimes an entire row is duplicated; at other times, only a business-key column is repeated. Using the subset parameter, you can drop duplicates based on a subset of columns while preserving additional contextual data in the remaining columns.
# Drop rows where "user_id" + "event_time" repeat
clean_df = df.drop_duplicates(subset=["user_id", "event_time"])
The keep parameter controls which record is preserved. Typical strategies include keeping the first occurrence (the default), keeping the last occurrence after an explicit sort, and dropping every copy; each is illustrated in the examples below.
import pandas as pd
data = [
{"user_id": 1, "country": "US", "event_time": "2024-06-21 10:01", "amount": 25},
{"user_id": 1, "country": "US", "event_time": "2024-06-21 10:01", "amount": 25}, # duplicate
{"user_id": 2, "country": "FR", "event_time": "2024-06-21 11:15", "amount": 42},
{"user_id": 3, "country": "US", "event_time": "2024-06-21 11:59", "amount": 13},
{"user_id": 2, "country": "FR", "event_time": "2024-06-21 11:15", "amount": 42}, # duplicate
]
df = pd.DataFrame(data)
df_no_dupes = df.drop_duplicates()
print(df_no_dupes)
Result:
   user_id country        event_time  amount
0        1      US  2024-06-21 10:01      25
2        2      FR  2024-06-21 11:15      42
3        3      US  2024-06-21 11:59      13
Suppose you consider user_id the primary key and want to keep the latest event.
latest = (
    df.sort_values("event_time")
      .drop_duplicates(subset=["user_id"], keep="last")
      .reset_index(drop=True)
)
print(latest)
# Remove any user_id appearing more than once
no_repeats = df[~df.duplicated(subset=["user_id"], keep=False)]
When memory is tight or the DataFrame is gigantic, set inplace=True:
df.drop_duplicates(inplace=True)
- Quantify duplicates first with df.duplicated().sum() or value_counts() to avoid accidental data loss.
- When keeping "first" or "last," define sort order explicitly so pandas doesn’t rely on load order, which may be arbitrary.
- Store the untouched DataFrame or a snapshot (e.g., Parquet) before destructive operations for auditability.
- In production ETL pipelines, encapsulate deduplication logic in a reusable function and validate the outcome with assertions (see the sketch after this list).
- Record the number of rows removed and the criteria applied, ideally in a metadata table or monitoring system.
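A minimal sketch of such a utility; the function name, key columns, and logging setup are illustrative rather than prescribed by pandas:

import logging
import pandas as pd

logger = logging.getLogger(__name__)

def deduplicate(df: pd.DataFrame, keys: list[str], sort_by: str) -> pd.DataFrame:
    """Drop duplicate rows on `keys`, keeping the latest record per `sort_by`."""
    before = len(df)
    clean = (
        df.sort_values(sort_by)
          .drop_duplicates(subset=keys, keep="last")
          .reset_index(drop=True)
    )
    # Record how many rows were removed and validate the outcome.
    logger.info("Removed %d duplicate rows on keys %s", before - len(clean), keys)
    assert not clean.duplicated(subset=keys).any(), "duplicates survived deduplication"
    return clean

# Example usage (column names are illustrative):
# clean_df = deduplicate(df, keys=["user_id"], sort_by="event_time")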
- By default, keep="first" retains the earliest duplicate. If the most recent record is desired, you must sort accordingly and set keep="last".
- Deduplicating on too narrow a subset can delete rows that still differ in the omitted columns, leading to information loss (see the sketch below).
- Using inplace=True without version control or a copy puts you at risk of irreversible changes.
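A small illustration of that second pitfall; the data below is invented for the example:

import pandas as pd

# Two events for user 1 with *different* amounts; the second row is not a true duplicate.
events = pd.DataFrame([
    {"user_id": 1, "amount": 25},
    {"user_id": 1, "amount": 99},  # hypothetical second purchase
    {"user_id": 2, "amount": 42},
])

# Deduplicating on user_id alone keeps only the first row per user,
# so the 99-amount record is silently lost along with its information.
print(events.drop_duplicates(subset=["user_id"]))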
Sometimes duplicates should be removed only under certain conditions—e.g., rows created within the same minute. Combine boolean masks with .duplicated():
# Flag user_id repeats whose event falls within a specific minute
event_minute = pd.to_datetime(df["event_time"]).dt.floor("min")
mask = (
    (event_minute == pd.Timestamp("2024-06-21 10:01")) &
    df.duplicated(subset=["user_id"], keep=False)
)
pruned = df[~mask]
For >50 million rows, use Dask DataFrame’s .drop_duplicates(), which mirrors pandas syntax but performs out-of-core, parallel execution.
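A rough sketch of what that can look like, assuming the data sits in a directory of CSV files; the paths and the transaction_id column are placeholders:

import dask.dataframe as dd

# Lazily read many CSVs as one Dask DataFrame (nothing is loaded into memory yet).
ddf = dd.read_csv("transactions/*.csv")

# Same call shape as pandas; Dask schedules the work across partitions.
deduped = ddf.drop_duplicates(subset=["transaction_id"])

# Trigger computation and write the result out without materializing it in RAM.
deduped.to_parquet("transactions_clean/")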
SQL offers SELECT DISTINCT or window functions for deduplication. If your data lives in a database, consider removing duplicates at the source to minimize network IO. Modern SQL editors like Galaxy make crafting and testing these queries painless, after which you can pull already-clean data into pandas.
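As a sketch of that workflow—the connection string, table, and columns are hypothetical—deduplicated rows can be pulled straight into pandas with pd.read_sql:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and table; adjust to your own warehouse.
engine = create_engine("postgresql://user:password@host:5432/analytics")

query = """
    SELECT DISTINCT user_id, country, event_time, amount
    FROM transactions
"""

# The database deduplicates before any rows cross the network.
df = pd.read_sql(query, engine)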
A fintech company receives daily CSV exports of transactions. Due to upstream retries, some days include repeated records. An Airflow DAG ingests the CSVs into a pandas pipeline:
def load_transactions(path: str) -> pd.DataFrame:
    raw = pd.read_csv(path, parse_dates=["event_time"])
    # Keep only the latest record per transaction_id.
    clean = (
        raw.sort_values("event_time")
           .drop_duplicates(subset=["transaction_id"], keep="last")
           .reset_index(drop=True)
    )
    # Fail loudly if any duplicates survive.
    assert clean["transaction_id"].duplicated().sum() == 0
    return clean
The cleaned DataFrame is then pushed to a data warehouse. This simple yet robust approach has saved the company from double-counting millions of dollars in volume reports.
Removing duplicate rows in pandas is more than a single method call; it’s a deliberate process involving profiling, choosing proper keys, and validating results. By following the best practices outlined here, you’ll safeguard data quality and maintain analytic integrity.
Embed deduplication as a reusable utility in your codebase, and—if you query databases—consider leveraging Galaxy’s AI-assisted SQL editor to deduplicate at the source before data even touches pandas.
What does drop_duplicates() actually do?
It returns a DataFrame with duplicate rows removed based on all columns or a specified subset. The original DataFrame remains unchanged unless inplace=True.
How can I keep the most recent duplicate instead of the first?
Sort your DataFrame chronologically and call drop_duplicates(keep="last"). Pandas will retain the final duplicate.
Does drop_duplicates() modify my DataFrame in place?
Only if you set inplace=True. Otherwise, it returns a new DataFrame.
How does deduplication in pandas compare to SQL?
SQL often uses SELECT DISTINCT or window functions; pandas uses drop_duplicates(). The logic is similar, but pandas operates in-memory, whereas SQL works inside the database. Tools like Galaxy help bridge the gap by letting you craft deduplication SQL before moving data to pandas.
Duplicates distort analytics, bias machine-learning models, and violate data-integrity constraints. Efficiently removing them with pandas.drop_duplicates() is essential for reliable data pipelines and accurate insights.