Remove Duplicate Rows in Pandas

Galaxy Glossary

How do I remove duplicate rows in pandas?

Removing duplicate rows in pandas involves identifying repeated records in a DataFrame and eliminating them with functions like duplicated() and drop_duplicates().


Description

Understanding Duplicate Rows

In pandas, a duplicate row is any record in a DataFrame where a subset—or all—of the column values exactly matches a previous row. When you ingest raw files, join data sources, or iterate through data transformations, duplicates can appear and skew counts, aggregates, and machine-learning models.

Why You Must Remove Duplicates

Failing to resolve duplicate rows leads to:

  • Inflated metrics (e.g., double counting revenue)
  • Incorrect training data for ML models
  • Data sync errors when loading to relational databases
  • Longer query times and higher storage costs

Cleaning duplicates is therefore a critical data-quality step in ETL pipelines, exploratory analysis, and production workloads.

Key pandas Tools

duplicated()

DataFrame.duplicated() returns a Boolean Series marking rows as True if they are duplicates of earlier rows. Useful for inspection, filtering, or conditional logic.
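A quick sketch of common inspection patterns, using a small throwaway DataFrame:

import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 2, 3, 3, 3]})

# Flag rows that repeat an earlier row (first occurrences stay False)
print(df.duplicated())

# Flag every member of a duplicate group, including the first occurrence
print(df.duplicated(keep=False))

# Count duplicates for a quick data-quality check
print(df.duplicated().sum())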

drop_duplicates()

DataFrame.drop_duplicates() removes duplicate rows and returns a new DataFrame (or modifies in place with inplace=True). Core parameters:

  • subset – column label or list of labels to consider when identifying duplicates (defaults to all columns)
  • keep – 'first' (default), 'last', or False (drop every copy)
  • inplace – whether to modify the original DataFrame instead of returning a new one
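A minimal sketch of how these parameters combine (column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "plan": ["free", "pro", "free"],
})

# Consider only email; keep the last record per address
deduped = df.drop_duplicates(subset=["email"], keep="last")

# keep=False drops every row whose email appears more than once
unique_only = df.drop_duplicates(subset=["email"], keep=False)

# inplace=True mutates df instead of returning a new DataFrame
df.drop_duplicates(subset=["email"], inplace=True)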

Step-by-Step Example

import pandas as pd

data = {
    "order_id": [1, 2, 2, 3, 4, 4, 4],
    "customer": ["A", "B", "B", "C", "D", "D", "D"],
    "paid_usd": [100, 50, 50, 75, 30, 30, 30],
}

df = pd.DataFrame(data)
print("Original\n", df)

# 1. Flag duplicates by all columns
flags = df.duplicated()
print("\nFlags all cols\n", flags)

# 2. Remove duplicates based on all columns, keep first
clean = df.drop_duplicates()
print("\nClean all cols\n", clean)

# 3. Remove duplicates based on order_id only, keep last
clean_last = df.drop_duplicates(subset=["order_id"], keep="last")
print("\nClean on ID keep last\n", clean_last)

Advanced Techniques

Prioritizing with Sorting

If you need the most recent record, sort by updated_at in descending order first, then run drop_duplicates(keep='first') so the newest version of each key survives.
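A sketch of that pattern, assuming id and updated_at columns:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

# Newest rows first, so keep='first' retains the latest version per id
latest = (
    df.sort_values("updated_at", ascending=False)
      .drop_duplicates(subset=["id"], keep="first")
)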

GroupBy Aggregation

For summarization rather than row removal, group by the key columns and aggregate. To keep one representative row per key (say, the latest note per order), sort and take the tail of each group:

latest_notes = (
    df.sort_values("timestamp")
      .groupby("order_id")
      .tail(1)
)

Memory-Efficient Deduplication

When dealing with tens of millions of rows, selectively project the key columns to minimize memory usage:

# Project only the columns that define uniqueness
keys = df[["id", "updated_at"]]

# Build the duplicate mask on the narrow frame, then filter the full one
mask = keys.duplicated()
subset = df.loc[~mask]

Best Practices

  • Subset Explicitly. Always specify subset for clarity; implicit all-column checks can hide issues.
  • Work on a Copy First. Avoid accidental data loss by testing on a copy before setting inplace=True.
  • Validate Post-Operation. Compare row counts or hash totals to confirm deduplication behaved as expected (see the sketch after this list).
  • Document Business Rules. Decide whether to keep the earliest, latest, or no duplicate and encode that logic clearly.
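A minimal validation sketch, assuming clean was produced from df with drop_duplicates():

# Row counts show how many duplicates were dropped
print(f"Removed {len(df) - len(clean)} duplicate rows")

# The cleaned frame should contain no remaining duplicates
assert not clean.duplicated().any()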

Common Pitfalls & How to Avoid Them

Pitfall 1: Assuming Index Uniqueness

The pandas index itself can contain duplicates; always verify with index.is_unique.
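For instance, a sketch of checking and repairing a duplicated index:

# Row deduplication does not touch the index
if not df.index.is_unique:
    # Keep the first row for each duplicated index label...
    df = df[~df.index.duplicated(keep="first")]
    # ...or renumber instead: df = df.reset_index(drop=True)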

Pitfall 2: Using keep=False Without Backup

keep=False removes every copy of a duplicated row, which can erase data you meant to keep. Back up the DataFrame or review the flagged rows before committing.
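One way to review before committing, sketched with the order_id column from the earlier example:

# Inspect every row involved in duplication before dropping them all
dupes = df[df.duplicated(subset=["order_id"], keep=False)]
print(dupes)

# Only after review: drop every copy
df = df.drop_duplicates(subset=["order_id"], keep=False)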

Pitfall 3: Overlooking Data Type Mismatches

The string '1' and the integer 1 may look identical when printed, but pandas treats them as distinct values, so they will not deduplicate. Cast columns to consistent types first.
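A sketch of normalizing a mixed-type key column before deduplicating (column name illustrative):

# '1' (str) and 1 (int) compare unequal; coerce to one numeric dtype first
df["order_id"] = pd.to_numeric(df["order_id"])
clean = df.drop_duplicates(subset=["order_id"])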

Relation to SQL Workflows & Galaxy

Although pandas runs in Python, deduplication is often the staging step before loading cleaned data to a relational data warehouse. When that data lands in a database, engineers may query it with a SQL editor such as Galaxy. By resolving duplicates before loading, you prevent downstream analysts from writing unwieldy SELECT DISTINCT statements and keep Galaxy queries lean and performant.

Conclusion

Removing duplicate rows is indispensable to trustworthy analytics. With duplicated(), drop_duplicates(), and a few best practices, pandas makes the task straightforward, scalable, and reproducible.

Why Removing Duplicate Rows in Pandas Is Important

Duplicate records inflate metrics, corrupt machine-learning training sets, and waste storage. In production ETL pipelines, removing them in pandas ensures that only clean, unique data is analyzed or loaded to downstream systems.

Frequently Asked Questions (FAQs)

How can I list duplicate rows without removing them?

Use df[df.duplicated(subset=cols, keep=False)] to filter all duplicates while preserving the DataFrame unchanged.

What’s the difference between duplicated() and drop_duplicates()?

duplicated() creates a Boolean mask that flags duplicates; drop_duplicates() actually removes them (optionally in place).

Can I remove duplicates based on multiple columns?

Yes. Pass a list to subset, e.g., df.drop_duplicates(subset=["first_name", "last_name"]).

Does Galaxy help with duplicate handling?

Galaxy focuses on SQL editing. While it doesn’t deduplicate pandas DataFrames directly, cleaning data in pandas before loading to a database ensures Galaxy users write simpler, faster SQL queries.
