A step-by-step set of tasks that novices can follow to transform raw data into reliable, analysis-ready datasets.
Cleaning is the unglamorous but indispensable work that turns messy, real-world data into trustworthy information. A data-cleaning checklist gives beginners a repeatable sequence of tasks—validation, formatting, enrichment, and documentation—so they can produce analysis-ready datasets with confidence.
Data-science folklore says that 80% of a project’s time is spent cleaning data. While the precise number is debatable, every experienced analyst agrees on the pain: missing values, inconsistent units, stray characters, duplicate rows—the list goes on. A checklist does three things for beginners: it keeps critical steps from being skipped, makes the process repeatable across projects, and leaves an audit trail that others can review.
Before writing a single line of code or SQL, answer:
- Where does the data come from, and who owns it?
- What does one row represent (the grain)?
- How is it loaded: full refresh or incremental, and on what schedule?

These questions prevent later surprises like duplicate customer IDs caused by a poorly documented incremental load.
Compare column names, data types, and constraints to source-of-truth documentation (ERDs, API specs, or data contracts). Common tasks:
- Confirm date strings parse as real dates (e.g., `2024-01-31`) and numeric strings cast cleanly to numbers (`'123'` → `123`).
- Verify categorical columns contain only allowed values (e.g., `status IN ('pending','paid','failed')`), as profiled in the sketch below.
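A minimal profiling sketch of those two rules, written against the `staging.orders_2024_05` table used in the worked example later in this article (the `FILTER` clause is Postgres syntax):

```sql
-- Count rows that violate each validation rule before cleaning anything
SELECT
    COUNT(*)                                                          AS total_rows,
    COUNT(*) FILTER (WHERE order_id IS NULL)                          AS missing_order_ids,
    COUNT(*) FILTER (WHERE status NOT IN ('pending','paid','failed')) AS invalid_statuses
FROM staging.orders_2024_05;
```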
People love formatting data in creative ways. Bring order:
- Standardize dates to ISO 8601 (`YYYY-MM-DD`) or your org’s canonical format.
- Normalize text casing (`UPPER` or `lower`), trim whitespace, and remove control characters.

Missingness is information! Determine:
- What a missing value means in context: not applicable, unknown, or not yet loaded.
- Whether to drop, impute, or flag the affected rows (a quick audit sketch follows below).

Document the chosen strategy so subsequent analyses don’t unwittingly double-count imputed data.
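As referenced above, a sketch that quantifies missingness before you commit to a strategy, assuming the `customer_id` column from the worked example:

```sql
-- COUNT(col) skips NULLs, so its gap from COUNT(*) is the null count
SELECT
    COUNT(*)                                                      AS total_rows,
    COUNT(*) - COUNT(customer_id)                                 AS null_customer_ids,
    ROUND(100.0 * (COUNT(*) - COUNT(customer_id)) / COUNT(*), 1)  AS pct_missing
FROM staging.orders_2024_05;
```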
Use primary keys or composite keys to find exact duplicates. Deduplication strategies:
- Keep the most recently updated record per key (as in the sketch below).
- Collapse duplicates by aggregating (e.g., `sum`, `avg`).
- Fuzzy-match near-duplicates that exact keys miss (e.g., `john.doe@gmail.com` vs `john_doe@xyz.com`).
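A sketch of the keep-latest strategy using a window function; this is a portable alternative to the Postgres-specific `DISTINCT ON` used in the worked example later in this article:

```sql
-- Keep the most recently updated row for each order_id
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM staging.orders_2024_05
)
SELECT *
FROM ranked
WHERE rn = 1;  -- rn = 1 is the latest version of each order
```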
Outliers can signal data errors or genuine business events. Start by:
- Profiling distributions (min, max, percentiles).
- Flagging values beyond a rule of thumb such as 1.5× the interquartile range (sketched below).

Decide whether to cap, remove, or leave extreme values, and justify the choice.
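For example, a sketch of the 1.5× IQR rule; the `payments` table and its `amount` column are assumptions introduced for illustration:

```sql
-- Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
WITH bounds AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS q3
    FROM payments
)
SELECT p.*,
       (p.amount < b.q1 - 1.5 * (b.q3 - b.q1)
        OR p.amount > b.q3 + 1.5 * (b.q3 - b.q1)) AS is_outlier
FROM payments AS p
CROSS JOIN bounds AS b;
```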
Once the raw data is trustworthy, derive helpful columns:
- Calculated fields (e.g., `order_total = quantity * unit_price`).
- Elapsed-time features (e.g., `signup_to_purchase_days`).
- Geographic enrichment (e.g., `zip_code` → `state`, `country`); see the sketch after this list.
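A sketch of the first two derivations; the `customers` table and its `signup_date` column are assumptions, not part of the article’s schema:

```sql
-- Derive an order total and a signup-to-purchase duration
SELECT
    o.order_id,
    o.quantity * o.unit_price     AS order_total,
    o.order_date - c.signup_date  AS signup_to_purchase_days  -- Postgres: date - date = integer days
FROM orders AS o
JOIN customers AS c USING (customer_id);
```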
Feature engineering during cleaning keeps downstream models simpler and more reproducible.
The final (and most skipped) step: write down what you did.
Documentation converts one-off cleaning scripts into institutional knowledge.
Where possible, automate key expectations as assertions (e.g., `CHECK sum(amount) > 0`) so pipelines break loudly instead of failing silently.

```sql
-- Example: Cleaning an orders table
WITH raw AS (
    SELECT * FROM staging.orders_2024_05
),
validated AS (
    SELECT *
    FROM raw
    WHERE order_id IS NOT NULL                   -- Primary-key check
      AND status IN ('pending','paid','failed')  -- Allowed statuses only
),
standardized AS (
    SELECT
        order_id,
        customer_id,
        CAST(order_date AS DATE)    AS order_date,
        CAST(unit_price AS NUMERIC) AS unit_price_cents,
        quantity,
        UPPER(TRIM(currency))       AS currency,  -- Normalize casing and whitespace
        created_at,
        updated_at
    FROM validated
),
deduped AS (
    SELECT DISTINCT ON (order_id) *   -- Postgres: one row per order_id,
    FROM standardized
    ORDER BY order_id, updated_at DESC  -- preferring the latest update
)
SELECT
    *,
    quantity * unit_price_cents AS order_total_cents  -- Derived column
FROM deduped;
```
This query walks through several checklist steps: validation, standardization, deduplication, and feature derivation.
Galaxy’s modern SQL editor accelerates checklist execution: its AI copilot helps draft and refine cleaning queries, and shared queries let teammates review and reuse each step.
While Galaxy doesn’t clean data for you, it removes friction from writing, sharing, and iterating on cleaning code.
Mistake 1: Casting types implicitly.
Why it’s wrong: Silent casting may truncate data (e.g., `'123.45'` → `123`).
Fix: Cast explicitly and add `TRY_CAST` checks to detect failures, as sketched below.
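A sketch of that fix in dialects that support `TRY_CAST` (e.g., SQL Server, Snowflake, DuckDB); the `raw_price` column is illustrative:

```sql
-- TRY_CAST returns NULL instead of erroring, which surfaces bad rows
SELECT raw_price,
       TRY_CAST(raw_price AS DECIMAL(10,2)) AS parsed_price
FROM staging.orders_2024_05
WHERE raw_price IS NOT NULL
  AND TRY_CAST(raw_price AS DECIMAL(10,2)) IS NULL;  -- present but unparseable
```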
Mistake 2: Deduplicating without the true key.
Why it’s wrong: Removing all but one record may erase legitimate distinct rows.
Fix: Deduplicate by the true primary key (e.g., `order_id` and `item_id`).
Mistake 3: Treating NULL as zero.
Why it’s wrong: A null revenue means “unknown,” while zero revenue means “free.”
Fix: Use CASE statements or separate flags to distinguish the two, as in the sketch below.
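A sketch of the distinguishing CASE expression, using an assumed `revenue` column:

```sql
-- Keep 'unknown' (NULL) separate from 'free' (zero) instead of coalescing to 0
SELECT
    order_id,
    CASE
        WHEN revenue IS NULL THEN 'unknown'
        WHEN revenue = 0     THEN 'free'
        ELSE 'paid'
    END AS revenue_class
FROM orders;
```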
Start every data project with the checklist in mind. Automate each step, version-control your code, and document decisions. You’ll gain cleaner inputs, faster iterations, and far fewer late-night bug hunts.
Without a repeatable cleaning process, analysts waste hours debugging bad joins, developers ship code on faulty assumptions, and executives make decisions on misleading metrics. A checklist brings discipline and transparency, ensuring every stakeholder can trust the numbers driving products and strategy.
Q: Can I combine or skip checklist steps for small projects?
A: Yes, but with pragmatism. For a quick ad-hoc analysis you can combine steps (e.g., validate and standardize together), yet skipping any entirely risks hidden errors.

Q: How often should the cleaning checklist be re-run?
A: Whenever the source data refreshes, schemas change, or business rules evolve. Automating the pipeline to run on each data load is the safest approach.

Q: How does Galaxy help with data cleaning?
A: Galaxy’s SQL editor and AI copilot accelerate writing, testing, and sharing cleaning queries. While Galaxy doesn’t execute ETL jobs, it streamlines the collaborative SQL work that underpins a robust cleaning process.

Q: Are there alternatives to SQL or Python for data cleaning?
A: Spreadsheet functions, OpenRefine, and low-code ETL platforms offer visual interfaces. However, learning basic SQL or Python quickly pays dividends in automation and scalability.