This resource introduces data cleaning—why it matters, the steps involved, popular techniques, and common pitfalls. You’ll practice with hands-on SQL examples, learn best practices, and see how Galaxy’s lightning-fast editor and AI Copilot can speed up and standardize your data-cleaning workflow.
Data cleaning (also called data cleansing or scrubbing) is the process of identifying and fixing errors, inconsistencies, or inaccuracies in raw data so it becomes fit for analysis or production use. Think of it as tidying up a messy workshop before starting any serious project—the cleaner your workspace, the better your results.
A repeatable workflow helps you stay organized and avoid re-introducing old errors. A common four-stage framework is: inspect and profile the data, clean it, validate the results, and document what you did.
Start with exploratory queries and summary statistics:
-- Count rows and preview columns
SELECT COUNT(*) AS total_rows FROM raw.customer_events;
-- Null percentage per column
SELECT
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_null_email,
  SUM(CASE WHEN signup_date IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_null_signup
FROM raw.customer_events;
In Galaxy, you can save these inspection queries to a Collection called “Data Profiling” so your team can reuse them.
Common issues and their SQL fixes:

Duplicate rows:
-- Keep only the most recent copy of each id.
-- Joining on ctid (Postgres's physical row identifier) rather than id ensures
-- the surviving rn = 1 row is not deleted along with its duplicates.
DELETE FROM t
USING (
  SELECT ctid,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM t
) d
WHERE t.ctid = d.ctid AND d.rn > 1;

Inconsistent casing:
UPDATE users SET email = LOWER(email);

Out-of-range values:
UPDATE orders SET quantity = NULL WHERE quantity < 0;
Galaxy’s AI Copilot can generate or optimize these statements based on your table schema, saving you keystrokes and preventing syntax errors.
After cleaning, run assertions:
-- Expect no negative quantities
SELECT COUNT(*) AS bad_rows
FROM orders
WHERE quantity < 0;
If bad_rows returns anything other than 0, your cleaning step failed. Endorse the validation query in Galaxy so it becomes part of your automated checks.
NULLs can arise during ETL failures, manual input errors, or schema changes. Decide whether to delete, impute, or flag them.
-- Replace NULL prices with the median price
UPDATE products
SET price = sub.median_price
FROM (
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
FROM products WHERE price IS NOT NULL) sub
WHERE products.price IS NULL;
They inflate counts and distort aggregates. Use PRIMARY KEY constraints or UNIQUE indexes where possible, and routinely scan for violations.
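As a sketch, a UNIQUE index blocks new duplicates at the source once the table is clean, and a grouped scan surfaces existing violations (the table and index names here are illustrative):

```sql
-- Block future duplicates (run after deduplicating; creation fails if duplicates remain)
CREATE UNIQUE INDEX uq_orders_order_id ON staging.orders (order_id);

-- Routinely scan for violations in tables that lack such constraints
SELECT order_id, COUNT(*) AS copies
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```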
Dates might appear as “2024-01-03”, “01/03/24”, or “3 Jan 2024”. Standardize early:
UPDATE events
SET event_date = TO_DATE(event_date::TEXT, 'YYYY-MM-DD')
WHERE event_date ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
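The pass above only normalizes strings already in ISO form; other formats need their own pass. A sketch for the US-style “01/03/24” variant (assuming two-digit years):

```sql
-- Separate pass for MM/DD/YY strings;
-- Postgres maps a two-digit year to the year nearest 2020, so '24' becomes 2024
UPDATE events
SET event_date = TO_DATE(event_date::TEXT, 'MM/DD/YY')
WHERE event_date ~ '^[0-9]{2}/[0-9]{2}/[0-9]{2}$';
```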
Negative ages or 10,000-hour session lengths often point to tracking bugs.
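One way to surface these is a simple range check before deciding whether to null, cap, or investigate; the table name and thresholds below are illustrative, not universal:

```sql
-- Flag implausible values for manual review (table and thresholds are illustrative)
SELECT *
FROM analytics.sessions
WHERE age < 0
   OR age > 120
   OR session_hours > 24;
```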
Download ecommerce_dirty.csv (≈5 MB) or import it directly into your database and follow along.
SELECT COUNT(*) AS total,
       SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS missing_id,
       SUM(CASE WHEN total_amount < 0 THEN 1 ELSE 0 END) AS negative_amount
FROM staging.orders;
CREATE TABLE cleaned.orders AS
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM staging.orders) t
WHERE rn = 1;
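A quick sanity check that the deduplication worked: the row count should now equal the number of distinct order IDs.

```sql
-- After deduplication, these two counts should match
SELECT COUNT(*)                 AS row_count,
       COUNT(DISTINCT order_id) AS distinct_order_ids
FROM cleaned.orders;
```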
ALTER TABLE cleaned.orders
ALTER COLUMN order_date TYPE DATE USING order_date::DATE,
ALTER COLUMN total_amount TYPE NUMERIC(12,2);
UPDATE cleaned.orders
SET shipping_state = 'Unknown'
WHERE shipping_state IS NULL;
SELECT COUNT(*) AS invalid_rows
FROM cleaned.orders
WHERE total_amount < 0 OR order_id IS NULL;
If invalid_rows = 0, congratulate yourself: you just cleaned a dataset! Save the final query in a Galaxy Collection named “🧹 Cleaned Orders”.
Although data cleaning is conceptually tool-agnostic, the right environment can shave hours off your workflow:
For example, ask the AI Copilot to deduplicate order_id in staging.orders and receive an optimized SQL snippet instantly.