What Is Data Cleaning? A Beginner’s Guide

This resource introduces data cleaning—why it matters, the steps involved, popular techniques, and common pitfalls. You’ll practice with hands-on SQL examples, learn best practices, and see how Galaxy’s lightning-fast editor and AI Copilot can speed up and standardize your data-cleaning workflow.

Learning Objectives

  • Understand what data cleaning is and why it is critical for any data workflow.
  • Identify common data quality issues such as duplicates, nulls, and inconsistent formats.
  • Learn the typical end-to-end data-cleaning process—inspection, transformation, validation, and documentation.
  • Apply hands-on SQL techniques to clean real-world datasets.
  • Discover best practices and tools—including how Galaxy’s features simplify cleaning at scale.

1. What Is Data Cleaning?

Data cleaning (also called data cleansing or scrubbing) is the process of identifying and fixing errors, inconsistencies, or inaccuracies in raw data so it becomes fit for analysis or production use. Think of it as tidying up a messy workshop before starting any serious project—the cleaner your workspace, the better your results.

1.1 Why It Matters

  • Accuracy: Dirty data leads to misleading metrics, faulty machine-learning models, and bad business decisions.
  • Trust: Stakeholders lose confidence when numbers keep changing due to hidden issues.
  • Cost: Gartner estimates that poor-quality data costs organizations an average of $12.9M annually.

2. The Data-Cleaning Workflow

A repeatable workflow helps you stay organized and avoid re-introducing old errors. A common four-stage framework is:

  1. Inspection: Explore and profile the raw data to locate problems.
  2. Transformation: Apply fixes—e.g., type casting, trimming whitespace, standardizing dates.
  3. Validation: Verify that transformations achieved the desired quality goals.
  4. Documentation: Record what was done so teammates (and your future self) understand the lineage.

2.1 Inspection Techniques

Start with exploratory queries and summary statistics:

-- Count rows and preview columns
SELECT COUNT(*) AS total_rows FROM raw.customer_events;

-- Null percentage per column
SELECT
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_null_email,
  SUM(CASE WHEN signup_date IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_null_signup
FROM raw.customer_events;
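
Duplicate keys are also worth profiling up front. A minimal sketch, assuming email should uniquely identify a customer in raw.customer_events:

-- Surface duplicate emails and how often each repeats
SELECT email, COUNT(*) AS occurrences
FROM raw.customer_events
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;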

In Galaxy, you can save these inspection queries to a Collection called “Data Profiling” so your team can reuse them.

2.2 Transformation Techniques

Common issues and their SQL fixes:

  • Duplicate rows (Postgres; deduplicating on the physical row id, ctid, rather than on id itself ensures the newest copy survives instead of every duplicated row being deleted):

    DELETE FROM t
    USING (
      SELECT ctid,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
      FROM t
    ) d
    WHERE t.ctid = d.ctid AND d.rn > 1;

  • Inconsistent casing:

    UPDATE users SET email = LOWER(email);

  • Out-of-range values:

    UPDATE orders SET quantity = NULL WHERE quantity < 0;

Galaxy’s AI Copilot can generate or optimize these statements based on your table schema—saving you keystrokes and preventing syntactic errors.

2.3 Validation Checks

After cleaning, run assertions:

-- Expect no negative quantities
SELECT COUNT(*) AS bad_rows
FROM orders
WHERE quantity < 0;

If bad_rows returns anything other than 0, your cleaning step failed. Endorse the validation query in Galaxy so it becomes part of your automated checks.
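You can also bundle several assertions into a single pass over the table. A sketch, assuming an orders table with order_id and quantity columns:

-- All checks in one scan; every count should be zero
SELECT
  SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) AS negative_quantity,
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS missing_order_id
FROM orders;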

3. Common Data Quality Problems

3.1 Missing Values

NULLs can creep in through ETL failures, manual-entry errors, or schema changes. Decide whether to delete, impute, or flag them.

-- Replace NULL prices with the median price
UPDATE products
SET price = sub.median_price
FROM (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
  FROM products
  WHERE price IS NOT NULL
) sub
WHERE products.price IS NULL;
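
If you would rather flag gaps than overwrite them, a minimal sketch (the is_price_missing column is hypothetical):

-- Keep the gap visible instead of imputing over it
ALTER TABLE products ADD COLUMN IF NOT EXISTS is_price_missing BOOLEAN DEFAULT FALSE;
UPDATE products SET is_price_missing = TRUE WHERE price IS NULL;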

3.2 Duplicates

They inflate counts and distort aggregates. Use PRIMARY KEY constraints or UNIQUE indexes where possible, and routinely scan for violations.
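
A minimal sketch of both ideas, assuming orders.order_id should be unique:

-- Routine scan for violations
SELECT order_id, COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Enforce uniqueness going forward (fails if duplicates still exist)
ALTER TABLE orders ADD CONSTRAINT orders_order_id_unique UNIQUE (order_id);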

3.3 Inconsistent Formats

Dates might appear as “2024-01-03”, “01/03/24”, or “3 Jan 2024”. Standardize early:

UPDATE events
SET event_date = TO_DATE(event_date::TEXT, 'YYYY-MM-DD')
WHERE event_date ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
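
The query above only normalizes strings already in ISO form. Formats like “01/03/24” are ambiguous (day-first or month-first?), so converting them requires a documented assumption. A sketch, assuming US-style MM/DD/YY:

UPDATE events
SET event_date = TO_DATE(event_date::TEXT, 'MM/DD/YY')
WHERE event_date ~ '^[0-9]{2}/[0-9]{2}/[0-9]{2}$';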

3.4 Outliers & Impossible Values

Negative ages or 10,000-hour session lengths often point to tracking bugs.
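
A quick way to surface them is a bounds check. A sketch, assuming a hypothetical sessions table with a duration_hours column:

-- Sessions longer than a day (or negative) are almost certainly tracking bugs
SELECT session_id, duration_hours
FROM sessions
WHERE duration_hours < 0 OR duration_hours > 24;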

4. Hands-On Example: Cleaning an E-commerce Dataset

Download ecommerce_dirty.csv (≈5 MB) or import it directly into your database and follow along.

Step 1 — Profile

SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS missing_id,
  SUM(CASE WHEN total_amount < 0 THEN 1 ELSE 0 END) AS negative_amount
FROM staging.orders;

Step 2 — Remove Duplicates

CREATE TABLE cleaned.orders AS
SELECT * FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM staging.orders
) t
WHERE rn = 1;

-- Drop the helper column so it doesn't linger in the cleaned table
ALTER TABLE cleaned.orders DROP COLUMN rn;

Step 3 — Fix Data Types

ALTER TABLE cleaned.orders
  ALTER COLUMN order_date TYPE DATE USING order_date::DATE,
  ALTER COLUMN total_amount TYPE NUMERIC(12,2) USING total_amount::NUMERIC(12,2);

Step 4 — Handle Nulls

UPDATE cleaned.orders
SET shipping_state = 'Unknown'
WHERE shipping_state IS NULL;

Step 5 — Validate

SELECT COUNT(*) AS invalid_rows
FROM cleaned.orders
WHERE total_amount < 0 OR order_id IS NULL;

If invalid_rows = 0, congratulate yourself—you just cleaned a dataset! Save the final query in a Galaxy Collection named “🧹 Cleaned Orders”.

5. Best Practices & Tips

  • Automate where possible: Use dbt tests, scheduled SQL jobs, or CI pipelines to rerun cleaning steps on new data.
  • Keep raw data immutable: Always write cleaned records to a separate schema/table.
  • Log assumptions: Document why you chose to drop vs. impute certain records.
  • Version your queries: Galaxy stores version history so you can revert or compare changes.
  • Leverage constraints: PRIMARY KEY, FOREIGN KEY, and CHECK constraints act as first-line defenses (see the sketch below).
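
For the last point, a minimal sketch in Postgres syntax:

-- Reject bad rows at write time instead of cleaning them up later
ALTER TABLE cleaned.orders
  ADD CONSTRAINT orders_amount_nonnegative CHECK (total_amount >= 0);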

6. How Galaxy Supercharges Data Cleaning

Although data cleaning is conceptually tool-agnostic, the right environment can shave hours off your workflow:

  1. AI Copilot. Ask “Remove duplicates by order_id in staging.orders” and receive an optimized SQL snippet instantly.
  2. Collections & Endorsements. Curate reusable cleaning templates your entire team can trust.
  3. Schema-aware autocomplete. Avoid typos that introduce new errors.
  4. Version & Govern. Track every edit to ensure that cleaning logic remains transparent and auditable.

7. Key Takeaways

  • Data cleaning transforms raw, messy data into reliable assets—its absence can invalidate entire projects.
  • Follow a disciplined loop: inspection → transformation → validation → documentation.
  • Common issues include missing values, duplicates, inconsistent formats, and outliers.
  • SQL remains a powerful, declarative language for cleaning—especially when paired with Galaxy’s modern editor and AI features.
  • Document and automate your process to maintain quality as data scales.

Next Steps

  1. Import your own dataset into Galaxy and run the inspection queries.
  2. Create a Collection called “Data-Cleaning Templates”.
  3. Use AI Copilot to refactor one of your existing cleaning scripts.
  4. Endorse validated queries so teammates can reuse them.
  5. Explore dbt or Airflow to schedule recurring cleaning jobs.
