Imputing missing values in SQL involves replacing NULLs with statistically or contextually appropriate substitutes to preserve data integrity and analytical quality.
Learn why NULL handling matters, how to choose the right replacement strategy, and the SQL patterns, functions, and workflows that keep your data sets healthy.
Relational databases represent unknown or unavailable information with NULL. Unlike zero or an empty string, NULL means "value does not exist." NULLs propagate through calculations and are silently skipped by most aggregate functions unless explicitly addressed. Because many analytical algorithms and BI tools assume complete data, unresolved NULLs can distort results, produce errors, or hide critical patterns.
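A quick illustration of that behavior (PostgreSQL syntax; the inline VALUES list is a stand-in for a real table):

```sql
-- NULL propagates through expressions and comparisons,
-- while aggregates simply skip it
SELECT 100 + NULL   AS broken_total,  -- NULL, not 100
       (NULL = NULL) AS null_equal,   -- NULL, not TRUE
       AVG(x)        AS avg_x         -- 2: averages only the non-NULL rows
FROM (VALUES (1), (NULL), (3)) AS t(x);
```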
There are two primary ways to deal with missing data: delete the affected rows or impute new values. Deleting works only when the proportion of missing data is negligible and random. In most production systems, however, removing rows lowers statistical power and may bias results. Imputation—replacing NULLs with calculated substitutes—preserves row count and often improves model performance and user experience.
There are two broad families of strategies. Deterministic approaches use constants or easily derived values, such as a fixed default or a rule based on other columns. Statistical approaches rely on summary metrics of existing data, such as the mean, median, or mode.
SELECT customer_id,
COALESCE(phone_number, 'Unknown') AS phone_number
FROM customers;
COALESCE returns the first non-NULL argument. It is ANSI SQL and optimized in all major engines.
SELECT order_id,
CASE WHEN discount IS NULL AND total > 100 THEN 0.05
WHEN discount IS NULL THEN 0.00
ELSE discount END AS discount
FROM orders;
Use CASE when you need rule-based imputations that depend on other columns.
SELECT customer_id,
purchase_date,
COALESCE(spend,
AVG(spend) OVER (PARTITION BY customer_segment)) AS spend
FROM transactions;
The windowed average is calculated per segment, delivering context-aware replacements without extra joins.
WITH segment_means AS (
SELECT customer_segment,
AVG(spend) AS segment_avg
FROM transactions
WHERE spend IS NOT NULL
GROUP BY customer_segment
)
SELECT t.order_id,
COALESCE(t.spend, s.segment_avg) AS spend
FROM transactions t
LEFT JOIN segment_means s
ON t.customer_segment = s.customer_segment;
CTEs keep logic modular, version-controlled, and readable—critical when multiple teams collaborate in Galaxy Collections.
Use COUNT(*) and COUNT(column) to measure NULL density.

Suppose you have a sensor_readings table with intermittent NULL temperatures. Engineers decided to replace NULLs with the previous non-NULL value (last observation carried forward) when readings fall within the same day.
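Before filling, it is worth quantifying how much is missing. COUNT(column) skips NULLs, so its difference from COUNT(*) is the NULL count:

```sql
-- Profile NULL density in the temperature column
SELECT COUNT(*)                      AS total_readings,
       COUNT(temperature)            AS present,
       COUNT(*) - COUNT(temperature) AS missing,
       ROUND(100.0 * (COUNT(*) - COUNT(temperature)) / COUNT(*), 2)
         AS pct_missing
FROM sensor_readings;
```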
SELECT reading_id,
COALESCE(temperature,
LAST_VALUE(temperature IGNORE NULLS)
OVER (PARTITION BY sensor_id, date_trunc('day', reading_time)
ORDER BY reading_time
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING))
AS temperature_filled
FROM sensor_readings;
Not all databases support IGNORE NULLS; in PostgreSQL, you can emulate it with MAX() and cumulative windows.
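One sketch of that emulation, assuming the same sensor_readings schema as above: a running COUNT(temperature) stays flat across NULL rows, so each NULL lands in the same group as the last non-NULL reading, and MAX() within that group recovers the carried-forward value.

```sql
SELECT reading_id,
       COALESCE(temperature,
                MAX(temperature) OVER (PARTITION BY sensor_id, day, grp))
         AS temperature_filled
FROM (
  SELECT *,
         date_trunc('day', reading_time) AS day,
         -- increments only on non-NULL rows, grouping each NULL
         -- with the most recent non-NULL reading
         COUNT(temperature) OVER (PARTITION BY sensor_id,
                                               date_trunc('day', reading_time)
                                  ORDER BY reading_time) AS grp
  FROM sensor_readings
) s;
```

Leading NULLs at the start of a day have no prior value in their group and remain NULL, matching the IGNORE NULLS version.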
Galaxy’s AI copilot understands context. Type “fill NULL prices with median per product line”, and the copilot autogenerates a window-function query. Afterwards, you can store it in a Collection named imputation_patterns
and let teammates endorse it. Version history ensures every refinement is auditable, eliminating copy-pasted SQL in Slack.
Never run UPDATE statements on the primary table unless mandated by governance. Instead, create views or materialized tables so you can revert if needed.
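For example, a non-destructive cleaned view (the view name is illustrative):

```sql
-- The base table keeps its raw NULLs; consumers read the view,
-- and dropping the view reverts the imputation instantly
CREATE VIEW customers_clean AS
SELECT customer_id,
       COALESCE(phone_number, 'Unknown') AS phone_number
FROM customers;
```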
Replacing outlier-heavy columns with a mean can skew analytics. Use the median or trimmed mean for skewed distributions.
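A median-based variant of the earlier segment-mean CTE might look like this (PERCENTILE_CONT is ANSI SQL but not available in every engine, e.g. MySQL lacks it):

```sql
WITH segment_medians AS (
  SELECT customer_segment,
         -- median is robust to outliers, unlike AVG
         PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY spend) AS segment_median
  FROM transactions
  WHERE spend IS NOT NULL
  GROUP BY customer_segment
)
SELECT t.order_id,
       COALESCE(t.spend, m.segment_median) AS spend
FROM transactions t
LEFT JOIN segment_medians m
  ON t.customer_segment = m.customer_segment;
```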
Imputing 0 across numeric columns may conflate NULL with legitimate zeros. Track an additional Boolean flag or use distinct sentinel values.
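One way to keep that distinction (column names are illustrative):

```sql
SELECT order_id,
       COALESCE(quantity, 0) AS quantity,
       (quantity IS NULL)    AS quantity_was_imputed  -- preserves the NULL signal
FROM orders;
```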
Profile columns with summary statistics (AVG, STDDEV, PERCENTILE) before committing to an imputation strategy.

SQL offers powerful, composable tools to replace missing data without exporting to Python or R. By mastering COALESCE, CASE, window functions, and CTEs, and by operationalizing them in a developer-friendly SQL editor like Galaxy, you can ensure your analyses remain reliable and reproducible.
Missing values can break joins, bias aggregations, and crash machine-learning pipelines. Addressing them directly in SQL lets data engineers maintain data quality closest to the source, reducing data movement, duplicate logic, and downstream re-work.
Deletion reduces sample size and can introduce bias when data are not missing completely at random (MCAR). Imputation retains more information and often improves statistical power.
If the column is normally distributed without outliers, the mean is fine. For skewed data or small sample sizes, the median is more robust and less sensitive to extreme values.
Galaxy’s AI copilot can generate context-aware imputation SQL, while Collections let teams store and endorse standardized patterns. The desktop IDE ensures fast iteration and version control.
Performance differences are negligible. Choose COALESCE for simplicity when you need the first non-NULL value; use CASE when logic depends on multiple conditions.