Normalizing data in BigQuery scales numeric columns to a common range or distribution, making analyses and ML models more reliable.
Normalization removes scale bias, speeds model convergence, and improves comparability across metrics like price and total_amount. It is essential before feeding data into clustering, regression, or classification models.
Z-score normalization (standardization): subtract the mean and divide by the standard deviation, z = (x - mean) / stddev. Resulting values have mean 0 and standard deviation 1; for example, a price of 120 against a mean of 100 and a standard deviation of 10 scales to 2.0.
Min-max scaling: rescales values to 0–1 using the column’s minimum and maximum, x_norm = (x - min) / (max - min). Useful for bounded algorithms and dashboard metrics; a total of 75 in a column spanning 50 to 150 scales to 0.25.
BigQuery ML automates scaling with preprocessing functions like ML.STANDARD_SCALER and ML.MIN_MAX_SCALER, reducing manual SQL.
Use the AVG() and STDDEV() window functions to compute Z-scores in a single pass, avoiding self-joins.
SELECT
  id,
  price,
  -- NULLIF guards against division by zero when every price is identical
  (price - AVG(price) OVER()) / NULLIF(STDDEV(price) OVER(), 0) AS price_z
FROM Products;
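If prices should be scaled within each product group rather than globally, the same query works with a PARTITION BY. A minimal sketch, assuming a hypothetical category column on Products:

SELECT
  id,
  price,
  -- stats are computed per category, so each group gets its own scale
  (price - AVG(price) OVER(PARTITION BY category))
    / NULLIF(STDDEV(price) OVER(PARTITION BY category), 0) AS price_z_by_category
FROM Products;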
Calculate the global min and max once, then transform each row. The result fits neatly between 0 and 1.
WITH stats AS (
  SELECT MIN(total_amount) AS mn, MAX(total_amount) AS mx FROM Orders
)
SELECT
  id,
  -- NULLIF guards against division by zero when all amounts are equal
  (total_amount - mn) / NULLIF(mx - mn, 0) AS total_amount_norm
FROM Orders
CROSS JOIN stats;
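At scoring time, new rows can fall outside the training min and max. One common choice, shown here as a hedged sketch that keeps the stats frozen on Orders and assumes a hypothetical new_orders scoring table, is to clamp the result into [0, 1] with LEAST and GREATEST:

WITH stats AS (
  SELECT MIN(total_amount) AS mn, MAX(total_amount) AS mx FROM Orders
)
SELECT
  id,
  -- clamp out-of-range values so new data stays within [0, 1]
  LEAST(1, GREATEST(0, (total_amount - mn) / NULLIF(mx - mn, 0))) AS total_amount_norm
FROM new_orders  -- hypothetical table of incoming rows
CROSS JOIN stats;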
Call the preprocessing function directly in a query with an OVER() clause. BigQuery computes the statistics over the input, and the result can be persisted as a normalized table you can reuse in downstream models.
CREATE OR REPLACE TABLE normalized_orders AS
SELECT
  order_date,
  -- ML.STANDARD_SCALER computes the mean and stddev over the whole input
  ML.STANDARD_SCALER(total_amount) OVER() AS total_amount_scaled
FROM Orders;
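For model training, the scaler can also live inside the CREATE MODEL statement's TRANSFORM clause, so the same scaling is stored with the model and reapplied automatically at prediction time. A minimal sketch, assuming a hypothetical label column and model name:

CREATE OR REPLACE MODEL `project.dataset.amount_model`
  TRANSFORM(
    -- the scaling is saved with the model and reused by ML.PREDICT
    ML.STANDARD_SCALER(total_amount) OVER() AS total_amount_scaled,
    label
  )
  OPTIONS(model_type = 'linear_reg', input_label_cols = ['label'])
AS
SELECT total_amount, label FROM Orders;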
Persist stats in a helper table for reproducibility, use NULLIF() to avoid division by zero, and document the chosen scaler so downstream users can reverse the transform.
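As a sketch of the helper-table approach, with a hypothetical order_stats table, the stats are computed once from the full training set and then joined in whenever data needs scoring:

-- compute stats once from the training data and keep them fixed
CREATE OR REPLACE TABLE order_stats AS
SELECT AVG(total_amount) AS mean_amt, STDDEV(total_amount) AS stddev_amt
FROM Orders;

-- reuse the frozen stats for consistent scaling at scoring time
SELECT
  o.id,
  (o.total_amount - s.mean_amt) / NULLIF(s.stddev_amt, 0) AS total_amount_z
FROM Orders AS o
CROSS JOIN order_stats AS s;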
Skipping NULL handling or computing stats on filtered subsets leads to inconsistent scales. Always derive stats from the full training set and keep them fixed for scoring.
No single numeric normalization function exists (the built-in NORMALIZE() handles Unicode text, not numbers), but BigQuery ML offers ML.STANDARD_SCALER and ML.MIN_MAX_SCALER. Otherwise, use window functions or CTEs.
Window functions scan the data once, so performance is usually acceptable. Persisting results in a table avoids repeated computation.
Store the original mean/stddev or min/max values, then invert the transform: original = scaled * stddev + mean for Z-score, or original = scaled * (max - min) + min for min-max.
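A hedged sketch of the Z-score inversion, assuming the order_stats helper table from above holds the frozen statistics and a hypothetical scored_orders table holds the scaled values:

SELECT
  n.id,
  -- invert the Z-score: original = scaled * stddev + mean
  n.total_amount_z * s.stddev_amt + s.mean_amt AS total_amount_original
FROM scored_orders AS n  -- hypothetical table of scaled values
CROSS JOIN order_stats AS s;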