Normalizing data in BigQuery scales numeric columns to a common range or distribution, making analyses and ML models more reliable.
Normalization removes scale bias, speeds model convergence, and improves comparability across metrics like price and total_amount. It is essential before feeding data into clustering, regression, or classification models.
Z-score normalization (standardization): subtract the mean and divide by the standard deviation, z = (x - mean) / stddev. Resulting values have mean 0 and standard deviation 1; for example, a price of 120 against a mean of 100 and a standard deviation of 10 scales to 2.0.
Min-max scaling: rescales values to 0–1 using the column’s minimum and maximum, x_norm = (x - min) / (max - min). Useful for bounded algorithms and dashboard metrics; a total of 75 in a column spanning 50 to 150 scales to 0.25.
BigQuery ML automates scaling with preprocessing functions like ML.STANDARD_SCALER and ML.MIN_MAX_SCALER, reducing manual SQL.
Use the AVG() and STDDEV() window functions to compute Z-scores in a single pass, avoiding self-joins.
SELECT
  id,
  price,
  -- NULLIF guards against division by zero when every price is identical
  (price - AVG(price) OVER()) / NULLIF(STDDEV(price) OVER(), 0) AS price_z
FROM Products;
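If prices should be scaled within each product group rather than globally, the same query works with a PARTITION BY. A minimal sketch, assuming a hypothetical category column on Products:

SELECT
  id,
  price,
  -- stats are computed per category, so each group gets its own scale
  (price - AVG(price) OVER(PARTITION BY category))
    / NULLIF(STDDEV(price) OVER(PARTITION BY category), 0) AS price_z_by_category
FROM Products;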
Calculate the global min and max once, then transform each row. The result fits neatly between 0 and 1.
WITH stats AS (
  SELECT MIN(total_amount) AS mn, MAX(total_amount) AS mx FROM Orders
)
SELECT
  id,
  -- NULLIF guards against division by zero when all amounts are equal
  (total_amount - mn) / NULLIF(mx - mn, 0) AS total_amount_norm
FROM Orders
CROSS JOIN stats;
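At scoring time, new rows can fall outside the training min and max. One common choice, shown here as a hedged sketch that keeps the stats frozen on Orders and assumes a hypothetical new_orders scoring table, is to clamp the result into [0, 1] with LEAST and GREATEST:

WITH stats AS (
  SELECT MIN(total_amount) AS mn, MAX(total_amount) AS mx FROM Orders
)
SELECT
  id,
  -- clamp out-of-range values so new data stays within [0, 1]
  LEAST(1, GREATEST(0, (total_amount - mn) / NULLIF(mx - mn, 0))) AS total_amount_norm
FROM new_orders  -- hypothetical table of incoming rows
CROSS JOIN stats;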
Call the preprocessing function directly in a query with an OVER() clause. BigQuery computes the statistics over the input, and the result can be persisted as a normalized table you can reuse in downstream models.
CREATE OR REPLACE TABLE normalized_orders AS
SELECT
  order_date,
  -- ML.STANDARD_SCALER computes the mean and stddev over the whole input
  ML.STANDARD_SCALER(total_amount) OVER() AS total_amount_scaled
FROM Orders;
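For model training, the scaler can also live inside the CREATE MODEL statement's TRANSFORM clause, so the same scaling is stored with the model and reapplied automatically at prediction time. A minimal sketch, assuming a hypothetical label column and model name:

CREATE OR REPLACE MODEL `project.dataset.amount_model`
  TRANSFORM(
    -- the scaling is saved with the model and reused by ML.PREDICT
    ML.STANDARD_SCALER(total_amount) OVER() AS total_amount_scaled,
    label
  )
  OPTIONS(model_type = 'linear_reg', input_label_cols = ['label'])
AS
SELECT total_amount, label FROM Orders;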
Persist stats in a helper table for reproducibility, use NULLIF() to avoid division by zero, and document the chosen scaler so downstream users can reverse the transform.
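As a sketch of the helper-table approach, with a hypothetical order_stats table, the stats are computed once from the full training set and then joined in whenever data needs scoring:

-- compute stats once from the training data and keep them fixed
CREATE OR REPLACE TABLE order_stats AS
SELECT AVG(total_amount) AS mean_amt, STDDEV(total_amount) AS stddev_amt
FROM Orders;

-- reuse the frozen stats for consistent scaling at scoring time
SELECT
  o.id,
  (o.total_amount - s.mean_amt) / NULLIF(s.stddev_amt, 0) AS total_amount_z
FROM Orders AS o
CROSS JOIN order_stats AS s;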
Skipping NULL handling or computing stats on filtered subsets leads to inconsistent scales. Always derive stats from the full training set and keep them fixed for scoring.
No single numeric normalization function exists (the built-in NORMALIZE() handles Unicode text, not numbers), but BigQuery ML offers ML.STANDARD_SCALER and ML.MIN_MAX_SCALER. Otherwise, use window functions or CTEs.
Window functions scan the data once, so performance is usually acceptable. Persisting results in a table avoids repeated computation.
Store the original mean/stddev or min/max values, then invert the transform: original = scaled * stddev + mean for Z-score, or original = scaled * (max - min) + min for min-max.
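A hedged sketch of the Z-score inversion, assuming the order_stats helper table from above holds the frozen statistics and a hypothetical scored_orders table holds the scaled values:

SELECT
  n.id,
  -- invert the Z-score: original = scaled * stddev + mean
  n.total_amount_z * s.stddev_amt + s.mean_amt AS total_amount_original
FROM scored_orders AS n  -- hypothetical table of scaled values
CROSS JOIN order_stats AS s;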