Linear regression in SQL refers to performing ordinary least squares modeling directly inside relational databases using SQL queries and built-in analytical functions.
Linear regression is one of the most widely used statistical techniques for predicting a continuous target from one or more explanatory variables. While data scientists often perform regression in Python or R, modern analytical databases expose functions (such as `LINEAR_REG_SLOPE`, window aggregates, and matrix utilities) that let you fit models in-database with plain SQL. This article explains why, when, and how to run linear regression in SQL, outlines best practices, and shows complete working examples.
Moving large datasets out of your warehouse to a notebook can be slow, costly, and insecure. Executing regression where the data already lives avoids serialization overhead and governance concerns.
Distributed warehouses (Snowflake, BigQuery, Redshift, DuckDB, PostgreSQL with Citus, etc.) parallelize aggregations and matrix math, allowing regressions on billions of rows in seconds.
Storing both feature engineering and model training logic in version-controlled SQL keeps analytics and predictive steps in one place—easy to review, schedule, and re-run.
Several databases expose scalar or aggregate functions that calculate regression coefficients directly:

- BigQuery ML: `ML.EXPLAIN_LINEAR_REG`, `ML.PREDICT`
- Engines with `LINEAR_REG_*` helpers (used in the examples below): `LINEAR_REG_SLOPE`, `LINEAR_REG_INTERCEPT`, `LINEAR_REG_R2`
- PostgreSQL and DuckDB (ANSI SQL aggregates): `regr_slope`, `regr_intercept`
These functions implement Ordinary Least Squares (OLS) under the hood. You supply a target and predictor; the database returns the best-fit coefficients.
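In closed form, these single-predictor aggregates reduce to slope = cov(x, y) / var(x) and intercept = mean(y) − slope · mean(x). A minimal Python sketch (with made-up numbers) of what the database computes under the hood:

```python
# Closed-form simple OLS: what REGR_SLOPE / LINEAR_REG_SLOPE-style
# aggregates compute. Data values are illustrative only.

def ols_simple(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

ads_spend = [100.0, 200.0, 300.0, 400.0]
revenue = [250.0, 450.0, 650.0, 850.0]  # exactly 2*x + 50
slope, intercept = ols_simple(ads_spend, revenue)
print(slope, intercept)  # 2.0 50.0
```

Because the data lies exactly on a line here, the fit recovers the generating coefficients; real data will also produce a residual term.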
If your engine lacks high-level regression helpers, you can compute the normal equations (XᵀX and Xᵀy) with aggregations, then solve for β using matrix inversion utilities. DuckDB's `row_number()` + `array_agg()` combo is a popular pattern.
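As a sketch of this aggregate-then-solve pattern, the following snippet uses SQLite (chosen only because it ships with Python) to gather the sums in SQL, then solves the resulting 2×2 normal equations client-side; the table and values are illustrative:

```python
import sqlite3

# Compute the normal-equation sums in SQL, solve for beta in the client.
# Any engine's SUM/COUNT aggregates work the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ads_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(100, 260), (200, 440), (300, 660), (400, 840)],
)

n, sx, sy, sxy, sxx = conn.execute(
    """SELECT COUNT(*), SUM(ads_spend), SUM(revenue),
              SUM(ads_spend * revenue), SUM(ads_spend * ads_spend)
       FROM sales"""
).fetchone()

# Solve the 2x2 normal equations:
#   [sxx sx] [slope]   [sxy]
#   [sx   n] [b    ] = [sy ]
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

The heavy part (the sums) runs where the data lives; only five scalars leave the database.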
Tools like Vertica's in-database machine learning or PostgreSQL's MADlib ship pre-built procedures (`madlib.linregr_train()`) that accept a table name and output a model artifact.
Assume we have a `sales` table:

```sql
CREATE OR REPLACE TABLE sales (
    day        DATE,
    ads_spend  NUMBER(12,2),
    revenue    NUMBER(12,2)
);
```
Goal: predict `revenue` from `ads_spend`.
```sql
WITH stats AS (
    SELECT
        LINEAR_REG_SLOPE(revenue, ads_spend)     AS slope,
        LINEAR_REG_INTERCEPT(revenue, ads_spend) AS intercept,
        LINEAR_REG_R2(revenue, ads_spend)        AS r_squared
    FROM sales
)
SELECT * FROM stats;
```
The query returns:

- `slope` – how much revenue increases per $1 in ad spend
- `intercept` – baseline revenue with $0 spend
- `r_squared` – goodness of fit (0–1)

To score each row, reuse the coefficients:

```sql
WITH coeffs AS (
    SELECT
        LINEAR_REG_SLOPE(revenue, ads_spend)     AS m,
        LINEAR_REG_INTERCEPT(revenue, ads_spend) AS b
    FROM sales
)
SELECT
    s.*,                                           -- original row
    (c.m * ads_spend + c.b)             AS predicted_revenue,
    (revenue - (c.m * ads_spend + c.b)) AS residual
FROM sales s
CROSS JOIN coeffs c;
```
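One useful sanity check on a pipeline like this: with true OLS coefficients, the residuals always sum to (numerically) zero, because the intercept's normal equation forces it. A small Python sketch with illustrative numbers:

```python
# OLS residuals sum to zero by construction; a quick check of the
# predicted/residual pattern. All values are illustrative.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

xs = [10.0, 20.0, 30.0, 40.0]
ys = [15.0, 22.0, 41.0, 50.0]
m, b = fit(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
print(round(abs(sum(residuals)), 9))  # 0.0
```

If the residual column of the SQL query above does not sum to roughly zero, the coefficients and the scoring query have drifted apart.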
Because everything is a SQL statement, you can chain this logic into dashboards, ETL, or Galaxy Collections for team reuse.
Databases supporting matrix operations allow multiple predictors. For instance, BigQuery ML:
```sql
CREATE OR REPLACE MODEL `my_project.my_dataset.sales_model`
OPTIONS(model_type = 'linear_reg') AS
SELECT revenue AS label,
       ads_spend,
       site_visits,
       promo_discount
FROM `my_project.my_dataset.sales`;
```
Once trained, call `ML.PREDICT` to score new data:

```sql
SELECT * FROM ML.PREDICT(MODEL `my_project.my_dataset.sales_model`,
    (SELECT * FROM `my_project.my_dataset.future_spend`));
```
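Under the hood, a multivariate linear model solves the normal equations XᵀX·β = Xᵀy. A pure-Python sketch (synthetic data; the column names from the example are reused only for illustration) that recovers known coefficients:

```python
# Multivariate OLS via the normal equations, solved with Gauss-Jordan
# elimination. Synthetic rows satisfy revenue = 50 + 2*ads_spend +
# 3*site_visits exactly, so the recovered coefficients are known.

def solve(A, b):
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

rows = [(100, 10), (200, 30), (300, 20), (400, 50), (150, 40)]
X = [[1.0, a, v] for a, v in rows]        # intercept column + features
y = [50 + 2 * a + 3 * v for a, v in rows]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
beta = solve(XtX, Xty)
print([round(c, 6) for c in beta])
```

The printed vector is the intercept followed by one coefficient per feature, matching the order of columns in X.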
Best practices:

- Use `RANDOM()` or `TABLESAMPLE` to create hold-out sets and avoid overfitting.
- Persist coefficients to a `models.linear_regression` table for lineage and auditing.
- Retrain on a recent window of data (e.g., `WHERE date >= CURRENT_DATE - INTERVAL '90' DAY`).

Galaxy's context-aware AI copilot auto-completes analytical functions like `LINEAR_REG_SLOPE` and can refactor queries when your column names change. Store your regression script in a Galaxy Collection, endorse it, and teammates can reuse the vetted model logic without copy-pasting SQL in Slack.
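The hold-out-set practice mentioned above can be sketched as follows; SQLite stands in for a warehouse engine, and the split tag uses `RANDOM()` (warehouses would often use `TABLESAMPLE` instead):

```python
import sqlite3

# Tag roughly 80% of rows as training data with RANDOM(); the rest form
# the hold-out set. Table name and split ratio are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ads_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(float(i), 2.0 * i + 50) for i in range(100)])

conn.execute("""CREATE TABLE sales_split AS
                SELECT *, (abs(RANDOM()) % 10) < 8 AS is_train
                FROM sales""")

train_n, test_n = conn.execute(
    "SELECT SUM(is_train), SUM(1 - is_train) FROM sales_split"
).fetchone()
print(train_n, test_n, train_n + test_n)  # third value is always 100
```

Fit the coefficients on `is_train = 1` rows only, then compare predicted against actual revenue on the hold-out rows to estimate generalization error.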
If you need modeling capabilities beyond what your engine's OLS utilities offer, export a sample to Python/R. But start with SQL for transparency and speed.
Yes—you can (and often should) run linear regression directly in SQL. With built-in analytical functions, you avoid data egress, exploit the power of MPP engines, and maintain auditable pipelines. Whether you are fitting a single-feature trend line or a multivariate sales forecast, today’s SQL engines—and developer-focused editors like Galaxy—make in-database modeling both practical and performant.
Data engineers increasingly own machine-learning pipelines. Keeping simple predictive models inside the data warehouse reduces infrastructure complexity, speeds iteration, and preserves data governance. SQL-native regression empowers engineers to deliver quick business insights without switching tools or languages.
**Do SQL regression results match Python or R?** Yes. Both rely on ordinary least squares. Differences stem from numerical precision, but results match to at least 6–8 decimal places on real-world datasets.

**Which databases support in-database regression?** Snowflake, BigQuery, Redshift, Vertica, DuckDB, Trino, and Postgres extensions like MADlib provide OLS utilities.

**How does Galaxy help?** Galaxy's AI copilot autocompletes regression functions, suggests feature columns, and lets teams endorse the final query in a shared Collection for easy reuse.

**Can I schedule regressions to run automatically?** Most warehouses have task schedulers (e.g., Snowflake Tasks). Wrap your SQL in a stored procedure or script and trigger it nightly; Galaxy's workflow roadmap will streamline this further.