This resource introduces predictive analytics from first principles. You’ll learn how historical data is transformed into models that forecast future outcomes, explore common algorithms, walk through end-to-end examples in SQL and Python, and discover how Galaxy can accelerate your experimentation and collaboration.
Companies sit on mountains of historical data—transactions, user events, sensor readings, support tickets, and more. Predictive analytics turns that data into foresight by training statistical or machine-learning models that estimate what is likely to happen next. Whether it’s predicting customer churn, estimating demand, or flagging fraudulent payments, the goal is actionable probability: a quantitative signal that drives proactive decisions.
Predictive analytics is the bridge between descriptive hindsight and prescriptive action.
Key terms:

- Features: Input variables (columns in your dataset) that the model uses to learn patterns.
- Target (Label): The outcome you want to predict (e.g., churn = 1).
- Training vs. Test Data: Training data teaches the model; test data evaluates its generalization.
- Model: A mathematical function mapping features to predicted targets.
- Evaluation Metric: A quantitative score (AUC, RMSE, etc.) indicating predictive quality.
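To make these terms concrete, here is a minimal sketch (the column names and values are illustrative, not from a real dataset) that separates features from a target and holds out a test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: two features and a binary target (illustrative values)
df = pd.DataFrame({
    'tenure': [1, 24, 5, 36, 2, 48],
    'monthly_charges': [70, 30, 85, 25, 90, 20],
    'churn': [1, 0, 1, 0, 1, 0],
})

X = df[['tenure', 'monthly_charges']]  # features
y = df['churn']                        # target (label)

# Hold out roughly a third of the rows to measure generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
print(len(X_train), len(X_test))  # 4 2
```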
- Logistic Regression: Binary classification; outputs a probability between 0 and 1. Great for churn or fraud flags.
- Decision Trees and Random Forests: Non-linear, handle categorical variables well, robust to outliers.
- Time-Series Models (e.g., ARIMA): Specialized models for temporal data like revenue or sensor readings.
- Gradient Boosting (e.g., XGBoost): Ensemble technique that often tops Kaggle leaderboards; good balance of accuracy and speed.
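To compare two of these model families on the same data, here is a hedged sketch that substitutes synthetic data for a real churn table (scikit-learn's built-in GradientBoostingClassifier stands in for XGBoost):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic binary-classification data in place of a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

aucs = {}
for model in (LogisticRegression(max_iter=1000),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_tr, y_tr)
    # AUC on the held-out set; higher is better, 0.5 is random guessing
    aucs[type(model).__name__] = roc_auc_score(
        y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

On most datasets the relative ranking of these two models is not fixed; trying both is cheap and informative.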
Estimated time: 15 minutes
# 1. Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
# 2. Load dataset (CSV or Galaxy-exported query)
churn = pd.read_csv('telecom_churn.csv')
# 3. Feature engineering
features = ['tenure', 'monthly_charges', 'contract_type', 'support_tickets']
X = pd.get_dummies(churn[features], drop_first=True)
y = churn['churn']
# 4. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 5. Scale features (dummy-encoded columns included)
scaler = StandardScaler()
X_train[X_train.columns] = scaler.fit_transform(X_train)
X_test[X_test.columns] = scaler.transform(X_test)
# 6. Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# 7. Evaluate
pred_proba = model.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test, pred_proba))
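Beyond a single AUC number, inspecting the fitted coefficients shows which features push the churn probability up or down. A minimal, self-contained sketch on a toy dataset (the values are illustrative, not from telecom_churn.csv):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset in the same spirit as the walkthrough
X = pd.DataFrame({
    'tenure': [1, 24, 5, 36, 2, 48, 3, 30],
    'monthly_charges': [70, 30, 85, 25, 90, 20, 80, 35],
})
y = [1, 0, 1, 0, 1, 0, 1, 0]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Positive coefficients push predictions toward churn = 1,
# negative coefficients toward churn = 0
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

Remember that coefficient magnitudes are only comparable across features when the features are on the same scale, which is one reason the walkthrough applies StandardScaler.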
Exercise: Add internet_service and payment_method as new features. Does the AUC improve?
You can build simple predictive signals directly in SQL, especially for time-series forecasts.
-- Naive forecast: predict daily order counts with a 4-week rolling average
WITH daily_orders AS (
  SELECT order_date::date AS d,
         COUNT(*) AS orders
  FROM ecommerce.orders
  GROUP BY 1
),
roll_avg AS (
  SELECT d,
         AVG(orders) OVER (
           ORDER BY d
           ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
         ) AS avg_4w
  FROM daily_orders
)
SELECT d,
       avg_4w AS predicted_orders
FROM roll_avg
ORDER BY d DESC
LIMIT 14;  -- most recent 14 days of predictions
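The same rolling-average forecast can be reproduced in pandas; a sketch assuming a synthetic daily order-count series in place of ecommerce.orders:

```python
import pandas as pd

# Synthetic daily order counts: a repeating weekly pattern over 42 days
daily = pd.Series(
    [100, 110, 105, 120, 115, 130, 125] * 6,
    index=pd.date_range('2024-01-01', periods=42, freq='D'),
)

# 28-day rolling mean, shifted by one so each day's prediction uses only
# prior days -- mirrors ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
predicted = daily.rolling(window=28).mean().shift(1)
print(predicted.tail(3).round(1))
```

Because the window covers exactly four full weekly cycles, the prediction here settles at the weekly mean; on real data the rolling average adapts as demand shifts.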
Paste this query into Galaxy, save it in a Forecasts Collection, and endorse it so teammates can pull predictions reliably. Galaxy’s AI copilot can suggest window-function syntax and auto-complete your table names from metadata.
From Galaxy, click Run ▸ Export ▸ Notebook to open the result set in a Jupyter notebook for further visualization.
| Mistake | Why It Happens | Fix |
| --- | --- | --- |
| Using future data in training | Temporal joins leak information | Use lag features; split chronologically |
| Overfitting | Model is too complex | Cross-validate; regularize |
| Ignoring class imbalance | Rare events like fraud | Use balanced metrics; oversample the minority class |
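For the first pitfall, scikit-learn's TimeSeriesSplit produces chronological train/test folds so that no future rows leak into training. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations ordered by time (row index stands in for a date)
X = np.arange(10).reshape(-1, 1)

# Each fold trains only on earlier rows and tests on strictly later ones
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()  # chronological by construction
    print('train:', train_idx, 'test:', test_idx)
```

A plain train_test_split with shuffling would scatter future rows into the training set, which is exactly the leakage the table warns about.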