SMOTE (Synthetic Minority Over-sampling Technique) is an algorithm that creates synthetic examples of the minority class to balance imbalanced datasets for machine-learning models.
Class imbalance occurs when the target variable in a classification task contains a significantly higher number of observations in one class than in the others. For example, a fraud-detection dataset might contain 0.3 % fraudulent transactions and 99.7 % legitimate ones. Traditional machine-learning algorithms assume roughly equal class distributions and tend to be biased toward the majority class, leading to poor recall for the minority class.
SMOTE is a resampling method that synthetically generates new minority-class samples instead of simply duplicating existing ones. By interpolating between a minority sample and its k nearest minority-class neighbors, it produces realistic but slightly varied data points that help the learner characterize the minority decision region.
1. Select a minority sample.
2. Find its k nearest minority neighbors (often k=5).
3. Randomly choose one neighbor.
4. Synthesize a new sample by linear interpolation along the line segment joining the two samples (see the sketch after these steps).
5. Repeat until the desired class balance is achieved.
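Concretely, step 4 computes x_new = x + lam * (neighbor - x), with lam drawn uniformly from [0, 1]. Below is a minimal NumPy sketch of a single iteration; the function name smote_sample is ours for illustration, and the production implementation in imbalanced-learn is considerably more thorough:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, seed=0):
    # Illustrative sketch: generate one synthetic point from minority data X_min
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own nearest neighbor
    i = rng.integers(len(X_min))            # step 1: pick a minority sample
    _, idx = nn.kneighbors(X_min[i:i + 1])  # step 2: find its k nearest minority neighbors
    j = rng.choice(idx[0][1:])              # step 3: choose one neighbor at random
    lam = rng.random()                      # step 4: interpolate between the two
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority points
print(smote_sample(X_min))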
Several variants refine the basic algorithm; all of them ship with imbalanced-learn, as sketched below:
Borderline-SMOTE: focuses on minority samples near the decision boundary.
SMOTE-Tomek: combines SMOTE with Tomek links to clean overlapping examples.
SMOTE-ENN: applies Edited Nearest Neighbors after SMOTE to remove noisy points.
ADASYN: adaptively generates more synthetic points for minority samples that are harder to learn.
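Because every variant exposes the same fit_resample interface, swapping one in is a one-line change. A quick comparison sketch (the dataset here is synthetic with an arbitrary 5 % minority; with real data, reuse your own X and y):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN

# Small imbalanced toy set, just to compare the resampled distributions
X, y = make_classification(weights=[0.05, 0.95], n_samples=2000, random_state=42)

for sampler in (BorderlineSMOTE(random_state=42), SMOTETomek(random_state=42),
                SMOTEENN(random_state=42), ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))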
Ignoring imbalance leads to misleading metrics. A model that predicts “not fraud” 100 % of the time achieves 99.7 % accuracy but 0 % recall on the minority class. In regulated industries—healthcare, finance, cybersecurity—missing rare but crucial events is unacceptable. Techniques like SMOTE directly address this issue, improving recall and F1 scores without sacrificing too much precision.
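The arithmetic is easy to reproduce with scikit-learn's DummyClassifier, which always predicts the majority class; the 99.7 % / 0.3 % split below mirrors the fraud example above:

from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score

# 0.3% minority class, mirroring the fraud scenario
X, y = make_classification(weights=[0.997], n_samples=10000, flip_y=0, random_state=0)

clf = DummyClassifier(strategy='most_frequent').fit(X, y)  # always predicts "not fraud"
y_pred = clf.predict(X)
print('accuracy:', accuracy_score(y, y_pred))                    # 0.997
print('minority recall:', recall_score(y, y_pred, pos_label=1))  # 0.0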
Install the imbalanced-learn library, which sits on top of scikit-learn:
pip install imbalanced-learn scikit-learn
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic binary dataset: class 0 is the 5% minority
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.05, 0.95], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=5000,
                           random_state=42)
print('Original class distribution:', Counter(y))

# Oversample the minority class until the classes are balanced 1:1
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print('Resampled class distribution:', Counter(y_res))
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Hold out a stratified test set BEFORE any resampling
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

# imblearn's Pipeline applies SMOTE only to the training folds during CV
pipe = Pipeline(steps=[('smote', SMOTE(random_state=42)),
                       ('clf', RandomForestClassifier(random_state=42))])

param_grid = {
    'clf__n_estimators': [200, 400],
    'clf__max_depth': [None, 20],
}

# f1_macro averages F1 over both classes; plain 'f1' would score only
# class 1, which is the majority in this synthetic dataset
model = GridSearchCV(pipe, param_grid=param_grid, scoring='f1_macro', cv=3)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Always compare metrics before and after resampling. Focus on recall, precision-recall AUC, and the F1 score rather than accuracy alone. Cross-validate within a pipeline so that SMOTE is applied only to training folds, avoiding data leakage.
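For the precision-recall AUC, scikit-learn's average_precision_score operates directly on predicted probabilities. A sketch reusing the fitted grid search and held-out split from the example above (pos_label=0 because class 0 is the minority in that synthetic dataset):

from sklearn.metrics import average_precision_score

# Probability assigned to the minority class (class 0 here)
proba_minority = model.predict_proba(X_test)[:, 0]
print('PR AUC (average precision):',
      average_precision_score(y_test, proba_minority, pos_label=0))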
1. Resample after train-test split to keep the test set representative.
2. Use pipelines to avoid leakage during cross-validation.
3. Try multiple ratios; sometimes a slight imbalance gives better precision.
4. Couple SMOTE with under-sampling of the majority class to control dataset size (see the sketch after this list).
5. Monitor overfitting; synthetic samples can make minority decision regions unrealistically smooth.
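Items 3 and 4 are both controlled through the sampling_strategy parameter. Here is a sketch of a combined over- and under-sampling pipeline; the 0.3 and 0.5 ratios are illustrative, not tuned:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(weights=[0.05, 0.95], n_samples=5000, random_state=42)

# Oversample the minority to 30% of the majority, then trim the majority
# until the minority sits at 50% of it
resample = Pipeline(steps=[
    ('smote', SMOTE(sampling_strategy=0.3, random_state=42)),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_res, y_res = resample.fit_resample(X, y)
print(Counter(y_res))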
Applying SMOTE before the train-test split leaks synthetic minority information into the test set, inflating performance metrics.
Generating a perfectly balanced dataset may hurt precision; experiment with the sampling_strategy parameter rather than defaulting to a 1:1 ratio.
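For binary problems, a float sampling_strategy is interpreted as the desired minority-to-majority ratio after resampling, which makes ratio sweeps straightforward:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(weights=[0.1, 0.9], n_samples=1000, random_state=0)
for ratio in (0.5, 0.75, 1.0):  # minority / majority after resampling
    _, y_res = SMOTE(sampling_strategy=ratio, random_state=0).fit_resample(X, y)
    print(ratio, Counter(y_res))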
Synthesizing samples in noisy regions amplifies errors. Consider Borderline-SMOTE or SMOTE-ENN.
In production at many fintech companies, SMOTE (or its variants) is applied within model-training pipelines to ensure that rare fraud patterns are captured. Teams monitor the precision-recall curve in live A/B tests to confirm that the uplift in recall justifies any increase in false positives.
If your minority class has categorical, non-ordinal features with few levels, SMOTE may create implausible samples by interpolating between category codes. In these cases, try SMOTENC (for mixed continuous and categorical features) or SMOTEN (for purely categorical data).
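A brief SMOTENC sketch on toy mixed-type data (the column layout and counts are made up for illustration; categorical_features tells the sampler which columns to treat as categories so they are never interpolated):

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(42)
n = 600
X = np.column_stack([
    rng.integers(0, 3, n),    # column 0: categorical with 3 levels
    rng.normal(size=(n, 2)),  # columns 1-2: continuous
])
y = np.r_[np.zeros(60), np.ones(n - 60)].astype(int)  # 10% minority (class 0)

# SMOTENC copies the most frequent category among the neighbors instead of
# interpolating column 0; continuous columns are interpolated as usual
X_res, y_res = SMOTENC(categorical_features=[0], random_state=42).fit_resample(X, y)
print(Counter(y_res))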
SMOTE is a powerful, easy-to-implement technique that often yields dramatic improvements in recall for imbalanced datasets. When combined with proper cross-validation and evaluation metrics, it helps data engineers and data scientists build more robust, fair, and production-ready models.
Class-imbalanced datasets are common in fraud, medical diagnosis, and anomaly detection. Failing to address imbalance skews model training, resulting in high accuracy but poor minority-class recall. SMOTE offers a controlled, reproducible way to amplify minority patterns, improving model fairness and business value while integrating seamlessly with the scikit-learn ecosystem used across data engineering and analytics workflows.
Yes. The implementation in imbalanced-learn supports multi-class resampling by generating synthetic points for each minority class relative to the majority class.
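A small multi-class sketch (the 70/20/10 split is arbitrary); by default SMOTE oversamples every class except the majority:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Three classes with a 70/20/10 split
X, y = make_classification(n_classes=3, n_informative=4,
                           weights=[0.7, 0.2, 0.1],
                           n_samples=3000, random_state=42)
print('before:', Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print('after: ', Counter(y_res))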
Class-weighting modifies the loss function, while SMOTE alters the data distribution. In practice, using both together—SMOTE inside a pipeline and class-weighted algorithms—often yields the best balance between recall and precision.
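A sketch combining the two; class_weight='balanced' reweights the loss while SMOTE reshapes the training data, and whether the combination helps is dataset-dependent:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(weights=[0.05, 0.95], n_samples=5000, random_state=42)

# The imblearn Pipeline applies SMOTE only inside each training fold
pipe = Pipeline(steps=[
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(class_weight='balanced', random_state=42)),
])
print(cross_val_score(pipe, X, y, scoring='f1_macro', cv=3))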
Standard SMOTE operates in continuous feature space. For categorical or mixed data, use SMOTENC in imbalanced-learn, which handles categorical feature indices explicitly.
It can, especially with high sampling rates or noisy data. Mitigate by combining SMOTE with under-sampling, using cross-validated pipelines, and monitoring validation-set metrics closely.