How to Train a Classifier on Heavily Imbalanced Data

May 31, 2028 · 16 min read

The situation

A payments team has trained an XGBoost classifier on 8 million transactions from the past year. Each row has 40 features; the target is a binary “was this transaction later charged back as fraud”. The training set contains ~32,000 fraud cases and ~7.97 million legitimate cases, a 0.4% positive rate.

The trained model:

Reports 99.6% accuracy on the test set.
Catches 2% of actual fraud (recall on the positive class ≈ 0.02).
When it does predict fraud, it’s correct 88% of the time (precision ≈ 0.88).

The business reality:

Each missed fraud costs ~$140 on average (the chargeback plus operational cost).
Each false positive costs ~$8 (a legitimate transaction blocked, customer annoyed, support call).
The bank wants to catch at least 60% of fraud. The current 2% is not shippable.

The team’s first instinct is “train longer”, “try a neural network”, “add more features”. None of these will help much. The problem isn’t the model; it’s that the training signal for the minority class is too small for any algorithm to learn well. Imbalanced classification has its own toolkit, and picking from it is the work.

What actually matters

Class imbalance shows up when the target distribution is uneven enough that naive training optimises for the majority class. The pathology:

An aggregate objective such as log-loss, or a metric such as accuracy, is dominated by the 99.6% of examples that are negative. Getting those right improves the number far more than getting the 0.4% of positives right.
The gradient signal for the minority class is small. A tree boosting algorithm that splits on the overall loss will find splits that separate majority points better than it finds splits that isolate minority points.
The decision threshold (typically 0.5 for probabilistic classifiers) is calibrated to accuracy, which rewards “predict majority always”.
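
To make the “predict majority always” point concrete, here is a minimal sketch of that baseline at the article's 0.4% positive rate. The helper name `baseline_metrics` and the 1,000,000-row dataset size are illustrative assumptions, not part of the team's code.

```python
def baseline_metrics(n_rows: int, n_pos: int) -> dict:
    """Metrics for a classifier that labels every row as the majority (negative) class."""
    n_neg = n_rows - n_pos
    tp, fp = 0, 0            # nothing is ever flagged
    tn, fn = n_neg, n_pos    # every negative is right, every positive is missed
    return {
        "accuracy": (tp + tn) / n_rows,
        "recall": tp / (tp + fn),
    }


# Illustrative size (assumption): 1,000,000 rows at a 0.4% positive rate.
metrics = baseline_metrics(n_rows=1_000_000, n_pos=4_000)
print(metrics)
```

The baseline scores 99.6% accuracy with zero recall, which is why accuracy on its own says little about a fraud model.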

There are five families of techniques. They’re not mutually exclusive; production systems typically combine two or three.

1. Resampling the training data. Change the distribution the model sees, not the model itself.

Random oversampling: duplicate minority-class rows until the two classes are balanced (or some target ratio). Simple, risks overfitting because the same rows appear many times.
Random undersampling: drop majority-class rows until balanced. Throws away data, which hurts when the majority class has its own complexity to learn.
SMOTE (Synthetic Minority Over-sampling Technique): generate synthetic minority examples by interpolating between existing ones in feature space. More robust than duplication, but assumes that interpolation in feature space is meaningful (continuous numeric features, not label-encoded or embedded categoricals).
ADASYN: variant of SMOTE that oversamples more aggressively near decision boundaries.
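
The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_sketch`, the brute-force neighbour search, and the toy data are all assumptions made for the example, and it only suits small, purely numeric arrays.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances between minority rows (brute force, demo only).
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                 # a row is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]   # k nearest minority rows per row

    base = rng.integers(0, n, size=n_new)                     # random minority rows
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base row
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])


# Toy minority class (hypothetical data): 50 rows, 3 numeric features.
X_min = np.random.default_rng(1).normal(size=(50, 3))
X_syn = smote_sketch(X_min, n_new=200)
print(X_syn.shape)
```

Every synthetic row lies on the line segment between two real minority rows, which is why the technique depends on that segment being a plausible region for fraud.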

2. Reweighting the loss. Keep the training data but make the loss care more about minority-class errors.

Class weights: most frameworks accept a class_weight or scale_pos_weight parameter that multiplies the loss contribution of minority-class examples. XGBoost’s scale_pos_weight is the canonical example; scikit-learn’s class_weight='balanced' is another.
Focal loss: a loss function that down-weights easy examples (the majority of the majority class) and up-weights hard ones, including minority cases. Originally for object detection; applies to any binary classification. Requires implementing the loss, not a parameter flip.
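
As a rough illustration of how focal loss down-weights easy examples, the sketch below compares it with plain cross-entropy for a single example. The function names are hypothetical, and `gamma=2.0` with `alpha=1.0` are assumed values chosen for the demonstration, not settings from the article.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Log-loss for one example, given the probability assigned to its true class (0 < p_true <= 1)."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Easy, medium, and hard examples (probability given to the true class).
for p in (0.95, 0.6, 0.1):
    print(p, round(cross_entropy(p), 4), round(focal_loss(p), 4))
```

With these assumed settings, an example the model already gets right with probability 0.95 keeps only 0.25% of its cross-entropy, while a hard example at 0.1 keeps 81%.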

3. Threshold tuning. Leave the model as is; change where the predicted probability becomes a “yes”.

The default threshold of 0.5 targets accuracy. For imbalanced problems, the threshold that maximises business value is usually much lower (e.g. 0.15), catching more positives at the cost of more false positives. Finding it is a simple sweep against the business cost function on a validation set.
An empirical sweep only needs scores that rank transactions well. Well-calibrated probabilities matter when the threshold is derived from the costs analytically, or when downstream systems consume the probability itself; tree-based models often need Platt scaling or isotonic regression for that.
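
If the probabilities are calibrated, the cost-minimising threshold can also be derived directly rather than swept. A short derivation, using $C_{FN}$ and $C_{FP}$ as shorthand (not in the original) for the cost of a miss and of a false positive, and $p$ for the calibrated fraud probability of a transaction:

```latex
\begin{aligned}
\text{flag the transaction} &\iff p \, C_{FN} > (1 - p) \, C_{FP} \\
&\iff p > t^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} \\
t^{*} &= \frac{8}{8 + 140} \approx 0.054 \quad \text{for the costs in this article}
\end{aligned}
```

This holds only for calibrated probabilities. Class weights and resampling shift the model's scores, so for models trained that way the empirical sweep on validation data is the safer route.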

4. Ensemble methods specifically for imbalance.

BalancedRandomForestClassifier / EasyEnsemble: build each tree on a balanced bootstrap sample, averaging across trees that each saw a different slice of the majority class. Preserves all minority data; uses all majority data across the ensemble.
Cost-sensitive boosting: variants of AdaBoost that weight misclassifications by their business cost. Available in some libraries; conceptually similar to class weighting.
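
The balanced-subsample idea can be sketched as follows. This is a simplified illustration in the spirit of EasyEnsemble, not imbalanced-learn's implementation: the helper name `balanced_subsample`, the toy label vector, and the choice of 100 learners are assumptions.

```python
import numpy as np


def balanced_subsample(y: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Row indices for one learner: every minority row plus an equal-sized random majority draw."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    neg_draw = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    return np.concatenate([pos_idx, neg_draw])


rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:40] = 1                                   # toy labels at a 0.4% positive rate

# One balanced index set per learner in the ensemble.
subsamples = [balanced_subsample(y, rng) for _ in range(100)]

# Fraction of majority rows seen by at least one learner.
seen = np.unique(np.concatenate(subsamples))
seen_neg = seen[y[seen] == 0]
coverage = len(seen_neg) / int((y == 0).sum())
print(len(subsamples[0]), round(coverage, 3))
```

Each learner sees a 50/50 class mix, and majority coverage grows with the number of learners, which is how the ensemble avoids discarding majority data outright.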

5. Anomaly / outlier methods instead of classification. When the positive class is truly rare (<0.1%), reframing the problem as “is this point unusual?” rather than “is this point fraud?” can work better. Isolation Forest, One-Class SVM, and Random Cut Forest (SageMaker built-in) are the usual tools. They build a model of what normal data looks like and flag deviations from it. Useful when the minority class is both rare and heterogeneous.

The interesting question for this team is which of these fits 0.4% imbalance, 8M rows, tree-based training, and asymmetric misclassification costs.

What we’ll filter on

Works with tree-based models (XGBoost/LightGBM): does the technique require a specific model family?
Preserves or generates data for both classes: throwing away majority data is expensive when we have it.
Handles 0.4% imbalance specifically: some techniques break down at extreme ratios.
Incorporates business cost asymmetry: $140 for a miss vs. $8 for a false positive.
Implementation complexity: a flag flip vs. a custom loss vs. a new training pipeline.

The class-imbalance landscape

1. scale_pos_weight in XGBoost (class weighting). One parameter. Set to (count_neg / count_pos) ≈ 249 for a 0.4% positive rate. The per-example loss for positives is multiplied by this factor. Cheap, works well for moderate imbalance, can over-correct at extreme imbalance (many false positives, inflated predicted probabilities).
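
The arithmetic behind that factor, with a conceptual sketch of what the weight does to a single example's log-loss. The function `weighted_log_loss` is a hypothetical illustration of the idea, not XGBoost's internal code.

```python
import math

n_pos = 32_000                      # fraud rows (from the article)
n_neg = 8_000_000 - n_pos           # legitimate rows
scale_pos_weight = n_neg / n_pos    # 249.0


def weighted_log_loss(y: int, p: float, pos_weight: float) -> float:
    """Log-loss for one example (0 < p < 1), with positives scaled by pos_weight."""
    weight = pos_weight if y == 1 else 1.0
    return -weight * (y * math.log(p) + (1 - y) * math.log(1 - p))


print(scale_pos_weight)
print(weighted_log_loss(1, 0.5, scale_pos_weight))   # a positive at p = 0.5
print(weighted_log_loss(0, 0.5, scale_pos_weight))   # a negative at p = 0.5
```

With this weight the 32,000 positives carry the same total weight as the 7,968,000 negatives, which is the sense in which the classes are "balanced".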

2. Random undersampling. Keep all ~32,000 fraud cases; randomly drop majority cases until the ratio is, say, 1:10. Now training on ~350,000 rows instead of 8M. Fast to train, loses information. Reasonable baseline but rarely the final answer.
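
A minimal sketch of undersampling to a fixed ratio, assuming labels in a NumPy array. The helper name `undersample_indices` and the toy label vector are illustrative assumptions.

```python
import numpy as np


def undersample_indices(y: np.ndarray, neg_per_pos: int = 10, seed: int = 0) -> np.ndarray:
    """Keep every positive row and at most neg_per_pos randomly chosen negatives per positive."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos_idx, kept_neg]))


# Toy labels (hypothetical): 100,000 rows at a 0.4% positive rate.
y = np.zeros(100_000, dtype=int)
y[:400] = 1
idx = undersample_indices(y)          # 400 positives + 4,000 negatives
print(len(idx), int(y[idx].sum()))
```

The returned indices would be used to slice the training features and labels only; validation and test data keep their original class mix so the reported metrics stay honest.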

3. SMOTE + oversampling. Use imbalanced-learn to generate additional fraud examples via k-NN interpolation until the ratio is 1:10 or 1:5. Preserves all majority data, augments minority data with structured variations. Requires a preprocessing step; works best with numeric features. Risk: synthetic examples may bleed across the true decision boundary if the feature space is discontinuous.

4. Balanced Random Forest / EasyEnsemble. An ensemble where each tree trains on a different balanced subsample. Built into imbalanced-learn as a drop-in for scikit-learn’s RandomForestClassifier. Competitive with XGBoost on many problems; the ensemble structure gives robustness.

5. Threshold tuning. Keep the model; sweep thresholds from 0.01 to 0.5 against a business-cost function: cost = miss_cost * FN + fp_cost * FP. Pick the threshold that minimises total cost on a validation set. Should always be done regardless of other choices.

6. Focal loss. Implement in XGBoost via a custom objective function (XGBoost supports this natively). More complex but particularly good when the minority class is heterogeneous and there are “easy” positives and “hard” positives. Overkill for most problems.

7. Anomaly detection (Isolation Forest / Random Cut Forest). Reframe: train on majority-class only, score each transaction by its anomaly likelihood. Useful when the positive class is both rare and doesn’t form clean clusters. Loses the precise labelling signal; trades calibration for robustness.
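
The “train on majority only, score by unusualness” framing can be illustrated with a deliberately simple stand-in. The sketch below uses squared Mahalanobis distance rather than Isolation Forest or Random Cut Forest, purely to show the shape of the workflow; the function names and toy data are assumptions.

```python
import numpy as np


def fit_normal_profile(X_normal: np.ndarray):
    """Estimate the mean and inverse covariance of majority-class (normal) rows."""
    mean = X_normal.mean(axis=0)
    cov = np.cov(X_normal, rowvar=False)
    return mean, np.linalg.pinv(cov)


def anomaly_score(X: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance from the normal profile; larger means more unusual."""
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


rng = np.random.default_rng(0)
X_normal = rng.normal(size=(5_000, 4))        # toy "legitimate" rows
X_odd = rng.normal(loc=4.0, size=(20, 4))     # toy rows far from the normal cloud

mean, inv_cov = fit_normal_profile(X_normal)  # fitted without any fraud labels
scores_normal = anomaly_score(X_normal, mean, inv_cov)
scores_odd = anomaly_score(X_odd, mean, inv_cov)
print(np.median(scores_normal), np.median(scores_odd))
```

No fraud labels are used at fit time, which is both the appeal (no dependence on rare positives) and the cost (the score says "unusual", not "fraud").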

Side by side

Technique	Works with XGBoost	Preserves data	Handles 0.4%	Cost asymmetry	Complexity
`scale_pos_weight`	✓ native	✓	✓ (with care)	partial (weights by class ratio, not by dollar cost)	Very low
Random undersampling	✓	✗ (drops majority)	✓	✗ (needs pairing)	Low
SMOTE + oversampling	✓ (preprocess)	✓ (generates)	✓	✗ (needs pairing)	Medium
Balanced Random Forest	RF-based	✓	✓	partial	Medium
Threshold tuning	✓	✓	✓	✓ (direct)	Very low
Focal loss	✓ custom objective	✓	✓	partial (via class weighting term)	High
Anomaly detection	✗ (different model)	✓ (majority-only)	✓	Indirect	Medium

Reading by what the team needs:

Base level: scale_pos_weight in XGBoost to balance the gradient signal + threshold tuning against the business cost function. Two small changes, probably 70% of the win.
Next level: combine with SMOTE to augment the minority class with synthetic examples. Typically adds a few more points of recall (0.72 to 0.76 in the worked iteration below).
Diminishing returns: focal loss for specific edge cases; anomaly detection as a parallel signal, not a replacement.

Visualising the rebalance

Four views of the same data. Naive training produces a curve with a useless operating point. Class weighting and SMOTE move the achievable performance up; threshold tuning against the business cost function moves the operating point to where it belongs.

The pick in depth

Combine scale_pos_weight + threshold tuning as the baseline; add SMOTE if needed.

The training script change:

from xgboost import XGBClassifier

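# Count each class explicitly rather than relying on value_counts() ordering.
pos = int((y_train == 1).sum())
neg = int((y_train == 0).sum())
scale = neg / pos  # ~ 249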

clf = XGBClassifier(
    objective='binary:logistic',
    eval_metric='aucpr',              # PR-AUC not log-loss; reflects imbalance
    scale_pos_weight=scale,
    max_depth=6,
    n_estimators=500,
    learning_rate=0.05,
    tree_method='hist',
    early_stopping_rounds=20,
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)])

Two changes:

scale_pos_weight=249 multiplies the loss contribution of fraud examples by 249, rebalancing the gradient signal.
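eval_metric='aucpr' monitors precision-recall AUC on the validation set instead of log-loss, and early stopping follows it; the training objective itself is unchanged. PR-AUC is the more informative metric under imbalance: ROC-AUC can look flattering because the false-positive rate divides by the huge number of negatives, so thousands of false alarms barely move it.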
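
A small worked example of why the ROC view can flatter an imbalanced model. The operating point below is hypothetical (it is not one of the team's results), and the helper name `rates` is an assumption.

```python
def rates(tp: int, fp: int, n_pos: int, n_neg: int) -> dict:
    """Recall, false-positive rate, and precision for one operating point."""
    return {
        "recall": tp / n_pos,                # y-axis of the ROC curve
        "false_positive_rate": fp / n_neg,   # x-axis of the ROC curve: divides by all negatives
        "precision": tp / (tp + fp),         # PR curve: divides by flagged rows only
    }


# Hypothetical operating point: 1,000,000 rows, 0.4% fraud, 10,000 rows flagged.
point = rates(tp=2_000, fp=8_000, n_pos=4_000, n_neg=996_000)
print(point)
```

A false-positive rate under 1% looks excellent on a ROC curve, yet four out of five flagged transactions at this point are legitimate. The precision-recall view makes that visible.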

That moves recall from 0.02 to ~0.55 at threshold 0.5. Now threshold-tune against the business cost:

import numpy as np

# Predicted fraud probability for each validation row
probs = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)

costs = []
for t in thresholds:
    pred = (probs > t).astype(int)
    fn = ((pred == 0) & (y_val == 1)).sum()  # missed fraud
    fp = ((pred == 1) & (y_val == 0)).sum()  # legitimate transactions blocked
    costs.append(140 * fn + 8 * fp)          # business cost in dollars

optimal_t = thresholds[np.argmin(costs)]  # ~0.12

Setting the decision threshold to 0.12 (rather than 0.5) lifts recall to ~0.72 at the cost of precision dropping to ~0.29. The business cost function says this is strictly better: the $140-per-miss makes missing fraud very expensive, the $8-per-false-positive makes over-flagging cheap, and the optimal operating point reflects that.
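
The trade-off can be checked with back-of-envelope arithmetic from the recall and precision figures quoted in this article. The 1,000,000-transaction volume is an assumption made for the comparison, and `expected_cost` is a hypothetical helper.

```python
MISS_COST = 140   # dollars per missed fraud (from the article)
FP_COST = 8       # dollars per false positive (from the article)


def expected_cost(n_fraud: int, recall: float, precision: float) -> float:
    """Total cost implied by a recall/precision pair for n_fraud fraud cases."""
    tp = recall * n_fraud
    fn = n_fraud - tp
    fp = tp * (1 - precision) / precision   # rearranged from precision = tp / (tp + fp)
    return MISS_COST * fn + FP_COST * fp


n_fraud = 4_000   # assumption: 1,000,000 transactions at a 0.4% fraud rate
cost_v1 = expected_cost(n_fraud, recall=0.02, precision=0.88)
cost_tuned = expected_cost(n_fraud, recall=0.72, precision=0.29)
print(round(cost_v1), round(cost_tuned))
```

Under these assumptions the original model costs roughly $549K per million transactions and the tuned one roughly $213K: the extra false positives are far cheaper than the fraud they prevent.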

If recall still falls short of the target, SMOTE is the next lever. Resample the training split only (never the validation or test data) before fitting, oversampling the minority to 20% of the majority:

from imblearn.over_sampling import SMOTE

X_res, y_res = SMOTE(sampling_strategy=0.2, random_state=42).fit_resample(X_train, y_train)

In SageMaker, SMOTE can run in a Processing Job that reads X_train.parquet and writes the resampled dataset to a new S3 prefix.
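
One thing to check when SMOTE is stacked on top of scale_pos_weight is whether the weight should be recomputed from the resampled counts. The sketch below is only the arithmetic, using the article's row counts; whether to recompute the weight is a judgment call to validate empirically, not something the article prescribes.

```python
n_neg = 7_968_000          # legitimate rows (from the article)
n_pos = 32_000             # fraud rows (from the article)
sampling_strategy = 0.2    # minority / majority ratio after SMOTE, as in the article

n_pos_resampled = round(sampling_strategy * n_neg)
n_synthetic = n_pos_resampled - n_pos

ratio_before = n_neg / n_pos             # 249.0, the original scale_pos_weight
ratio_after = n_neg / n_pos_resampled    # 5.0 on the resampled training set

print(n_synthetic, ratio_before, ratio_after)
```

After resampling, positives are already about 50 times more common than before, so keeping a weight of 249 on top would emphasise them far more than the original class ratio implies.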

If the team still isn’t there, the next steps are: richer features (especially features engineered by fraud analysts, like velocity checks), a separate model for specific fraud types, or adding an anomaly-detection signal as a feature in the main model (the Isolation Forest’s score as an input feature of XGBoost). But the first three techniques (scale_pos_weight, threshold tuning, SMOTE) usually get a team from “2% recall” to “usable”.

A worked training iteration

Monday morning, the team’s v1 model sits in Model Registry as fraud-v1 with recall 0.02. They plan five iterations, each building on an earlier one:

v2: baseline + scale_pos_weight=249 + eval_metric=aucpr. Training takes 18 minutes. Results: recall 0.55 at threshold 0.5, PR-AUC 0.52.
v3: v2 + threshold tuning on validation set against the business cost function. Recall 0.72 at threshold 0.12, daily cost saving estimated $11.4K.
v4: v3 + SMOTE oversampling to 20% minority. Training takes 28 minutes. Recall 0.76 at the tuned threshold, precision 0.27. Daily cost saving estimated $12.1K.
v5: v4 + focal loss via custom objective. Training takes 45 minutes, requires code changes. Recall 0.77, precision 0.28. Marginal improvement; not worth the complexity.
v6: v3 + an Isolation Forest score as an input feature. Recall 0.81, precision 0.31. Daily cost saving estimated $13.2K.

The team ships v4 (simple, well-understood, meets the 60% requirement) to production via SageMaker Pipelines with blue/green deployment. v6 goes into a research backlog; v5 is archived.

Model Monitor compares the production model’s rolling daily fraud-catch rate against the v1 baseline; within two weeks the business sees roughly the $12K/day saving estimated for v4 materialise on the weekly ops report.

What’s worth remembering

Accuracy is the wrong metric for imbalanced problems. Predicting the majority class always gives high accuracy and catches nothing. Use PR-AUC, F1 on the minority class, or, best of all, a business cost function. eval_metric='aucpr' in XGBoost switches the validation metric (and early stopping) to PR-AUC.
scale_pos_weight rebalances the gradient. Set it to n_neg / n_pos; XGBoost multiplies the loss contribution of positive examples by that factor. Single parameter, enormous lift.
Threshold tuning is a business choice. The 0.5 default optimises accuracy. The optimal threshold for a business cost function is usually much lower when misses cost more than false positives. Always do the sweep; pick the minimum-cost point.
SMOTE generates minority examples. Better than duplication for most tabular problems; worse when the feature space has discontinuities (categorical embeddings, sparse features). Try it; measure it.
Random undersampling throws away information. Fast baseline, rarely the final answer. Useful only when training speed is the binding constraint.
Balanced Random Forest is a solid alternative. Ensemble methods that build each tree on a balanced subsample; imblearn has the implementation. Competitive with XGBoost + weighting on many problems.
Anomaly detection is a different framing. Isolation Forest, One-Class SVM, Random Cut Forest train on majority-class only and score by unusualness. Useful as a signal or a fallback when the positive class is too heterogeneous to model directly.
The techniques compose. scale_pos_weight + SMOTE + threshold tuning stacks cleanly. Focal loss and ensemble methods are more involved but still compatible. Start simple, measure each change, keep the ones that help.

Class imbalance is a question about where the signal lives in the data. The naive loss function doesn’t see the signal because the majority class drowns it out. Every technique in the toolkit is a different way of saying “pay attention to the minority class”: by over-representing it (SMOTE), by weighting it more (scale_pos_weight), by scoring it differently (threshold tuning), or by reframing the question (anomaly detection). Two or three of them combined usually turn a 2%-recall model into a 70%-recall model, which is the difference between a number in a slide deck and a shipped system.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.