The situation
A data science team at a subscription-box company has three modelling problems sitting on the backlog:
- Fraud rate prediction per product line, quarterly. ~40 product lines, ~600,000 historical transactions with ~60 features per transaction (dollar amount, chargeback history, payment-method, delivery address age, IP reputation, and so on). Target is a binary outcome aggregated to a rate. The finance team wants it quarterly and wants to know which features drive the rate.
- Daily demand forecast for 4,000 SKUs, 90 days ahead. Historical sales data goes back three years at daily granularity. Strong weekly and yearly seasonality. New SKUs are added monthly; they have short histories but share families with older SKUs.
- Ad-spend-to-conversion attribution model for the marketing team. ~15 ad channels, historical spend and conversions, plus promotions and holidays as exogenous signals. Marketing specifically wants to know “if we shift $10K from Instagram to YouTube next week, what conversions do we expect?”, a counterfactual question. They want feature importances and they want inference in seconds, not hours.
Each problem has a SageMaker built-in algorithm that looks like a fit on name alone. The shopping list includes XGBoost, LightGBM, and DeepAR, with at least three more (Linear Learner, Factorisation Machines, Random Cut Forest) worth naming to rule out. The question isn’t which single algorithm is “best”; it’s which matches each problem’s shape.
What actually matters
Before reaching for a built-in container, it’s worth naming what these algorithms actually do and where they diverge.
XGBoost is a gradient-boosted tree ensemble. It builds shallow decision trees one at a time, each tree correcting the residuals of the ensemble so far. The input is a tabular dataset: rows of features plus a target column. Features are usually numeric; categorical ones are typically one-hot encoded or target encoded first. Works on classification, regression, and ranking. Outputs predictions with per-feature importance scores (gain, cover, frequency) and supports SHAP values for per-row explanations. SageMaker ships XGBoost 1.5 through 1.7 and the newer 2.x. Industry default for tabular problems; if someone is starting with “I have a CSV”, XGBoost is the baseline they should beat.
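To make "each tree correcting the residuals of the ensemble so far" concrete, here is a minimal sketch of boosting under squared loss, using one-split stumps on a single feature. It is a deliberate simplification and not XGBoost's actual algorithm, which also uses second-order gradients and regularisation; the function names, the toy data, and the learning rate are all illustrative assumptions.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict(stumps, base, learning_rate, x):
    """Ensemble prediction: base value plus the shrunken output of every stump."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total

def boost(xs, ys, rounds, learning_rate):
    """Each round fits a new stump to what the current ensemble still gets wrong."""
    base = mean(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - predict(stumps, base, learning_rate, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

def mse(xs, ys, base, stumps, learning_rate):
    return mean([(y - predict(stumps, base, learning_rate, x)) ** 2 for x, y in zip(xs, ys)])

# Toy data (made up for illustration): one feature, a step-like target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.0, 6.2]
LEARNING_RATE = 0.3

base, stumps = boost(xs, ys, rounds=50, learning_rate=LEARNING_RATE)
mse_start = mse(xs, ys, base, [], LEARNING_RATE)
mse_end = mse(xs, ys, base, stumps, LEARNING_RATE)
```

Comparing `mse_start` with `mse_end` shows the training error shrinking as stumps accumulate, which is the whole mechanism in miniature.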
LightGBM is also a gradient-boosted tree ensemble, but with different mechanical choices: leaf-wise tree growth instead of level-wise (deeper asymmetric trees, faster convergence on large datasets), histogram-based feature bucketing (less memory, faster training), and native handling of categorical features without one-hot encoding. The output shape (predictions, feature importances, SHAP support) is the same as XGBoost. Usually two or three times faster to train on large datasets, sometimes slightly lower accuracy on small ones. Arrived in SageMaker’s catalogue as a built-in a few years after XGBoost; for many teams the two are interchangeable baselines and the choice comes down to which is tuned better at that point.
DeepAR is fundamentally different. A recurrent neural network (LSTM-based) designed specifically for probabilistic time-series forecasting across many related series. The input is not a tabular row but a collection of time series (thousands of them) with optional per-series and per-time-step covariates. DeepAR learns a single global model that generalises across all the series, which is why it handles the “new SKU with a short history” case: the new series inherits patterns learned from its cousins in the same product family. The output is not a point prediction but a full predictive distribution (quantiles), which is what inventory planning actually wants; “demand is 120 with 80% confidence interval 85-160” is more useful than “demand is 120”. Does not do classification or regression in the general tabular sense; it’s a time-series specialist.
The problems divide cleanly:
- Fraud rate prediction with feature importance is a tabular classification / regression problem. Either XGBoost or LightGBM fits.
- Daily SKU demand across 4,000 series with yearly seasonality and new-SKU cold start is a many-related-time-series problem. DeepAR.
- Ad-spend attribution with counterfactual “what if” queries is a tabular regression with causal interpretation. XGBoost or LightGBM, probably with SHAP for the feature-importance story, probably not DeepAR (attribution is more about relationships between features than the temporal shape of a single one).
What we’ll filter on
- Problem type: tabular classification/regression, time-series forecasting, ranking, clustering, anomaly detection?
- Handles many related series natively: does the algorithm learn one model across thousands of series, or one per series?
- Categorical features: does it handle them natively or do we one-hot encode?
- Explainability: feature importances and SHAP support out of the box?
- Output shape: point predictions, probabilities, or full distributions?
The algorithm landscape
1. XGBoost. Gradient-boosted trees for tabular data. Classification, regression, ranking. Extensively tuned, well-documented, stable; probably the most common first model on any tabular problem in the industry. Handles missing values natively. Categorical features need encoding. Supports early stopping, tree_method=hist for large datasets, and GPU training. In SageMaker it is available both as a built-in algorithm container and as a script-mode framework image that runs our own Python training script.
2. LightGBM. Gradient-boosted trees with different internals: leaf-wise growth, histogram splits, native categorical support. Faster on large datasets; competitive accuracy with XGBoost after tuning. Same output shape (predictions, feature importances, SHAP). In SageMaker as a built-in algorithm since 2022, also available via bring-your-own.
3. DeepAR. Autoregressive RNN for probabilistic forecasting across related time series. One model learns patterns across thousands of series simultaneously; new series with short histories inherit patterns from their covariate cohort. Outputs quantile forecasts. Handles per-series static features (e.g. product category) and per-timestep dynamic covariates (e.g. promotion flags, holidays). Does not do classification or general regression.
4. Linear Learner. Generalised linear models at scale: logistic, softmax, linear regression. Fast on very large tabular datasets; interpretable coefficients. Typically outperformed by boosted trees on structured problems with non-linear relationships, but worth considering when explainability needs are extreme (the coefficient table is the explanation) or when training time is the binding constraint. Worth naming to rule out for the three problems here.
5. Factorisation Machines. Linear models with pairwise feature interactions, popular for recommendation and high-dimensional sparse problems (click-through rate on sparse user × item matrices). Wrong shape for all three problems above.
6. Random Cut Forest (RCF). Unsupervised anomaly detection via random-split forests. Useful when the question is “is this row unusual compared to the rest?” rather than “predict a target”. Fits the real-time fraud case (scoring individual transactions) better than the quarterly rate prediction; a different problem in the fraud family.
7. Prophet / NeuralProphet / BYO ARIMA. Not SageMaker built-ins; available via bring-your-own Python containers. Reasonable baselines for single-series or small-series-count forecasting. For 4,000 SKUs, training one model per series is a scaling problem DeepAR is designed to avoid.
Side by side
| Algorithm | Problem type | Many related series | Categorical features | Explainability | Output shape |
|---|---|---|---|---|---|
| XGBoost | Tabular class/reg/rank | ✗ | Encode first | ✓ (gain, SHAP) | Point prediction |
| LightGBM | Tabular class/reg/rank | ✗ | ✓ native | ✓ (gain, SHAP) | Point prediction |
| DeepAR | Time-series forecasting | ✓ (designed for it) | ✓ via static features | Limited | Quantile forecast |
| Linear Learner | Tabular class/reg | ✗ | Encode first | ✓ (coefficients) | Point/prob |
| Factorisation Machines | Sparse pairs / recsys | ✗ | ✓ sparse | ✗ | Point prediction |
| Random Cut Forest | Anomaly detection | ✗ | Encode first | Limited | Anomaly score |
Reading by problem:
- Fraud rate per product line, quarterly, with feature importance: XGBoost or LightGBM. Tabular, needs explainability, 600K rows is comfortable for either. LightGBM wins on training speed with 60 features and several categorical fields; XGBoost wins on community-tuned hyperparameters and the team’s muscle memory. Either will hit the accuracy target; pick the one the team knows better, or run both in parallel and race them.
- Daily demand, 4,000 SKUs, 90 days ahead, new SKUs cold-start: DeepAR. No other built-in is designed for this scale of related series with cold-start handling.
- Ad-spend attribution with counterfactual queries and fast inference: XGBoost or LightGBM with SHAP. The counterfactual question (“what if we shift spend?”) is a sensitivity analysis on a tabular model, which both tree models handle. DeepAR’s strength (learning across many series) isn’t what attribution needs; attribution needs interpretable relationships between features.
The picks in depth
Fraud rate → LightGBM (with XGBoost as the baseline-to-beat). The training set fits comfortably in memory, the feature set has a mix of numeric and categorical variables, and finance wants the feature-importance report. LightGBM’s native categorical handling means we can train on payment_method, product_line, ip_country without one-hot encoding them into 200 columns, which is both faster to train and often more accurate. SageMaker’s LightGBM built-in ships as a managed container; training takes input from S3 in CSV or libsvm format, supports hyperparameter tuning through SageMaker Automatic Model Tuning, and saves the resulting model artifact back to S3. Feature importance comes out via model.feature_importance(importance_type='gain'); SHAP values via the shap Python library on the trained model.
The workflow: train LightGBM and XGBoost in parallel on the same dataset with the same tuning budget (say, 50 trials each, Bayesian search). Whichever wins on held-out AUC by a meaningful margin ships; if they’re within 0.5%, pick the one that trains faster. Most teams find LightGBM wins on this shape of dataset by a narrow margin and trains in 1/3 the time, but running both is cheap insurance.
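The selection rule in that workflow is simple enough to write down. The sketch below assumes "within 0.5%" means an absolute AUC gap of 0.005, which is one reading of the text; the `TuningResult` type and the numbers in the example are placeholders for illustration, not measured results.

```python
from dataclasses import dataclass

@dataclass
class TuningResult:
    name: str
    holdout_auc: float      # best held-out AUC found by the tuning job
    train_minutes: float    # wall-clock training time of that best trial

def pick_winner(a, b, auc_margin=0.005):
    """Ship the clear AUC winner; on a near-tie, ship whichever trains faster."""
    if abs(a.holdout_auc - b.holdout_auc) >= auc_margin:
        return a if a.holdout_auc > b.holdout_auc else b
    return a if a.train_minutes <= b.train_minutes else b

# Placeholder figures only, chosen to exercise the tie-break branch.
lightgbm_result = TuningResult("lightgbm", holdout_auc=0.912, train_minutes=11.0)
xgboost_result = TuningResult("xgboost", holdout_auc=0.910, train_minutes=34.0)
winner = pick_winner(lightgbm_result, xgboost_result)
```

Fixing the margin before the race starts keeps the decision from drifting towards whichever library the team already prefers.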
SKU demand → DeepAR. The training data is 4,000 time series × ~1,100 days each ≈ 4.4M rows. DeepAR trains one global model across all of them. The input format is JSON Lines: one object per series with start (ISO timestamp), target (array of observations), optional cat (static categorical feature, e.g. product family ID), and optional dynamic_feat (array of dynamic covariates, e.g. promotion flags aligned with the target). The trained model does probabilistic inference: for a given series, it returns samples from the forecast distribution, which we summarise as p10/p50/p90 quantiles.
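A minimal sketch of producing one such JSON Lines record, using only the field names given above (start, target, cat, dynamic_feat). The helper name `to_deepar_record` and the sample values are hypothetical; a real prep job would read them from the sales tables.

```python
import json
from datetime import date

def to_deepar_record(start, sales, family_id, promo_flags):
    """Build one training record: one series, one static category, one dynamic feature."""
    if len(promo_flags) != len(sales):
        # For training, each dynamic feature needs one value per observation.
        raise ValueError("promo_flags must have the same length as sales")
    return {
        "start": start.isoformat(),
        "target": list(sales),
        "cat": [family_id],                    # static categorical, e.g. product family ID
        "dynamic_feat": [list(promo_flags)],   # outer list: one entry per dynamic feature
    }

# Illustrative values only.
record = to_deepar_record(date(2024, 10, 1), [4, 2, 5, 3], 12, [0, 0, 1, 0])
line = json.dumps(record)  # one line of the JSON Lines file
```

Writing one `line` per SKU, separated by newlines, gives the file layout described above.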
The cold-start handling is the specific reason to pick DeepAR over “run 4,000 ARIMA models”. A new SKU added this month has 30 days of history; DeepAR learned from 4,000 other SKUs how product-family members behave, so the forecast for the new one uses the learned family shape blended with whatever signal the 30 days provide. Training hyperparameters: context_length around 60 days, prediction_length 90 days, num_cells 40, num_layers 2, likelihood student-T (longer tails than Gaussian, which fits retail demand). Training on a single ml.g5.xlarge takes about 90 minutes.
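Written out as the hyperparameter map a training job would receive, those settings might look like the sketch below. The `time_freq` value of "D" is our assumption based on the daily data. As far as we know the algorithm also requires an `epochs` value, which the text does not give, so it is left out here rather than guessed.

```python
# Hyperparameters are passed to the training job as strings.
deepar_hyperparameters = {
    "time_freq": "D",             # assumption: daily granularity, as in the sales history
    "context_length": "60",       # days of history the model conditions on
    "prediction_length": "90",    # forecast horizon in days
    "num_cells": "40",
    "num_layers": "2",
    "likelihood": "student-T",    # heavier tails than Gaussian
    # "epochs": not stated in the text; set it before launching a job
}

horizon_days = int(deepar_hyperparameters["prediction_length"])
```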
Ad attribution → XGBoost + SHAP. The problem is small (15 channels × ~150 weekly rows × a handful of exogenous features ≈ 2,250 rows) but the demand is high on explainability. XGBoost trains in seconds, SHAP gives per-row attributions, and the “what if we shift $10K” question is answered by a sensitivity analysis: change the spend columns and re-predict, or (more correctly, because attribution has a causal structure) use SHAP interaction values to decompose the contribution of each channel to the predicted conversions.
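The simpler of the two approaches, "change the spend columns and re-predict", can be sketched as follows. The `toy_predict` function is a stand-in for the trained model's predict call and is entirely made up (a diminishing-returns curve per channel); the column names and spend figures are likewise hypothetical.

```python
import math

def toy_predict(row):
    """Stand-in for a trained model: diminishing returns on each channel's spend."""
    return 3.0 * math.sqrt(row["instagram_spend"]) + 4.0 * math.sqrt(row["youtube_spend"])

def shift_spend(row, source, target, amount):
    """Return a copy of the row with `amount` moved from one channel to another."""
    if row[source] < amount:
        raise ValueError("cannot shift more than the source channel's current spend")
    shifted = dict(row)  # copy, so the baseline row is left untouched
    shifted[source] -= amount
    shifted[target] += amount
    return shifted

# Hypothetical planned spend for next week.
baseline = {"instagram_spend": 40_000.0, "youtube_spend": 20_000.0}
counterfactual = shift_spend(baseline, "instagram_spend", "youtube_spend", 10_000.0)

# Expected change in conversions according to the model.
delta = toy_predict(counterfactual) - toy_predict(baseline)
```

Two caveats apply to the real thing. The re-predicted number reflects the associations the model learned from historical spend, not a guaranteed causal effect. Tree models also do not extrapolate, so a shift that takes a channel outside the spend range seen in training deserves extra scepticism.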
DeepAR could be applied here if we framed it as one time series per channel; it would be the wrong tool. The question isn’t “what will conversions be next week?” (which DeepAR does well), it’s “how much of this week’s conversions came from Instagram?”, which is a feature-attribution question on a structured model. The answer shape determines the algorithm.
A worked DeepAR request
A daily cron triggers the forecast refresh at 02:00 UTC. The pipeline:
- Prep: a Glue job reads the previous day’s sales and emits JSON Lines for each SKU:
  {"start": "2024-10-01", "target": [4, 2, 5, ...], "cat": [12], "dynamic_feat": [[0,0,1,...]]}
  The dynamic_feat captures a binary promotion flag per day.
- Train: SageMaker training job on ml.g5.xlarge, input from S3, output artifact to S3. ~90 minutes for 4,000 series × 1,100 days.
- Batch transform: for inference, a batch job reads the most recent context_length days of each series plus the planned dynamic_feat for the next 90 days, and writes quantile forecasts to S3.
- Load: the forecasts land in a Redshift table; inventory planning consumes them via scheduled queries.
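The inference input for one series can be sketched as below. The point to notice is that dynamic_feat has to cover both the observed window and the 90 forecast days, so it is longer than target. The layout follows DeepAR's JSON request format as we understand it (an "instances" list plus a "configuration" object). For batch transform the instances go one per line and the configuration is supplied separately on the job, so treat the envelope here as an assumption to check against the docs. The helper name, `num_samples`, and all data values are illustrative.

```python
import json

CONTEXT_LENGTH = 60      # days of history sent per series
PREDICTION_LENGTH = 90   # forecast horizon in days

def build_instance(start, recent_sales, family_id, past_promos, planned_promos):
    """One series for inference: observed history plus promotion plans for the horizon."""
    if len(past_promos) != len(recent_sales):
        raise ValueError("need one past promotion flag per observed day")
    if len(planned_promos) != PREDICTION_LENGTH:
        raise ValueError("need one planned promotion flag per forecast day")
    return {
        "start": start,
        "target": list(recent_sales),
        "cat": [family_id],
        # Past flags followed by planned flags: history length + horizon length.
        "dynamic_feat": [list(past_promos) + list(planned_promos)],
    }

# Placeholder data: 60 observed days, then a hypothetical promotion in the final week.
recent_sales = [4, 2, 5] * 20
past_promos = [0, 0, 1] * 20
planned_promos = [0] * 83 + [1] * 7

request = {
    "instances": [build_instance("2024-10-01", recent_sales, 12, past_promos, planned_promos)],
    "configuration": {
        "num_samples": 100,                  # assumption: samples drawn per series
        "output_types": ["quantiles"],
        "quantiles": ["0.1", "0.5", "0.9"],  # p10 / p50 / p90
    },
}
body = json.dumps(request)
```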
Inventory then places restock orders sized to the p90 rather than the p50, so they’re buying enough to meet the 90th-percentile demand rather than being caught short half the time. The probabilistic output is the thing that makes that decision possible; point forecasts force a “safety stock multiplier” kludge that the team previously tuned by hand.
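As a rough illustration of sizing an order to the p90 rather than the p50, the sketch below summarises a set of forecast samples with a nearest-rank quantile. The sample values are made up and stand for total demand over one restock window; the `quantile` helper is our own, not part of any SageMaker API.

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile: the smallest sample with at least a fraction q at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical forecast samples of total demand over the restock window.
demand_samples = [85, 96, 104, 110, 118, 121, 127, 135, 148, 160]

p50 = quantile(demand_samples, 0.5)
p90 = quantile(demand_samples, 0.9)

# Ordering to p90 instead of p50 buys this much extra cover against a demand spike.
extra_units = p90 - p50
```

The gap between the two quantiles plays the role the hand-tuned safety stock multiplier used to play, but it comes from the forecast distribution itself.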
What’s worth remembering
- Tree models (XGBoost, LightGBM) are for tabular problems. Classification, regression, ranking. Point predictions with feature importance and SHAP support. Baseline-to-beat for anything in rows-and-columns shape.
- DeepAR is for many related time series. One global model learns across thousands of series, handles cold start for new series via covariate cohorts, and returns probabilistic quantile forecasts. Not a classification or regression tool.
- XGBoost vs LightGBM is usually a wash on accuracy. LightGBM is faster to train on large datasets with categorical features (native handling, histogram splits). XGBoost has deeper community tuning and is the safer default when the team is new to tree models. Race them in parallel; pick whichever wins.
- Output shape often decides the algorithm. If the business needs a full forecast distribution (quantiles, confidence bands), DeepAR is the only built-in that natively returns one. If the business needs feature attributions, trees + SHAP are the straightforward answer.
- Don’t use DeepAR on 15 time series. Its strength is generalising across many. A single series with clear seasonality is better served by classical forecasting or a tree on lagged features; DeepAR’s RNN machinery is overkill and often worse without enough series to learn from.
- Tabular + feature importance = trees. Don’t pick Linear Learner for explainability unless coefficients are specifically what the stakeholders want. Trees with SHAP give per-row explanations that linear coefficients can’t.
- Random Cut Forest, Factorisation Machines, and Linear Learner are adjacent. RCF for unsupervised anomaly detection; FM for high-cardinality sparse pairwise problems (recsys); Linear Learner for very large linear-enough problems. None of the three problems above wants any of these as the primary choice.
- SageMaker built-ins are managed containers. Same as any other framework training job: input from S3, artifact to S3, hyperparameters as JSON. Script-mode lets us write our own Python if the built-in’s defaults constrain us; most teams stay on the built-in through the first year.
Three problems, three algorithm shapes, and only one of them needs the neural network. Trees handle the tabular work with feature importance for free; DeepAR earns its place specifically when the thing we’re predicting is a shape through time, and there are many of those shapes to learn from at once. The interesting decisions are in the shape of the problem, not the novelty of the algorithm.