Stacking & Blending
Stacked Generalization and Holdout-Based Ensembling
"Don't just average your experts. Train a meta-expert to know when each expert is right."
1. What Are Stacking and Blending?
Stacking (Stacked Generalization) is an ensemble method introduced by Wolpert (1992) that trains a meta-learner to optimally combine the predictions of multiple base learners. Unlike voting (which uses a fixed combination rule — average or majority), stacking learns the combination from data.
Blending is a simpler variant of stacking that uses a single holdout set for meta-training instead of cross-validated out-of-fold predictions.
Both share the same philosophy: instead of asking "what does each classifier predict?", ask "given what each classifier predicts, what should the final answer be?" The meta-learner learns to weight each base classifier differently for different regions of the input space — using classifiers that are locally strong while discounting ones that are locally weak.
| Property | Stacking | Blending |
|---|---|---|
| Training method | K-fold cross-validation (OOF) | Fixed holdout set |
| Data efficiency | Uses all data for base learners | Sacrifices holdout data (10–20%) |
| Overfitting risk | Low (OOF predictions are honest) | Low-medium (holdout not seen by base) |
| Computation | K × base learner training | 1 × base learner training |
| Leakage risk | None (OOF prevents leakage) | None (holdout not used for base) |
| sklearn class | StackingClassifier | Manual implementation |
| Complexity | High | Medium |
2. The Core Idea — Learning to Combine
Voting ensemble: Fixed combination rule — take the average or majority.
F_voting(x) = (1/B) Σₜ ĥₜ(x) ← fixed, no learning
Stacking: Learned combination — train a meta-learner on the base learners' predictions.
F_stacking(x) = meta_learner( ĥ₁(x), ĥ₂(x), ..., ĥ_B(x) ) ← learned from data
The meta-learner sees what each base learner predicts and learns:
- Which classifier is more accurate on what type of example
- How to combine them to correct for each other's systematic errors
- Non-linear interactions between classifiers' predictions
The key insight: If Classifier A is consistently better on class 0 and Classifier B is consistently better on class 1, a meta-learner can learn this and weight them accordingly — something a fixed average cannot do.
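To make the contrast concrete, here is a minimal sketch on synthetic data. The two "base learner" probability columns are simulated (one informative, one pure noise) and are assumptions for illustration only; the point is that a learned combiner can discount the uninformative column while an equal-weight average cannot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)

# Hypothetical base-learner outputs: p_good tracks the label, p_noise does not.
p_good = np.clip(0.5 + 0.3 * (2 * y - 1) + rng.normal(0, 0.15, size=y.size), 0.01, 0.99)
p_noise = rng.uniform(0.01, 0.99, size=y.size)
P = np.column_stack([p_good, p_noise])

half = y.size // 2

# Voting: fixed equal weights, nothing is learned.
vote_pred = (P[half:].mean(axis=1) > 0.5).astype(int)

# Stacking-style: weights learned on the first half, evaluated on the second.
meta = LogisticRegression().fit(P[:half], y[:half])
stack_pred = meta.predict(P[half:])

vote_acc = (vote_pred == y[half:]).mean()
stack_acc = (stack_pred == y[half:]).mean()
w_good, w_noise = meta.coef_[0]
print(f"voting acc: {vote_acc:.3f}  stacking acc: {stack_acc:.3f}")
print(f"learned weights: good={w_good:.2f}, noise={w_noise:.2f}")
```

In this toy setup the learned weight on the noisy column should come out close to zero, which is exactly the behaviour a fixed average cannot reproduce.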
3. Stacking — Full Algorithm
3.1 Cross-Validated Out-of-Fold Predictions
The fundamental challenge in stacking: the meta-learner must train on the base learners' predictions — but if the base learners saw the meta-training data, their predictions would be overfit (optimistic). This is called target leakage.
Solution: Out-of-Fold (OOF) predictions via K-fold cross-validation.
Given: Training data D = {(x₁,y₁),...,(xₘ,yₘ)}, K folds, B base learners
Phase 1: Generate OOF meta-features
    Split D into K folds: D₁, D₂, ..., D_K
    OOF_predictions = empty array of shape (m, B × num_proba_cols)
    For b = 1 to B:
        For k = 1 to K:
            Train a fresh copy of base_b on D \ D_k (all data except fold k)
            Predict on D_k:
                OOF_predictions[D_k, b] = base_b.predict_proba(D_k)
    # Now OOF_predictions[i, b] = prediction of base_b for sample i
    # when base_b was NOT trained on sample i ← honest, no leakage
Phase 2: Train final base learners
    For b = 1 to B:
        base_b_final = train base_b on full D
        # This version is used for test-time prediction
Phase 3: Train meta-learner
    meta_learner.fit(OOF_predictions, y)
Phase 4: Prediction on test data
    For each test sample x:
        base_preds = [base_b_final.predict_proba(x) for b in 1..B]
        final_pred = meta_learner.predict(base_preds)
3.2 Why Out-of-Fold Is Essential
The leakage problem without OOF:
WRONG approach:
1. Train all base learners on D
2. Get their predictions on D (in-sample predictions)
3. Train meta-learner on these predictions
Problem: Base learners have memorized D → in-sample predictions are near-perfect
→ Meta-learner learns to trust base learners unconditionally
→ On test data, base learners are less accurate → meta-learner fails
Why OOF predictions are the correct solution:
For each sample i, OOF prediction is generated by a base learner that was not trained on sample i. The OOF prediction for sample i approximates what the base learner would predict on a genuinely unseen sample — the same distribution as test predictions.
The meta-learner therefore trains on predictions that have the same statistical properties as test-time predictions. No leakage, no optimistic bias.
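How close is the approximation? Each OOF prediction comes from a model trained on (K−1)/K of the data, while the final base learners are trained on all of it. As K grows toward m (leave-one-out cross-validation), the fold models approach the full-data model and the OOF predictions approach true out-of-sample predictions of that model, at the price of K times more fits. This is a heuristic argument, not a formal guarantee; K = 5 or 10 is the usual compromise.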
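The gap between in-sample and out-of-fold predictions is easy to see empirically. The sketch below uses a synthetic dataset and a random forest (both arbitrary choices for illustration) and compares the AUC of in-sample predictions with the AUC of OOF predictions from `cross_val_predict`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic data with label noise so that memorization is clearly visible.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# WRONG for stacking: predictions on the same rows the model was fitted on.
in_sample = rf.fit(X, y).predict_proba(X)[:, 1]

# RIGHT for stacking: each row is predicted by a model that never saw it.
oof = cross_val_predict(rf, X, y, cv=5, method="predict_proba")[:, 1]

in_sample_auc = roc_auc_score(y, in_sample)
oof_auc = roc_auc_score(y, oof)
print(f"in-sample AUC: {in_sample_auc:.3f}   OOF AUC: {oof_auc:.3f}")
```

A meta-learner trained on the in-sample column would see a base learner that looks far more reliable than it will be at test time.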
3.3 Training the Meta-Learner
Meta-features: The matrix fed to the meta-learner, shape (m, B × num_proba_cols):
For binary classification with B base learners using predict_proba:
Meta-features per sample i:
[P̂₁(y=0|xᵢ), P̂₁(y=1|xᵢ), P̂₂(y=0|xᵢ), P̂₂(y=1|xᵢ), ..., P̂_B(y=0|xᵢ), P̂_B(y=1|xᵢ)]
shape: (m, 2B)
Or using only the positive class probabilities (common for binary):
[P̂₁(y=1|xᵢ), P̂₂(y=1|xᵢ), ..., P̂_B(y=1|xᵢ)]
shape: (m, B)
Choice of meta-learner: The meta-learner trains on these meta-features:
meta_learner.fit(OOF_predictions, y_train)
Common choices:
- Logistic Regression (most common): Simple, regularized, interpretable. Learns linear combination of base predictions.
- Ridge/LASSO: Automatically selects useful base learners (LASSO zeros out weak ones).
- Gradient Boosting: Can learn non-linear combinations — very powerful but needs careful regularization.
- Neural Network: Ultimate flexibility — but strong regularization needed (early stopping, dropout).
- XGBoost/LightGBM: Strong non-linear meta-learner, works well in competitions.
3.4 Multi-Level Stacking
Stack multiple layers — the output of one stacking layer becomes the input to the next:
Layer 1 (base learners): LR, RF, GBT, SVM, kNN → OOF predictions₁
Layer 2 (stacking): XGBoost, LightGBM trained on OOF predictions₁ → OOF predictions₂
Layer 3 (meta-learner): Logistic Regression trained on OOF predictions₂
Does multi-level stacking help?
Empirically: yes, but with diminishing returns. Layer 1 → Layer 2 can provide meaningful improvement on competitive tasks. Layer 2 → Layer 3 often provides marginal gains and risks overfitting. More than 3 levels rarely helps.
The risk: Each additional layer requires another round of OOF generation. Computational cost multiplies, and the meta-features from deeper layers have lower effective training set size.
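One way to sketch a three-layer stack in scikit-learn is to nest one `StackingClassifier` inside another as the `final_estimator`; the inner stack then runs its own cross-validation on the layer-1 OOF predictions. The dataset, model choices, and hyperparameters below are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Layer 1: heterogeneous base learners.
layer1 = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=15)),
]

# Layers 2 and 3: an inner stack whose inputs are the layer-1 predictions.
layer2_and_3 = StackingClassifier(
    estimators=[
        ("gbt_l2", GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)),
        ("rf_l2", RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(C=0.1, max_iter=1000),  # layer 3
    cv=3,
)

multi = StackingClassifier(estimators=layer1, final_estimator=layer2_and_3, cv=3)
multi.fit(X, y)
proba = multi.predict_proba(X[:5])
print(proba.shape)
```

Note how the cost compounds: the outer CV refits every layer-1 model per fold, and the inner stack adds its own round of fold fits on top.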
4. Blending — The Holdout Variant
4.1 The Blending Algorithm
Blending is a simpler but slightly less data-efficient alternative to stacking:
Given: Training data D, holdout fraction h (e.g., 0.2)
Phase 1: Split
    D_train = (1-h) fraction of D
    D_holdout = h fraction of D
Phase 2: Train base learners on D_train
    For b = 1 to B:
        base_b.fit(D_train)
Phase 3: Generate meta-features on D_holdout
    meta_features_holdout = [base_b.predict_proba(D_holdout) for b in 1..B]
Phase 4: Train meta-learner on holdout predictions
    meta_learner.fit(meta_features_holdout, y_holdout)
Phase 5: Retrain base learners on all data (optional)
    For b = 1 to B:
        base_b_final.fit(D) # Use all data for final base learners
Phase 6: Test prediction
    For each test sample x:
        base_preds = [base_b_final.predict_proba(x) for b in 1..B]
        final_pred = meta_learner.predict(base_preds)
4.2 Blending vs. Stacking — Full Comparison
| Property | Stacking (OOF) | Blending (Holdout) |
|---|---|---|
| Meta-training data size | Full training set (m samples) | Holdout only (h·m samples) |
| Base learner training data | Full training set | Reduced (1-h) fraction |
| Leakage risk | None (OOF guarantee) | None (holdout not in base training) |
| Computation | K × B base fits + 1 meta fit | B base fits + 1 meta fit |
| Sensitivity to the split | Lower (every sample is held out once across K folds) | Higher (depends on one holdout split) |
| Statistical efficiency | Higher (uses all data) | Lower (wastes h fraction) |
| Risk of meta overfit | Lower (more meta-training data) | Higher (smaller meta-training set) |
| Implementation complexity | High | Low |
| sklearn support | ✅ StackingClassifier | ❌ Manual only |
| Best use case | When data is limited | When data is plentiful / fast iteration |
When to use blending:
- Data is large and losing 20% for meta-training is acceptable
- Fast iteration is needed (K-fold stacking takes K× longer)
- Competition setting where implementation speed matters
- Prototyping before committing to full stacking
5. What the Meta-Learner Learns
With logistic regression as the meta-learner:
P(y=1 | x) = sigmoid( w₁·P̂₁(y=1|x) + w₂·P̂₂(y=1|x) + ... + w_B·P̂_B(y=1|x) + b )
The learned weights wₜ reveal:
- Which base classifiers are trusted: Large |wₜ| → trusted; small |wₜ| → not trusted
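- Whether a classifier hurts: wₜ < 0 → holding the other predictions fixed, a higher score from this classifier is evidence against y=1. This often reflects collinearity among correlated base learners rather than a classifier that is wrong on its own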
- Relative contributions: The ratio of weights shows the relative trust
With gradient boosting as the meta-learner, it can learn:
"If classifier A predicts y=1 with high confidence AND classifier B predicts y=0,
then the correct answer is more likely y=0 because B is reliable in this region"
This conditional combination is impossible with a fixed voting rule.
Examining the meta-learner:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# `estimators` is a list of (name, estimator) pairs, as in Section 9
clf = StackingClassifier(estimators, final_estimator=LogisticRegression())
clf.fit(X_train, y_train)
# For logistic regression meta-learner
# (binary problem with predict_proba: one coefficient per base learner)
meta_lr = clf.final_estimator_
print("Meta-learner coefficients (base learner weights):")
for (name, _), coef in zip(estimators, meta_lr.coef_[0]):
    print(f" {name}: {coef:.4f}")
6. Choosing Base Learners
Cardinal rule: maximize diversity, maintain quality.
Diversity strategies:
- Algorithm diversity: LR + RF + GBT + SVM + kNN — each has different inductive biases
- Hyperparameter diversity: RF(n_estimators=100) + RF(n_estimators=500, deeper) — different regularization
- Feature subsets: Some classifiers see features A–G, others see features H–P
- Data transformation diversity: One RF on raw features, one on PCA-transformed features
- Scale diversity: One SVM on scaled features, one RF on original features
Quality threshold:
from sklearn.model_selection import cross_val_score
import numpy as np
# `candidates` is a list of (name, estimator) pairs to screen
# Only include classifiers above baseline
baseline_auc = 0.75 # Minimum acceptable AUC for inclusion
selected = []
for name, clf in candidates:
    cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc').mean()
    if cv_auc >= baseline_auc:
        selected.append((name, clf))
        print(f"✓ {name}: AUC = {cv_auc:.4f}")
    else:
        print(f"✗ {name}: AUC = {cv_auc:.4f} (below threshold)")
What NOT to include:
- Models that are clearly worse than random
- Models that are near-identical to existing ensemble members (high correlation)
- Models so computationally expensive that they slow down the pipeline prohibitively
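A quick way to spot near-identical ensemble members is to correlate their OOF predictions. The sketch below is illustrative: the dataset, the three candidates, and the `max_corr` cut-off of 0.98 are assumptions chosen for the example, not recommended values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
candidates = [
    ("rf_a", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("rf_b", RandomForestClassifier(n_estimators=100, random_state=1)),  # same model, new seed
    ("lr", LogisticRegression(max_iter=1000)),
]

# One column of OOF positive-class probabilities per candidate.
oof = np.column_stack([
    cross_val_predict(est, X, y, cv=5, method="predict_proba")[:, 1]
    for _, est in candidates
])
corr = np.corrcoef(oof, rowvar=False)  # (B, B) pairwise Pearson correlations

max_corr = 0.98  # hypothetical cut-off; tune per problem
names = [name for name, _ in candidates]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        flag = "  <- near-duplicate" if corr[i, j] > max_corr else ""
        print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:.3f}{flag}")
```

Pairs with very high correlation add cost without adding diversity, so keeping only one member of each such pair is usually a reasonable starting point.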
7. Choosing the Meta-Learner
Logistic Regression (most recommended):
from sklearn.linear_model import LogisticRegression
meta = LogisticRegression(C=0.1, max_iter=1000)
# C=0.1 (strong regularization) — prevents meta-learner from overfitting
Why logistic regression is the standard choice:
- Transparent weights show which base classifiers are trusted
- Strong regularization (C < 1) prevents meta-overfitting
- Very fast to train on the meta-feature matrix
- Less prone to overfitting than complex meta-learners
- Theoretical justification via Super Learning (see Section 12)
Ridge Classifier:
from sklearn.linear_model import RidgeClassifier
meta = RidgeClassifier(alpha=10.0)
Forces a smooth linear combination — useful when OOF meta-features are limited.
Gradient Boosting as meta-learner (aggressive):
from sklearn.ensemble import GradientBoostingClassifier
meta = GradientBoostingClassifier(n_estimators=100, max_depth=2, learning_rate=0.05)
Can capture non-linear combinations — more powerful but needs careful regularization (shallow trees, low learning rate) to avoid overfitting the small meta-training set.
LightGBM as meta-learner (competition setting):
import lightgbm as lgb
meta = lgb.LGBMClassifier(n_estimators=200, num_leaves=15, learning_rate=0.05)
Very fast, handles many meta-features, good regularization.
Rule: The meta-learner should be simpler and more regularized than the base learners. The base learners do the heavy lifting; the meta-learner learns to trust them appropriately.
8. Feature Engineering for Stacking
The meta-learner receives base learner predictions as input. You can enrich this with:
Option 1: Probabilities only (standard)
meta_features = [P̂₁(y=1|xᵢ), P̂₂(y=1|xᵢ), ..., P̂_B(y=1|xᵢ)]
Option 2: Probabilities + original features (passthrough)
# sklearn's StackingClassifier supports this with passthrough=True
clf = StackingClassifier(estimators, final_estimator=meta_lr, passthrough=True)
With passthrough=True, the meta-learner receives both the base predictions AND the original features. This allows the meta-learner to learn "in this region of feature space, trust classifier A more."
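The effect of `passthrough` can be checked by looking at how many input columns the fitted meta-learner received. This is a small sketch on synthetic data; `meta_input_width` is a hypothetical helper written for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

def meta_input_width(passthrough):
    """Fit a stack and return the number of columns seen by the meta-learner."""
    clf = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
        passthrough=passthrough,
    ).fit(X, y)
    return clf.final_estimator_.n_features_in_

width_plain = meta_input_width(False)  # one probability column per base learner
width_pass = meta_input_width(True)    # base predictions + the 8 original features
print(width_plain, width_pass)
```

For a binary problem scikit-learn keeps only the positive-class probability of each base learner, so the meta-learner sees B columns without passthrough and B plus the number of original features with it. The wider input gives the meta-learner more context but also more room to overfit.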
Option 3: Predictions + rank features
import numpy as np
from scipy.stats import rankdata
# Rank predictions across samples
meta_rank_features = np.column_stack([
    rankdata(oof_pred) / len(oof_pred)
    for oof_pred in oof_predictions.T
])
# Rank-transformed features are uniform on (0, 1], which helps regularize
Option 4: Prediction uncertainty features
# Entropy of each classifier's probability vector
# (base_proba_matrices: one (m, n_classes) probability array per base learner)
meta_entropy = np.column_stack([
    -np.sum(proba * np.log(proba + 1e-10), axis=1)
    for proba in base_proba_matrices
])
High entropy → classifier is uncertain → meta-learner can downweight it.
9. StackingClassifier in sklearn — Full API
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=300, n_jobs=-1)),
    ('gbt', GradientBoostingClassifier(n_estimators=200)),
    ('svm', SVC(probability=True))
]
clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=0.1), # Meta-learner
    cv=5, # K in K-fold cross-validation for OOF
    stack_method='predict_proba', # 'auto' (default), 'predict_proba', 'decision_function', 'predict'
    n_jobs=-1,
    passthrough=False, # If True, pass original features to meta-learner too
    verbose=0
)
clf.fit(X_train, y_train)
# Predictions
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)
# Access components
clf.estimators_ # List of fitted base estimators (fitted on full training data)
clf.named_estimators_ # Dict: name → fitted estimator
clf.final_estimator_ # Fitted meta-learner
# OOF predictions used to train meta-learner (not stored by default)
# To access: use cross_val_predict manually
Getting OOF predictions manually:
from sklearn.model_selection import cross_val_predict
import numpy as np
# Generate OOF predictions for each base classifier
# (loop variable named `est` so it does not shadow the fitted `clf` above)
oof_predictions = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for _, est in estimators
])
# shape: (m, B)
# Analyze OOF performance
from sklearn.metrics import roc_auc_score
for (name, _), oof_pred in zip(estimators, oof_predictions.T):
    auc = roc_auc_score(y_train, oof_pred)
    print(f"{name} OOF AUC: {auc:.4f}")
10. Stacking for Regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
estimators = [
    ('rf', RandomForestRegressor(n_estimators=200, n_jobs=-1)),
    ('gbt', GradientBoostingRegressor(n_estimators=200, learning_rate=0.05)),
    ('svr', SVR(kernel='rbf', C=1.0))
]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
    n_jobs=-1,
    passthrough=False
)
reg.fit(X_train, y_train)
Meta-learner for regression:
- Ridge regression (most common — prevents meta-overfitting)
- ElasticNet (automatic base learner selection via L1)
- LightGBM (if non-linear combination is needed)
11. The Bias-Variance Profile
Stacking improves over individual base learners by reducing both bias and variance:
Bias reduction: The meta-learner can correct systematic errors of individual base classifiers. If RF consistently overestimates class 1 probability by 0.05, the meta-learner learns to discount RF's positive predictions.
Variance reduction: The meta-learner combines multiple predictions, so the variance of the combination is typically lower than that of an individual component, provided the base learners' errors are not perfectly correlated.
The improvement depends on:
- Diversity: Higher diversity (less correlated errors) → larger variance reduction from combining
- Individual quality: Higher individual accuracy → better meta-features → better meta-model
- Meta-learning: More regularized meta-learner → lower variance of combination
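The role of diversity can be made explicit with a standard identity for the equal-weight average, stated here as a simplifying sketch: assume each of the B base predictions ĥ_b(x) has the same variance σ² and every pair has the same correlation ρ (neither assumption holds exactly in practice, and a learned meta-learner uses unequal weights).

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{h}_b(x)\right)
  = \frac{1}{B^2}\Bigl(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\Bigr)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as B grows, but the first does not: with highly correlated base learners (ρ near 1) adding more models barely helps, while lower ρ lets the combination's variance fall well below σ².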
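Practical expectation: As a rough rule of thumb rather than a guarantee, stacking 5 diverse classifiers improves AUC by about 2–5% over the best single classifier, roughly 2× the gain from voting. Actual gains depend heavily on the dataset and on how diverse the base learners are.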
12. Theoretical Justification — Super Learning
Stacking is a specific case of Super Learning (van der Laan, Polley, Hubbard, 2007) — a theoretically grounded ensemble method with optimality guarantees.
The Oracle Inequality for Super Learning:
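Let L_oracle be the risk of the oracle: the candidate (or combination of candidates) that would perform best if the true data distribution were known. Let L_SL be the risk of the super learner. In simplified form, for a bounded loss:
E[L_SL] ≤ E[L_oracle] + C · (log B) / n
Where C is a constant, B is the number of candidate learners, and n is the sample size. (The full statement carries a (1 + δ) factor on the oracle term, with C depending on δ.) As n → ∞:
L_SL / L_oracle → 1 (asymptotic optimality)
The super learner converges to the oracle, the best predictor achievable from the given candidate set, with a remainder that grows only logarithmically in the number of candidates.
Practical implication: With sufficient data, stacking approaches the best combination of the given base classifiers. Asymptotically it does no worse than any individual base classifier (the meta-learner can zero-weight poor classifiers); in finite samples a poorly regularized meta-learner can still underperform the best base model.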
Connection to cross-validated model selection: Model selection (picking the single best classifier by CV) is equivalent to stacking with a 0-1 weight vector. Stacking generalizes model selection by finding optimal non-integer weights.
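The relationship between model selection and stacking can be illustrated with non-negative least squares, a combiner often used in super-learner implementations. The OOF columns below are simulated (an assumption for illustration), and the comparison is made on the same data the weights were fitted on, so it demonstrates the nesting of the two search spaces rather than generalization.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500).astype(float)

# Hypothetical OOF probabilities from three base learners of varying quality.
noise_levels = [0.25, 0.30, 0.45]
Z = np.column_stack([
    np.clip(y + rng.normal(0, s, size=y.size), 0, 1) for s in noise_levels
])

# Model selection: a one-hot weight vector on the single best column.
single_mse = ((Z - y[:, None]) ** 2).mean(axis=0)
best = int(single_mse.argmin())

# Stacking via NNLS: real-valued weights constrained to be >= 0.
weights, _ = nnls(Z, y)
stacked_mse = ((Z @ weights - y) ** 2).mean()

print(f"best single column: {best}, MSE = {single_mse[best]:.4f}")
print(f"NNLS weights: {np.round(weights, 3)}, MSE = {stacked_mse:.4f}")
```

Every one-hot vector is a feasible NNLS solution, so the fitted combination can never have a higher in-sample squared error than the best single column. Whether that advantage survives on new data depends on the amount of meta-training data and on regularization.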
13. Assumptions
| Assumption | Notes |
|---|---|
| IID samples | Required for OOF cross-validation validity |
| Exchangeability of folds | OOF predictions are only "honest" if folds are exchangeable |
| No temporal structure | For time-series data, use time-based CV for OOF generation |
| Sufficient meta-training data | Meta-learner needs enough OOF predictions to learn from |
| Base learner quality | At least some base learners must be better than random |
| Calibrated base probabilities | For effective meta-learning — miscalibrated probs mislead meta |
| No target leakage | OOF ensures this; blending ensures it via holdout |
14. Advantages
✅ Learns Optimal Combination
Unlike voting (fixed combination), stacking learns which classifiers to trust and when — exploiting conditional expertise.
✅ Asymptotically Optimal (Super Learning)
Oracle inequality guarantees convergence to the best possible predictor from the candidate set.
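✅ Rarely Worse Than Best Base Learner (with regularized meta)
A regularized linear meta-learner can shrink the weights of poor classifiers toward zero (an L1 penalty can set them exactly to zero; a ridge penalty only shrinks them), so with enough meta-training data it falls back to roughly the best individual model. This holds asymptotically, not as a finite-sample guarantee.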
✅ Captures Non-Linear Combinations (with powerful meta-learner)
GBT or neural network meta-learner can learn "in this feature region, use RF; in that region, use SVM" — impossible with voting.
✅ Roughly 2× the Gain of Voting
As a rule of thumb, stacking yields about twice the improvement of a voting ensemble over the best single model (see Section 11); results vary by dataset.
✅ Naturally Handles Model Uncertainty
The meta-learner's probability outputs reflect the collective uncertainty of all base models.
✅ Feature Passthrough for Context-Aware Combination
passthrough=True allows the meta-learner to use original features for context — "in this region of feature space, trust these classifiers more."
15. Drawbacks & Limitations
❌ Computationally Expensive
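K-fold stacking with B base learners requires K·B + B + 1 total model fits (5B + B + 1 for K=5): K·B fold fits for OOF generation, B final fits on the full data, and 1 meta-learner fit. For expensive base learners (large random forests, SVMs), training takes roughly K + 1 times as long as fitting each base learner once.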
❌ Leakage Risk Without OOF
Fitting base learners on training data and then using their in-sample predictions to train the meta-learner is a classic data leakage pattern. Requires careful implementation.
❌ Small Datasets Problematic
With m ≤ 1000 samples, 5-fold CV leaves at most 800 samples to train each fold's base learners, and the meta-learner sees at most 1000 rows. Weaker base learners → noisier meta-features.
❌ Time-Series/Ordered Data Requires Careful CV
Standard K-fold CV (shuffled or not) places observations from after the validation fold into the training folds, which is invalid for time series (the future leaks into the past). Use TimeSeriesSplit or expanding-window CV instead.
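The difference between the two splitters can be verified directly on index arrays. This is a small sketch on 20 dummy time-ordered rows; `trains_on_future` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

def trains_on_future(splitter):
    """True if any split trains on an index later than its earliest validation index."""
    return any(train.max() > val.min() for train, val in splitter.split(X))

kfold_leaks = trains_on_future(KFold(n_splits=4))
tscv_leaks = trains_on_future(TimeSeriesSplit(n_splits=4))
print(f"KFold trains on future rows: {kfold_leaks}")
print(f"TimeSeriesSplit trains on future rows: {tscv_leaks}")

# Side effect of time-ordered splitting: the earliest block is never validated,
# so those rows receive no OOF prediction.
validated = np.concatenate([val for _, val in TimeSeriesSplit(n_splits=4).split(X)])
never_validated = np.setdiff1d(np.arange(len(X)), validated)
print(f"rows without OOF predictions: {never_validated}")
```

Because the first block has no OOF predictions, the meta-learner should be trained only on the rows that were actually validated.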
❌ Meta-Learner Can Overfit
If the meta-learner is too complex (deep GBT with many trees on small meta-feature matrix), it overfits the OOF predictions. Always regularize the meta-learner more aggressively than the base learners.
❌ Complex Pipeline Maintenance
A stacking pipeline with 5 base learners, OOF generation, and a meta-learner has many moving parts. Production deployment requires careful engineering.
❌ Interpretability Loss
The two-level architecture makes explanation harder than any single model or simple voting ensemble.
16. Stacking vs. Voting vs. Bagging vs. Boosting
| Property | Stacking | Voting | Bagging | Boosting |
|---|---|---|---|---|
| Learns combination | ✅ Yes | ❌ Fixed rule | ❌ Fixed average | ❌ Fixed (αₜ weights) |
| Base learner types | Heterogeneous | Heterogeneous | Homogeneous | Homogeneous |
| Training order | Parallel (base) + Sequential (meta) | Parallel | Parallel | Sequential |
| Error type targeted | Both bias + variance | Variance | Variance | Bias |
| Typical gain over best | 2–5% | 1–3% | 5–20% | Varies (high) |
| Overfitting risk | Low (with OOF) | Low | Very low | Medium |
| Implementation complexity | High | Low | Low | Medium |
| sklearn support | ✅ StackingClassifier | ✅ VotingClassifier | ✅ BaggingClassifier | ✅ AdaBoost/GBT |
17. Practical Tips & Gotchas
Basic Stacking Setup
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
estimators = [
    ('rf', RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
    ('gbt', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=42)),
    ('svm', SVC(probability=True, kernel='rbf', C=1.0, random_state=42))
]
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=0.1, max_iter=1000),
    cv=5,
    stack_method='predict_proba',
    n_jobs=-1
)
stacking_clf.fit(X_train, y_train)
print(f"Stacking AUC: {roc_auc_score(y_test, stacking_clf.predict_proba(X_test)[:,1]):.4f}")
Manual OOF Stacking (More Control)
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
# Assumes X_train, y_train, X_test, y_test are NumPy arrays (use .iloc for pandas)
def generate_oof_predictions(estimators, X, y, n_folds=5):
    """Return an (n_samples, n_estimators) array of out-of-fold P(y=1)."""
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    oof_preds = np.zeros((len(y), len(estimators)))
    for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr = y[train_idx]
        for clf_idx, (name, clf) in enumerate(estimators):
            clf_fold = clone(clf)  # fresh, unfitted copy for this fold
            clf_fold.fit(X_tr, y_tr)
            oof_preds[val_idx, clf_idx] = clf_fold.predict_proba(X_val)[:, 1]
        print(f"Fold {fold+1}/{n_folds} complete")
    return oof_preds
# Generate OOF
oof = generate_oof_predictions(estimators, X_train, y_train, n_folds=5)
# Individual OOF performance
for (name, _), col in zip(estimators, oof.T):
    auc = roc_auc_score(y_train, col)
    print(f"{name} OOF AUC: {auc:.4f}")
# Train meta-learner on OOF
meta = LogisticRegression(C=0.1, max_iter=1000)
meta.fit(oof, y_train)
# Train final base learners on full data
final_clfs = [(name, clone(clf).fit(X_train, y_train)) for name, clf in estimators]
# Test predictions
test_meta_features = np.column_stack([
    clf.predict_proba(X_test)[:, 1]
    for _, clf in final_clfs
])
y_pred = meta.predict_proba(test_meta_features)[:, 1]
print(f"Stacking AUC: {roc_auc_score(y_test, y_pred):.4f}")
Blending Implementation
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Step 1: Split
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)
# Step 2: Train base learners on X_base
fitted_base = [(name, clone(clf).fit(X_base, y_base)) for name, clf in estimators]
# Step 3: Generate holdout meta-features
hold_meta = np.column_stack([
    clf.predict_proba(X_hold)[:, 1]
    for _, clf in fitted_base
])
# Step 4: Train meta-learner on holdout
meta = LogisticRegression(C=0.1)
meta.fit(hold_meta, y_hold)
# Step 5: Retrain base learners on ALL training data
final_base = [(name, clone(clf).fit(X_train, y_train)) for name, clf in estimators]
# Step 6: Test prediction
test_meta = np.column_stack([
    clf.predict_proba(X_test)[:, 1]
    for _, clf in final_base
])
y_pred = meta.predict_proba(test_meta)[:, 1]
Time-Series Safe Stacking
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit
def generate_oof_timeseries(estimators, X, y, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    # NaN marks rows that never fall in a validation fold (the earliest block)
    oof_preds = np.full((len(y), len(estimators)), np.nan)
    for train_idx, val_idx in tscv.split(X):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr = y[train_idx]
        for clf_idx, (name, clf) in enumerate(estimators):
            clf_fold = clone(clf).fit(X_tr, y_tr)
            oof_preds[val_idx, clf_idx] = clf_fold.predict_proba(X_val)[:, 1]
    return oof_preds
# Fit the meta-learner only on rows that received an OOF prediction:
# mask = ~np.isnan(oof).any(axis=1); meta.fit(oof[mask], y_train[mask])
18. Competition-Level Stacking Strategy
In Kaggle and data science competitions, stacking is often the final step before submission. The typical competition stacking workflow:
Layer 1 — Diverse Base Models
# Include diverse algorithm families
layer1_models = [
    # Tree-based
    ('rf', RandomForestClassifier(n_estimators=1000, ...)),
    ('xgb', XGBClassifier(...)),
    ('lgb', LGBMClassifier(...)),
    ('cat', CatBoostClassifier(...)),
    # Linear
    ('lr', LogisticRegression(...)),
    # Non-parametric
    ('knn', KNeighborsClassifier(n_neighbors=15)),
    # Kernel
    ('svm', SVC(kernel='rbf', probability=True)),
]
Layer 1 OOF Generation
# Use 10-fold CV for more stable OOF predictions
oof_l1 = generate_oof_predictions(layer1_models, X_train, y_train, n_folds=10)
Layer 2 — Second-Level Models
layer2_models = [
    ('xgb_l2', XGBClassifier(max_depth=3, n_estimators=200, ...)),
    ('lgb_l2', LGBMClassifier(num_leaves=15, n_estimators=200, ...)),
]
oof_l2 = generate_oof_predictions(layer2_models, oof_l1, y_train, n_folds=10)
Layer 3 — Meta-Learner
meta = LogisticRegression(C=0.05) # Heavy regularization at top level
meta.fit(oof_l2, y_train)
Final Test Prediction
# Average K-fold test predictions for each model (standard competition technique)
def avg_test_predictions(estimators, X_train, y_train, X_test, n_folds=10):
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    test_preds = np.zeros((len(X_test), len(estimators)))
    for fold, (train_idx, _) in enumerate(kf.split(X_train, y_train)):
        for clf_idx, (name, clf) in enumerate(estimators):
            clf_fold = clone(clf).fit(X_train[train_idx], y_train[train_idx])
            test_preds[:, clf_idx] += clf_fold.predict_proba(X_test)[:, 1]
    return test_preds / n_folds # Average over K folds
19. When to Use It
Use Stacking when:
- You have multiple diverse, individually strong classifiers and want to squeeze out maximum performance
- You're in a competition where 2–5% accuracy gain is significant
- Data is large enough (n > 5000) for OOF predictions to be reliable
- You have the computational budget for K × B base learner fits
- Non-linear combination of classifiers would help (use GBT meta-learner)
- You want the theoretically optimal ensemble for your candidate model set
Use Blending instead when:
- Data is plentiful and losing 20% to holdout is acceptable
- You need faster iteration (1 base fit per model instead of K)
- You're prototyping before committing to full stacking
Use Voting instead when:
- Simplicity is preferred and 1–3% gain is acceptable
- You don't have OOF infrastructure set up
- Computational budget is limited
Do NOT use Stacking when:
- Data is very small (n < 1000) — OOF predictions are too noisy
- Base classifiers are weakly performing — garbage meta-features → garbage meta-model
- All classifiers are highly correlated — no diversity to exploit
- No time/budget for K-fold retraining of all base models
Summary
┌──────────────────────────────────────────────────────────────────────┐
│ STACKING & BLENDING AT A GLANCE │
├──────────────────────────────────────────────────────────────────────┤
│ CORE IDEA Meta-learner learns optimal combination of base preds │
│ OOF KEY Cross-validated OOF prevents leakage — honest preds │
│ META INPUT [P̂₁(y=1|x), ..., P̂_B(y=1|x)] → meta features │
│ META LEARNER LR (default) → GBT (aggressive) — always regularize │
│ GAIN 2–5% over best single model (2× voting) │
│ THEORY Super learning oracle inequality → asymptotic optimum │
│ │
│ BLENDING Holdout (not OOF) — faster, less data-efficient │
│ BLENDING vs Stacking: +data efficiency; Blending: +speed │
│ STACKING │
│ │
│ STRENGTH Learns combination, 2× voting gain, asymp. optimal │
│ WEAKNESS K× training cost, complex, small data problematic │
│ BEST FOR Competitions, max accuracy, heterogeneous base models │
└──────────────────────────────────────────────────────────────────────┘
Stacking is what happens when you take the voting ensemble idea seriously enough to ask: "Who decides the weights?" Voting says the weights are equal and fixed. Stacking says the weights should be learned from data — and that learning can be conditional on the input, non-linear, and adaptive to each base classifier's specific failure modes. The OOF mechanism is the key engineering insight: it gives the meta-learner honest, unbiased predictions to learn from, transforming what could be a leakage disaster into a principled, theoretically grounded algorithm. Super learning shows this isn't just a heuristic — it's an asymptotically optimal procedure. The 2–5% accuracy gain it provides in practice is often the difference between a good model and a winning solution.