Stacking & Blending

Stacked Generalization and Holdout-Based Ensembling

"Don't just average your experts. Train a meta-expert to know when each expert is right."

1. What Are Stacking and Blending?

Stacking (Stacked Generalization) is an ensemble method introduced by Wolpert (1992) that trains a meta-learner to optimally combine the predictions of multiple base learners. Unlike voting (which uses a fixed combination rule — average or majority), stacking learns the combination from data.

Blending is a simpler variant of stacking that uses a single holdout set for meta-training instead of cross-validated out-of-fold predictions.

Both share the same philosophy: instead of asking "what does each classifier predict?", ask "given what each classifier predicts, what should the final answer be?" The meta-learner learns to weight each base classifier differently for different regions of the input space — using classifiers that are locally strong while discounting ones that are locally weak.

Property         | Stacking                                         | Blending
-----------------|--------------------------------------------------|--------------------------------------
Training method  | K-fold cross-validation (OOF)                    | Fixed holdout set
Data efficiency  | Uses all data for base learners                  | Sacrifices holdout data (10–20%)
Overfitting risk | Low (OOF predictions are honest)                 | Low-medium (holdout not seen by base)
Computation      | K × base learner training (plus one full refit)  | 1 × base learner training
Leakage risk     | None (OOF prevents leakage)                      | None (holdout not used for base)
sklearn class    | StackingClassifier                               | Manual implementation
Complexity       | High                                             | Medium

2. The Core Idea — Learning to Combine

Voting ensemble: Fixed combination rule — take the average or majority.

F_voting(x) = (1/B) Σ_{b=1..B} ĥ_b(x)     ← fixed, no learning

Stacking: Learned combination — train a meta-learner on the base learners' predictions.

F_stacking(x) = meta_learner( ĥ₁(x), ĥ₂(x), ..., ĥ_B(x) )   ← learned from data

The meta-learner sees what each base learner predicts and learns how much to trust each one.

The key insight: If Classifier A is consistently better on class 0 and Classifier B is consistently better on class 1, a meta-learner can learn this and weight them accordingly — something a fixed average cannot do.
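The contrast between a fixed and a learned combination can be sketched with scikit-learn. This is a minimal illustration on synthetic data (the dataset, the two base learners, and all variable names are assumptions chosen for speed, not part of the original text). It checks that soft voting is exactly the unweighted mean of the base probabilities, while stacking fits one weight per base learner.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

base = [
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

# Fixed rule: soft voting averages the base probabilities with equal weights.
voting = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)
voting_proba = voting.predict_proba(X_te)
manual_avg = np.mean([est.predict_proba(X_te) for est in voting.estimators_], axis=0)

# Learned rule: a logistic regression is fitted on out-of-fold base predictions.
stacking = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression(), cv=5
).fit(X_tr, y_tr)
stacking_proba = stacking.predict_proba(X_te)

# One learned weight per base learner (binary task, predict_proba stacking).
learned_weights = stacking.final_estimator_.coef_[0]
print("learned weights:", dict(zip([name for name, _ in base], learned_weights)))
```

The learned weights will generally differ from each other, which is the point: the data, not a fixed rule, decides how much each model contributes.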


3. Stacking — Full Algorithm

3.1 Cross-Validated Out-of-Fold Predictions

The fundamental challenge in stacking: the meta-learner must train on the base learners' predictions, but if the base learners have already seen the meta-training data, their predictions on it are overfit (optimistic). Training the meta-learner on such in-sample predictions is a form of data leakage.

Solution: Out-of-Fold (OOF) predictions via K-fold cross-validation.

Given: Training data D = {(x₁,y₁),...,(xₘ,yₘ)}, K folds, B base learners

Phase 1 — Generate OOF meta-features:

    Split D into K folds: D₁, D₂, ..., D_K

    OOF_predictions = empty array of shape (m, B × num_proba_cols)

    For b = 1 to B:
        For k = 1 to K:
            Train base_b on D \ D_k  (all data except fold k)
            Predict on D_k:
                OOF_predictions[D_k, b] = base_b.predict_proba(D_k)

    # Now OOF_predictions[i, b] = prediction of base_b for sample i
    # when base_b was NOT trained on sample i  ← honest, no leakage

Phase 2 — Train final base learners:

    For b = 1 to B:
        base_b_final = train base_b on full D
        # This version is used for test-time prediction

Phase 3 — Train meta-learner:

    meta_learner.fit(OOF_predictions, y)

Phase 4 — Prediction on test data:

    For each test sample x:
        base_preds = [base_b_final.predict_proba(x) for b in 1..B]
        final_pred = meta_learner.predict(base_preds)

3.2 Why Out-of-Fold Is Essential

The leakage problem without OOF:

WRONG approach:
  1. Train all base learners on D
  2. Get their predictions on D (in-sample predictions)
  3. Train meta-learner on these predictions

Problem: Base learners have memorized D → in-sample predictions are near-perfect
→ Meta-learner learns to trust base learners unconditionally
→ On test data, base learners are less accurate → meta-learner fails

Why OOF predictions are the correct solution:

For each sample i, OOF prediction is generated by a base learner that was not trained on sample i. The OOF prediction for sample i approximates what the base learner would predict on a genuinely unseen sample — the same distribution as test predictions.

The meta-learner therefore trains on predictions that have the same statistical properties as test-time predictions. No leakage, no optimistic bias.
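The gap between in-sample and out-of-fold predictions is easy to observe. The sketch below uses a synthetic dataset with label noise and a small random forest (both are assumptions for illustration) and compares the AUC of predictions made on rows the model was trained on with the AUC of OOF predictions from `cross_val_predict`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Noisy synthetic data: flip_y randomly reassigns 20% of the labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)

# In-sample: the forest predicts rows it was trained on.
in_sample = rf.fit(X, y).predict_proba(X)[:, 1]

# Out-of-fold: each row is predicted by a forest that never saw it.
oof = cross_val_predict(rf, X, y, cv=5, method="predict_proba")[:, 1]

auc_in = roc_auc_score(y, in_sample)
auc_oof = roc_auc_score(y, oof)
print(f"in-sample AUC: {auc_in:.3f}   OOF AUC: {auc_oof:.3f}")
```

A meta-learner trained on the in-sample column would see a far more reliable base learner than the one it meets at test time; the OOF column reflects the reliability it will actually encounter.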

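Effect of K: With K folds, each fold model trains on a (K−1)/K fraction of D, so its predictions are typically slightly less accurate than those of the final model trained on all of D. Increasing K narrows this gap, and the limiting case K = m is leave-one-out cross-validation, at the cost of many more fits. This is a practical approximation rather than a formal guarantee.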


3.3 Training the Meta-Learner

Meta-features: The matrix fed to the meta-learner, shape (m, B × num_proba_cols):

For binary classification with B base learners using predict_proba:

Meta-features per sample i:
[P̂₁(y=0|xᵢ), P̂₁(y=1|xᵢ), P̂₂(y=0|xᵢ), P̂₂(y=1|xᵢ), ..., P̂_B(y=0|xᵢ), P̂_B(y=1|xᵢ)]
shape: (m, 2B)

Or using only the positive class probabilities (common for binary):

[P̂₁(y=1|xᵢ), P̂₂(y=1|xᵢ), ..., P̂_B(y=1|xᵢ)]
shape: (m, B)
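The two layouts can be checked with a few lines of NumPy. The probabilities here are random placeholders (an assumption for illustration); only the shapes and the redundancy between the two columns of each learner matter.

```python
import numpy as np

rng = np.random.default_rng(0)
m, B = 6, 3  # toy sizes: 6 samples, 3 base learners

# Hypothetical positive-class probabilities, one column per base learner.
p1 = rng.uniform(size=(m, B))

# Each learner's predict_proba-style output: columns [P(y=0), P(y=1)].
proba_per_learner = [np.column_stack([1 - p1[:, b], p1[:, b]]) for b in range(B)]

full_meta = np.hstack(proba_per_learner)                               # (m, 2B)
positive_only = np.column_stack([p[:, 1] for p in proba_per_learner])  # (m, B)

print(full_meta.shape, positive_only.shape)
```

Because P(y=0) = 1 − P(y=1), the first column of each pair carries no extra information in the binary case, which is why the positive-class-only layout is the usual choice.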

Choice of meta-learner: The meta-learner trains on these meta-features:

meta_learner.fit(OOF_predictions, y_train)

Common choices include regularized logistic regression, ridge, and gradient boosting; Section 7 compares them.


3.4 Multi-Level Stacking

Stack multiple layers — the output of one stacking layer becomes the input to the next:

Layer 1 (base learners):  LR, RF, GBT, SVM, kNN → OOF predictions₁
Layer 2 (stacking):       XGBoost, LightGBM trained on OOF predictions₁ → OOF predictions₂
Layer 3 (meta-learner):   Logistic Regression trained on OOF predictions₂

Does multi-level stacking help?

Empirically: yes, but with diminishing returns. Layer 1 → Layer 2 can provide meaningful improvement on competitive tasks. Layer 2 → Layer 3 often provides marginal gains and risks overfitting. More than 3 levels rarely helps.

The risk: Each additional layer requires another round of OOF generation. Computational cost multiplies, and the meta-features from deeper layers have lower effective training set size.
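In scikit-learn, one way to build more than one stacking layer is to pass a `StackingClassifier` as the `final_estimator` of another. The sketch below assumes small, fast models and synthetic data so that it runs quickly; the layer contents are illustrative and do not reproduce the LR/RF/GBT/SVM/kNN line-up above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Layers 2 and 3: a small stack that consumes the layer-1 predictions.
upper_layers = StackingClassifier(
    estimators=[
        ("rf_l2", RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)),
        ("tree_l2", DecisionTreeClassifier(max_depth=2, random_state=0)),
    ],
    final_estimator=LogisticRegression(C=0.1),
    cv=3,
)

# Layer 1: diverse base learners whose OOF predictions feed upper_layers.
multi_level = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
    ],
    final_estimator=upper_layers,
    cv=3,
)
multi_level.fit(X, y)
proba = multi_level.predict_proba(X[:5])
print(proba.shape)
```

Each nested layer runs its own internal cross-validation, which is where the multiplied training cost comes from.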


4. Blending — The Holdout Variant

4.1 The Blending Algorithm

Blending is a simpler but slightly less data-efficient alternative to stacking:

Given: Training data D, holdout fraction h (e.g., 0.2)

Phase 1 — Split:
    D_train = (1-h) fraction of D
    D_holdout = h fraction of D

Phase 2 — Train base learners on D_train:
    For b = 1 to B:
        base_b.fit(D_train)

Phase 3 — Generate meta-features on D_holdout:
    meta_features_holdout = [base_b.predict_proba(D_holdout) for b in 1..B]

Phase 4 — Train meta-learner on holdout predictions:
    meta_learner.fit(meta_features_holdout, y_holdout)

Phase 5 — Retrain base learners on all data (optional):
    For b = 1 to B:
        base_b_final.fit(D)   # Use all data for final base learners

Phase 6 — Test prediction:
    For each test sample x:
        base_preds = [base_b_final.predict_proba(x) for b in 1..B]
        final_pred = meta_learner.predict(base_preds)

4.2 Blending vs. Stacking — Full Comparison

Property                   | Stacking (OOF)                                     | Blending (Holdout)
---------------------------|----------------------------------------------------|----------------------------------------------
Meta-training data size    | Full training set (m samples)                      | Holdout only (h·m samples)
Base learner training data | Full training set                                  | Reduced (1-h) fraction
Leakage risk               | None (OOF guarantee)                               | None (holdout not in base training)
Computation                | K × B fold fits + B full refits + 1 meta fit       | B base fits (2B with optional refit) + 1 meta fit
Sensitivity to the split   | Lower (predictions pooled over K folds)            | Higher (depends on one holdout split)
Statistical efficiency     | Higher (uses all data)                             | Lower (wastes h fraction)
Risk of meta overfit       | Lower (more meta-training data)                    | Higher (smaller meta-training set)
Implementation complexity  | High                                               | Low
sklearn support            | ✅ StackingClassifier                              | ❌ Manual only
Best use case              | When data is limited                               | When data is plentiful / fast iteration

When to use blending:


5. What the Meta-Learner Learns

With logistic regression as the meta-learner:

P(y=1 | x) = sigmoid( w₁·P̂₁(y=1|x) + w₂·P̂₂(y=1|x) + ... + w_B·P̂_B(y=1|x) + b )
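A worked instance of this formula, with made-up numbers: the weights, intercept, and base probabilities below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical learned weights and intercept for B = 3 base learners.
w = np.array([2.0, 1.0, 0.5])
b = -1.5

# Positive-class probabilities from the three base learners for one sample x.
p = np.array([0.9, 0.6, 0.2])

z = w @ p + b                       # 1.8 + 0.6 + 0.1 - 1.5 = 1.0
p_final = 1.0 / (1.0 + np.exp(-z))  # sigmoid(1.0)
print(f"z = {z:.2f}, P(y=1|x) = {p_final:.4f}")  # z = 1.00, P(y=1|x) = 0.7311
```

The first learner dominates the result because its weight is largest; the third learner's low probability barely moves the output.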

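The learned weights w_b reveal how much the meta-learner relies on each base learner: a large positive weight marks a model it trusts, and a weight near zero marks one that adds little beyond the others.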

With gradient boosting as the meta-learner, it can learn:

"If classifier A predicts y=1 with high confidence AND classifier B predicts y=0,
 then the correct answer is more likely y=0 because B is reliable in this region"

This conditional combination is impossible with a fixed voting rule.

Examining the meta-learner:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

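# `estimators` is a list of (name, estimator) pairs, as defined in Section 9
clf = StackingClassifier(estimators, final_estimator=LogisticRegression())
clf.fit(X_train, y_train)

# For a logistic regression meta-learner on a binary task
# (predict_proba stacking yields one coefficient per base learner)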
meta_lr = clf.final_estimator_
print("Meta-learner coefficients (base learner weights):")
for (name, _), coef in zip(estimators, meta_lr.coef_[0]):
    print(f"  {name}: {coef:.4f}")

6. Choosing Base Learners

Cardinal rule: maximize diversity, maintain quality.

Diversity strategies:

Quality threshold:

from sklearn.model_selection import cross_val_score
import numpy as np

# Only include classifiers above baseline
baseline_auc = 0.75  # Minimum acceptable AUC for inclusion

selected = []
for name, clf in candidates:
    cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc').mean()
    if cv_auc >= baseline_auc:
        selected.append((name, clf))
        print(f"✓ {name}: AUC = {cv_auc:.4f}")
    else:
        print(f"✗ {name}: AUC = {cv_auc:.4f} — below threshold")

What NOT to include:


7. Choosing the Meta-Learner

Logistic Regression (most recommended):

from sklearn.linear_model import LogisticRegression

meta = LogisticRegression(C=0.1, max_iter=1000)
# C=0.1 (strong regularization) — prevents meta-learner from overfitting

Why logistic regression is the standard choice:

Ridge Classifier:

from sklearn.linear_model import RidgeClassifier
meta = RidgeClassifier(alpha=10.0)

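Forces a smooth linear combination, which is useful when OOF meta-features are limited. Note that RidgeClassifier has no predict_proba, so a stack built on it returns class labels or decision scores rather than probabilities.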

Gradient Boosting as meta-learner (aggressive):

from sklearn.ensemble import GradientBoostingClassifier
meta = GradientBoostingClassifier(n_estimators=100, max_depth=2, learning_rate=0.05)

Can capture non-linear combinations — more powerful but needs careful regularization (shallow trees, low learning rate) to avoid overfitting the small meta-training set.

LightGBM as meta-learner (competition setting):

import lightgbm as lgb
meta = lgb.LGBMClassifier(n_estimators=200, num_leaves=15, learning_rate=0.05)

Very fast, handles many meta-features, good regularization.

Rule: The meta-learner should be simpler and more regularized than the base learners. The base learners do the heavy lifting; the meta-learner learns to trust them appropriately.


8. Feature Engineering for Stacking

The meta-learner receives base learner predictions as input. You can enrich this with:

Option 1: Probabilities only (standard)

meta_features = [P̂₁(y=1|xᵢ), P̂₂(y=1|xᵢ), ..., P̂_B(y=1|xᵢ)]

Option 2: Probabilities + original features (passthrough)

# sklearn's StackingClassifier supports this with passthrough=True
clf = StackingClassifier(estimators, final_estimator=LogisticRegression(C=0.1), passthrough=True)

With passthrough=True, the meta-learner receives both the base predictions AND the original features. This allows the meta-learner to learn "in this region of feature space, trust classifier A more."
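One way to confirm what the meta-learner receives is to count the columns of its coefficient matrix. The sketch below assumes a binary task, two simple base learners, and a synthetic dataset with 6 features; `meta_input_width` is a hypothetical helper written for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
base = [
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

def meta_input_width(passthrough):
    """Fit a stack and return the number of input columns seen by the meta-learner."""
    stack = StackingClassifier(
        estimators=base,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
        passthrough=passthrough,
    ).fit(X, y)
    # coef_ has one column per input feature of the meta-learner.
    return stack.final_estimator_.coef_.shape[1]

width_plain = meta_input_width(False)  # B = 2 probability columns
width_pass = meta_input_width(True)    # B + 6 original features
print(width_plain, width_pass)
```

With passthrough the meta-learner's input grows by the full feature count, so stronger regularization on the meta-learner is usually worth considering.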

Option 3: Predictions + rank features

import numpy as np
from scipy.stats import rankdata

# Rank predictions across samples
meta_rank_features = np.column_stack([
    rankdata(oof_pred) / len(oof_pred)
    for oof_pred in oof_predictions.T
])
# Rank-transformed features lie in (0, 1] and are roughly uniform, which helps regularize

Option 4: Prediction uncertainty features

# Entropy of each classifier's probability vector
meta_entropy = np.column_stack([
    -np.sum(proba * np.log(proba + 1e-10), axis=1)
    for proba in base_proba_matrices
])

High entropy → classifier is uncertain → meta-learner can downweight it.
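As a small numeric check of the entropy feature, the following sketch evaluates the same expression on two hand-picked probability rows. `prediction_entropy` is a hypothetical helper name; the formula is the one used in the snippet above.

```python
import numpy as np

def prediction_entropy(proba):
    """Shannon entropy (in nats) of each row of an (m, n_classes) probability matrix."""
    return -np.sum(proba * np.log(proba + 1e-10), axis=1)

proba = np.array([
    [0.50, 0.50],   # maximally uncertain for two classes
    [0.99, 0.01],   # confident
])
entropy = prediction_entropy(proba)
print(entropy)
```

For two classes the entropy peaks at ln 2 ≈ 0.693 for a 50/50 prediction and approaches 0 as the prediction becomes confident.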


9. StackingClassifier in sklearn — Full API

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

estimators = [
    ('lr',  LogisticRegression(max_iter=1000)),
    ('rf',  RandomForestClassifier(n_estimators=300, n_jobs=-1)),
    ('gbt', GradientBoostingClassifier(n_estimators=200)),
    ('svm', SVC(probability=True))
]

clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=0.1),   # Meta-learner
    cv=5,                    # K in K-fold cross-validation for OOF
    stack_method='predict_proba',  # 'auto' (default), 'predict_proba', 'decision_function', or 'predict'
    n_jobs=-1,
    passthrough=False,       # If True, pass original features to meta-learner too
    verbose=0
)

clf.fit(X_train, y_train)

# Predictions
y_pred  = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

# Access components
clf.estimators_           # List of fitted base estimators (fitted on full training data)
clf.named_estimators_     # Dict: name → fitted estimator
clf.final_estimator_      # Fitted meta-learner

# OOF predictions used to train meta-learner (not stored by default)
# To access: use cross_val_predict manually

Getting OOF predictions manually:

from sklearn.model_selection import cross_val_predict
import numpy as np

# Generate OOF predictions for each base classifier
oof_predictions = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for _, est in estimators   # `est` avoids confusion with the stacking `clf` above
])
# shape: (m, B)

# Analyze OOF performance
from sklearn.metrics import roc_auc_score
for (name, _), oof_pred in zip(estimators, oof_predictions.T):
    auc = roc_auc_score(y_train, oof_pred)
    print(f"{name} OOF AUC: {auc:.4f}")

10. Stacking for Regression

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

estimators = [
    ('rf',  RandomForestRegressor(n_estimators=200, n_jobs=-1)),
    ('gbt', GradientBoostingRegressor(n_estimators=200, learning_rate=0.05)),
    ('svr', SVR(kernel='rbf', C=1.0))
]

reg = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
    n_jobs=-1,
    passthrough=False
)
reg.fit(X_train, y_train)

Meta-learner for regression:


11. The Bias-Variance Profile

Stacking improves over individual base learners by reducing both bias and variance:

Bias reduction: The meta-learner can correct systematic errors of individual base classifiers. If RF consistently overestimates class 1 probability by 0.05, the meta-learner learns to discount RF's positive predictions.

Variance reduction: The meta-learner combines multiple predictions. When the base learners' errors are not perfectly correlated, the variance of the combination is typically lower than that of its individual components.

The improvement depends on:

Diversity:      Higher diversity → less correlated errors → larger variance reduction
Individual quality: Higher individual accuracy → better meta-features → better meta-model
Meta-learning:  More regularized meta-learner → lower variance of combination
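The role of diversity can be made concrete for the simplest case, an equal-weight average (a simplifying assumption; a learned combination follows the same logic but with unequal weights). For B predictors with equal error variance σ² and pairwise error correlation ρ, the variance of their mean is σ²·(ρ + (1 − ρ)/B). The sketch below compares that formula with a simulation.

```python
import numpy as np

def variance_of_average(sigma2, rho, B):
    """Variance of the mean of B predictors with equal variance and pairwise correlation rho."""
    return sigma2 * (rho + (1.0 - rho) / B)

rng = np.random.default_rng(0)
B, sigma2, n = 5, 1.0, 200_000

for rho in (0.0, 0.5, 0.9):
    # Equicorrelated errors: a shared component plus an independent component.
    shared = rng.normal(size=(n, 1))
    own = rng.normal(size=(n, B))
    errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
    simulated = errors.mean(axis=1).var()
    expected = variance_of_average(sigma2, rho, B)
    print(f"rho={rho:.1f}  formula={expected:.3f}  simulated={simulated:.3f}")
```

With uncorrelated errors the variance falls to σ²/B; as ρ approaches 1 the benefit disappears, which is why adding near-duplicate models helps so little.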

Practical expectation: As a rough rule of thumb, stacking 5 diverse classifiers improves AUC by 2–5% over the best single classifier, roughly 2× the gain from voting. Actual gains depend heavily on the dataset and on how diverse the base learners are.


12. Theoretical Justification — Super Learning

Stacking is a specific case of Super Learning (van der Laan, Polley, Hubbard, 2007) — a theoretically grounded ensemble method with optimality guarantees.

The Oracle Inequality for Super Learning:

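Let L_oracle be the risk of the oracle selector: the best of the candidates under comparison, chosen with knowledge of the true data distribution. Let L_SL be the risk of the super learner, which has to choose by cross-validation. With risks measured relative to the optimal risk, the bound has the following simplified form:

E[L_SL] ≤ (1 + δ) · E[L_oracle] + C(δ) · (log B) / n

Here B is the number of candidates, δ > 0 is arbitrary, C(δ) is a constant that depends on δ and on a bound on the loss, and n is the sample size. Provided the oracle's risk shrinks more slowly than (log B) / n, as n → ∞:

L_SL / L_oracle → 1   (asymptotic optimality)

The super learner therefore performs asymptotically as well as the best candidate in its library (not the best possible predictor overall), and the price of choosing by cross-validation grows only logarithmically in the number of candidates.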

Practical implication: With sufficient data, stacking with a linear meta-learner approaches the best linear combination of the given base classifiers. Asymptotically it does no worse than the best individual base classifier, because the meta-learner can zero-weight poor ones; on small samples a noisy meta-learner can still underperform it.

Connection to cross-validated model selection: Model selection (picking the single best classifier by CV) is equivalent to stacking with a 0-1 weight vector. Stacking generalizes model selection by finding optimal non-integer weights.
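The relationship between selection and weighting can be sketched with non-negative least squares, a common way to fit such weights in a regression setting. The OOF predictions below are simulated (an assumption for illustration), and the comparison is made on the meta-training data itself, so it shows the optimization property only, not generalization.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
m = 500
y = rng.normal(size=m)

# Hypothetical OOF predictions from three regressors with different noise levels.
oof = np.column_stack([
    y + rng.normal(scale=0.5, size=m),
    y + rng.normal(scale=0.7, size=m),
    y + rng.normal(scale=1.0, size=m),
])

# Model selection: keep the single column with the lowest OOF error (one-hot weights).
mse_each = ((oof - y[:, None]) ** 2).mean(axis=0)
best = int(np.argmin(mse_each))
one_hot = np.eye(oof.shape[1])[best]

# Stacking with non-negative weights: minimize ||oof @ w - y|| subject to w >= 0.
w, _ = nnls(oof, y)
mse_stack = ((oof @ w - y) ** 2).mean()

print("one-hot weights:", one_hot, " MSE:", mse_each[best])
print("NNLS weights:   ", w.round(3), " MSE:", mse_stack)
```

Every one-hot vector is a feasible point of the NNLS problem, so the fitted weights can never have a higher meta-training error than the best single model.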


13. Assumptions

Assumption                    | Notes
------------------------------|----------------------------------------------------------------
IID samples                   | Required for OOF cross-validation validity
Exchangeability of folds      | OOF predictions are only "honest" if folds are exchangeable
No temporal structure         | For time-series data, use time-based CV for OOF generation
Sufficient meta-training data | Meta-learner needs enough OOF predictions to learn from
Base learner quality          | At least some base learners must be better than random
Calibrated base probabilities | For effective meta-learning; miscalibrated probs mislead meta
No target leakage             | OOF ensures this; blending ensures it via holdout
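For the calibration assumption, one option is to wrap a base learner in scikit-learn's `CalibratedClassifierCV` before stacking it. The sketch below is an illustration under assumed choices (a `LinearSVC` base model, sigmoid calibration, synthetic data); whether calibration helps a particular stack is an empirical question.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# LinearSVC exposes no predict_proba; the wrapper fits a sigmoid (Platt-style)
# calibrator on held-out folds and adds calibrated probabilities.
calibrated_svm = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated_svm.fit(X, y)

proba = calibrated_svm.predict_proba(X[:5])
print(proba.shape)
```

The wrapped estimator can then be used like any other entry in the `estimators` list of a stack.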

14. Advantages

✅ Learns Optimal Combination

Unlike voting (fixed combination), stacking learns which classifiers to trust and when — exploiting conditional expertise.

✅ Asymptotically Optimal (Super Learning)

Oracle inequality guarantees convergence to the best possible predictor from the candidate set.

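✅ Rarely Worse Than the Best Base Learner (with regularized meta)

A penalized logistic regression meta-learner can shrink the weights of poor classifiers toward zero (an L1 penalty can set them exactly to zero), so with enough meta-training data the stack falls back to roughly the best individual model.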

✅ Captures Non-Linear Combinations (with powerful meta-learner)

GBT or neural network meta-learner can learn "in this feature region, use RF; in that region, use SVM" — impossible with voting.

✅ Roughly 2× the Gain of Voting

Empirically tends to outperform voting ensembles in competitions and benchmarks, although the margin depends on the dataset.

✅ Naturally Handles Model Uncertainty

The meta-learner's probability outputs reflect the collective uncertainty of all base models.

✅ Feature Passthrough for Context-Aware Combination

passthrough=True allows the meta-learner to use original features for context — "in this region of feature space, trust these classifiers more."


15. Drawbacks & Limitations

❌ Computationally Expensive

K-fold stacking with B base learners and K=5 requires 5B + B + 1 total model fits (5B fold fits, B full-data refits, 1 meta fit). For expensive base learners (large random forests, SVMs), this multiplies base training time by roughly K + 1.
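The fit count is simple enough to write down as a helper. Both function names below are hypothetical, and the counts assume single-level stacking with a full-data refit of every base learner.

```python
def stacking_fit_count(n_base, n_folds):
    """K fold fits and 1 full-data refit per base learner, plus 1 meta-learner fit."""
    return n_folds * n_base + n_base + 1

def blending_fit_count(n_base, refit=True):
    """1 fit per base learner (2 with the optional full-data refit), plus 1 meta-learner fit."""
    return n_base * (2 if refit else 1) + 1

print(stacking_fit_count(n_base=4, n_folds=5))   # 5*4 + 4 + 1 = 25
print(blending_fit_count(n_base=4))              # 2*4 + 1 = 9
```

For four base learners and K=5, stacking needs 25 fits against 9 for blending with refit, which is the cost side of stacking's better data efficiency.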

❌ Leakage Risk Without OOF

Fitting base learners on training data and then using their in-sample predictions to train the meta-learner is a classic data leakage pattern. Requires careful implementation.

❌ Small Datasets Problematic

With m < 1000 samples, 5-fold CV leaves fewer than 800 samples to train each fold's base learners, which degrades base learner performance and therefore the quality of the meta-features.

❌ Time-Series/Ordered Data Requires Careful CV

Standard K-fold CV lets a model train on observations that come after its validation fold (with or without shuffling), so future information leaks into the past. For time series, use TimeSeriesSplit or expanding-window CV instead.

❌ Meta-Learner Can Overfit

If the meta-learner is too complex (deep GBT with many trees on small meta-feature matrix), it overfits the OOF predictions. Always regularize the meta-learner more aggressively than the base learners.

❌ Complex Pipeline Maintenance

A stacking pipeline with 5 base learners, OOF generation, and a meta-learner has many moving parts. Production deployment requires careful engineering.

❌ Interpretability Loss

The two-level architecture makes explanation harder than any single model or simple voting ensemble.


16. Stacking vs. Voting vs. Bagging vs. Boosting

Property                  | Stacking                            | Voting              | Bagging               | Boosting
--------------------------|-------------------------------------|---------------------|-----------------------|---------------------
Learns combination        | ✅ Yes                              | ❌ Fixed rule       | ❌ Fixed average      | ❌ Fixed (αₜ weights)
Base learner types        | Heterogeneous                       | Heterogeneous       | Homogeneous           | Homogeneous
Training order            | Parallel (base) + Sequential (meta) | Parallel            | Parallel              | Sequential
Error type targeted       | Both bias + variance                | Variance            | Variance              | Bias
Typical gain over best    | 2–5%                                | 1–3%                | 5–20%                 | Varies (high)
Overfitting risk          | Low (with OOF)                      | Low                 | Very low              | Medium
Implementation complexity | High                                | Low                 | Low                   | Medium
sklearn support           | ✅ StackingClassifier               | ✅ VotingClassifier | ✅ BaggingClassifier  | ✅ AdaBoost/GBT

17. Practical Tips & Gotchas

Basic Stacking Setup

from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                               GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score  # needed for the AUC printout below
from sklearn.svm import SVC

estimators = [
    ('rf',  RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
    ('gbt', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=42)),
    ('svm', SVC(probability=True, kernel='rbf', C=1.0, random_state=42))
]

stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(C=0.1, max_iter=1000),
    cv=5,
    stack_method='predict_proba',
    n_jobs=-1
)

stacking_clf.fit(X_train, y_train)
print(f"Stacking AUC: {roc_auc_score(y_test, stacking_clf.predict_proba(X_test)[:,1]):.4f}")

Manual OOF Stacking (More Control)

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def generate_oof_predictions(estimators, X, y, n_folds=5):
    # X and y are assumed to be NumPy arrays (use .iloc indexing for pandas objects)
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    oof_preds = np.zeros((len(y), len(estimators)))

    for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr = y[train_idx]

        for clf_idx, (name, clf) in enumerate(estimators):
            clf_fold = clone(clf)  # fresh, unfitted copy for this fold
            clf_fold.fit(X_tr, y_tr)
            oof_preds[val_idx, clf_idx] = clf_fold.predict_proba(X_val)[:, 1]

        print(f"Fold {fold+1}/{n_folds} complete")

    return oof_preds

# Generate OOF
oof = generate_oof_predictions(estimators, X_train, y_train, n_folds=5)

# Individual OOF performance
for (name, _), col in zip(estimators, oof.T):
    auc = roc_auc_score(y_train, col)
    print(f"{name} OOF AUC: {auc:.4f}")

# Train meta-learner on OOF
meta = LogisticRegression(C=0.1, max_iter=1000)
meta.fit(oof, y_train)

# Train final base learners on full data
final_clfs = [(name, clone(clf).fit(X_train, y_train)) for name, clf in estimators]

# Test predictions
test_meta_features = np.column_stack([
    clf.predict_proba(X_test)[:, 1]
    for _, clf in final_clfs
])
y_pred = meta.predict_proba(test_meta_features)[:, 1]
print(f"Stacking AUC: {roc_auc_score(y_test, y_pred):.4f}")

Blending Implementation

from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Step 1: Split
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

# Step 2: Train base learners on X_base
fitted_base = [(name, clone(clf).fit(X_base, y_base)) for name, clf in estimators]

# Step 3: Generate holdout meta-features
hold_meta = np.column_stack([
    clf.predict_proba(X_hold)[:, 1]
    for _, clf in fitted_base
])

# Step 4: Train meta-learner on holdout
meta = LogisticRegression(C=0.1)
meta.fit(hold_meta, y_hold)

# Step 5: Retrain base learners on ALL training data
final_base = [(name, clone(clf).fit(X_train, y_train)) for name, clf in estimators]

# Step 6: Test prediction
test_meta = np.column_stack([
    clf.predict_proba(X_test)[:, 1]
    for _, clf in final_base
])
y_pred = meta.predict_proba(test_meta)[:, 1]

Time-Series Safe Stacking

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit

def generate_oof_timeseries(estimators, X, y, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    # NaN marks rows that never fall in a validation fold (the earliest block)
    oof_preds = np.full((len(y), len(estimators)), np.nan)

    for train_idx, val_idx in tscv.split(X):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr = y[train_idx]

        for clf_idx, (name, clf) in enumerate(estimators):
            clf_fold = clone(clf).fit(X_tr, y_tr)
            oof_preds[val_idx, clf_idx] = clf_fold.predict_proba(X_val)[:, 1]

    # Drop the NaN rows before fitting the meta-learner, e.g.:
    #   mask = ~np.isnan(oof_preds).any(axis=1); meta.fit(oof_preds[mask], y[mask])
    return oof_preds
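One detail of `TimeSeriesSplit` is worth checking before training a meta-learner on these predictions: the earliest block of rows is only ever used for training, so it never receives an out-of-fold prediction. The toy sketch below (12 ordered rows and 3 splits, chosen for readability) prints the folds and collects the rows that are never validated.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered rows
covered = np.zeros(len(X), dtype=bool)

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Every training index precedes every validation index.
    assert train_idx.max() < val_idx.min()
    covered[val_idx] = True
    print("train:", train_idx, "val:", val_idx)

never_validated = np.flatnonzero(~covered)
print("rows without an OOF prediction:", never_validated)
```

Rows without an OOF prediction should be excluded from the meta-learner's training set rather than left at a placeholder value.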

18. Competition-Level Stacking Strategy

In Kaggle and data science competitions, stacking is often the final step before submission. The typical competition stacking workflow:

Layer 1 — Diverse Base Models

# Include diverse algorithm families
layer1_models = [
    # Tree-based
    ('rf',      RandomForestClassifier(n_estimators=1000, ...)),
    ('xgb',     XGBClassifier(...)),
    ('lgb',     LGBMClassifier(...)),
    ('cat',     CatBoostClassifier(...)),
    # Linear
    ('lr',      LogisticRegression(...)),
    # Non-parametric
    ('knn',     KNeighborsClassifier(n_neighbors=15)),
    # Kernel
    ('svm',     SVC(kernel='rbf', probability=True)),
]

Layer 1 OOF Generation

# Use 10-fold CV for more stable OOF predictions
oof_l1 = generate_oof_predictions(layer1_models, X_train, y_train, n_folds=10)

Layer 2 — Second-Level Models

layer2_models = [
    ('xgb_l2', XGBClassifier(max_depth=3, n_estimators=200, ...)),
    ('lgb_l2', LGBMClassifier(num_leaves=15, n_estimators=200, ...)),
]
oof_l2 = generate_oof_predictions(layer2_models, oof_l1, y_train, n_folds=10)

Layer 3 — Meta-Learner

meta = LogisticRegression(C=0.05)   # Heavy regularization at top level
meta.fit(oof_l2, y_train)

Final Test Prediction

# Average K-fold test predictions for each model (standard competition technique)
def avg_test_predictions(estimators, X_train, y_train, X_test, n_folds=10):
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    test_preds = np.zeros((len(X_test), len(estimators)))

    for fold, (train_idx, _) in enumerate(kf.split(X_train, y_train)):
        for clf_idx, (name, clf) in enumerate(estimators):
            clf_fold = clone(clf).fit(X_train[train_idx], y_train[train_idx])
            test_preds[:, clf_idx] += clf_fold.predict_proba(X_test)[:, 1]

    return test_preds / n_folds    # Average over K folds

19. When to Use It

Use Stacking when:

Use Blending instead when:

Use Voting instead when:

Do NOT use Stacking when:


Summary

┌──────────────────────────────────────────────────────────────────────┐
│              STACKING & BLENDING AT A GLANCE                        │
├──────────────────────────────────────────────────────────────────────┤
│  CORE IDEA    Meta-learner learns optimal combination of base preds  │
│  OOF KEY      Cross-validated OOF prevents leakage — honest preds   │
│  META INPUT   [P̂₁(y=1|x), ..., P̂_B(y=1|x)] → meta features       │
│  META LEARNER LR (default) → GBT (aggressive) — always regularize   │
│  GAIN         2–5% over best single model (2× voting)               │
│  THEORY       Super learning oracle inequality → asymptotic optimum  │
│                                                                      │
│  BLENDING     Holdout (not OOF) — faster, less data-efficient       │
│  BLENDING vs  Stacking: +data efficiency; Blending: +speed          │
│  STACKING                                                            │
│                                                                      │
│  STRENGTH     Learns combination, 2× voting gain, asymp. optimal    │
│  WEAKNESS     K× training cost, complex, small data problematic     │
│  BEST FOR     Competitions, max accuracy, heterogeneous base models  │
└──────────────────────────────────────────────────────────────────────┘

Stacking is what happens when you take the voting ensemble idea seriously enough to ask: "Who decides the weights?" Voting says the weights are equal and fixed. Stacking says the weights should be learned from data — and that learning can be conditional on the input, non-linear, and adaptive to each base classifier's specific failure modes. The OOF mechanism is the key engineering insight: it gives the meta-learner honest, unbiased predictions to learn from, transforming what could be a leakage disaster into a principled, theoretically grounded algorithm. Super learning shows this isn't just a heuristic — it's an asymptotically optimal procedure. The 2–5% accuracy gain it provides in practice is often the difference between a good model and a winning solution.
