BaggingClassifier

1. What Is Bagging?

Bagging (Bootstrap Aggregating), introduced by Leo Breiman in 1996, is the general framework for training multiple instances of the same base learner on different bootstrap samples of the training data and aggregating their predictions.

sklearn's BaggingClassifier is the direct, general-purpose implementation of this framework: it can apply bagging to any sklearn-compatible classifier. Random Forest is a specialized, highly optimized form of bagging restricted to decision trees, and Extra-Trees is a close relative (sklearn's ExtraTreesClassifier defaults to bootstrap=False, so it does not resample rows unless asked to). BaggingClassifier generalizes bagging to arbitrary base learners.

Property       | Value
Introduced     | Leo Breiman, 1996
Task           | Classification (BaggingRegressor is the regression counterpart)
Base learner   | Any sklearn-compatible classifier
Key mechanism  | Bootstrap samples + prediction aggregation
Primary effect | Variance reduction
sklearn class  | sklearn.ensemble.BaggingClassifier

2. Historical Context — Breiman 1996

Breiman's 1996 paper "Bagging Predictors" provided the first principled, general ensemble framework with three key contributions:

  1. Identified instability as the driver of variance: classifiers that change substantially with small data perturbations (trees, neural networks) benefit most from bagging; stable ones (linear models) benefit little
  2. Proved variance reduction for the idealized case: under squared error, the predictor averaged over training sets drawn from the true distribution never has a higher expected error than the average individual model; bagging approximates that average with bootstrap samples
  3. Bootstrap as the natural perturbation: sampling with replacement simulates having different training datasets from the same data-generating process

Random Forest (Breiman 2001) is bagging's most successful application, but the framework applies to any classifier.


3. The Core Principle — Why Averaging Helps

Unstable vs. Stable Estimators

Bagging is most beneficial for unstable estimators — classifiers whose predictions change substantially when trained on slightly different data:

Unstable (bagging helps greatly):
    Unpruned decision trees  → small data change → completely different tree structure
    k-NN with small k        → small data change → different nearest neighbors
    Neural networks          → random initialization → different local optima

Stable (bagging helps little):
    Linear/logistic regression → robust to small data perturbations
    Linear SVM                 → convex optimization → nearly identical solutions
    k-NN with large k          → large neighborhood averages perturbations
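
One way to make "unstable" concrete is to fit the same model on several bootstrap samples and measure how often the resulting predictions disagree. The sketch below does this for an unpruned tree and for logistic regression. The synthetic dataset, the helper name bootstrap_disagreement, and the choice of 20 rounds are assumptions made for illustration, and the exact numbers will vary with the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: an assumption for illustration, not taken from the text
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, y_train, X_test = X[:700], y[:700], X[700:]

def bootstrap_disagreement(make_model, n_rounds=20, seed=0):
    """Mean pairwise disagreement of models fit on different bootstrap samples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # with replacement
        preds.append(make_model().fit(X_train[idx], y_train[idx]).predict(X_test))
    preds = np.array(preds)
    pairs = [(i, j) for i in range(n_rounds) for j in range(i + 1, n_rounds)]
    return float(np.mean([(preds[i] != preds[j]).mean() for i, j in pairs]))

tree_instability = bootstrap_disagreement(lambda: DecisionTreeClassifier(random_state=0))
lr_instability = bootstrap_disagreement(lambda: LogisticRegression(max_iter=1000))
print(f"tree: {tree_instability:.3f}  logistic regression: {lr_instability:.3f}")
```

A larger disagreement rate means the learner reacts more strongly to resampling, which is the situation in which averaging has the most variance to remove.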

Variance Decomposition

For a predictor f̂(x) trained on dataset D, with y = f*(x) + noise, the expected squared test error at a point x decomposes as:

E_D[(y − f̂(x))²] = Bias² + Variance + Irreducible Noise

Bias²    = (E_D[f̂(x)] − f*(x))²
Variance = E_D[(f̂(x) − E_D[f̂(x)])²]

Bagging approximates E_D[f̂(x)] by averaging over bootstrap samples:

f̂_bagged(x) = (1/B) Σ_b f̂_b(x)  ≈  E_D[f̂(x)]

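Effect on Bias: Small. Each bootstrap sample is drawn from the empirical distribution of D, so every model is fit to data that resembles D (though it contains only about 63% of D's distinct points).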
Effect on Variance: Reduced from σ² toward ρσ² (the correlation floor).


4. Mathematical Foundation

Bootstrap Statistics

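With m training samples and a bootstrap sample D_b of size m:

P(sample i in bootstrap D_b) = 1 − (1 − 1/m)^m → 1 − 1/e ≈ 63.2%
P(sample i is OOB for estimator b) = (1 − 1/m)^m → 1/e ≈ 36.8%
E[# appearances of i in D_b] = 1  (Binomial(m, 1/m), approximately Poisson(1) for large m)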
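
These limits are easy to check numerically. The following is a minimal simulation sketch; the sizes m = 1000 and 500 bootstrap draws are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_bootstraps = 1000, 500

# Each row is one bootstrap sample: m indices drawn with replacement
samples = rng.integers(0, m, size=(n_bootstraps, m))

# Fraction of distinct training points that appear in each bootstrap sample
in_bag_fraction = np.mean([len(np.unique(row)) / m for row in samples])

# Number of times point 0 appears in each bootstrap sample
counts_of_point_0 = (samples == 0).sum(axis=1)

exact = 1 - (1 - 1 / m) ** m
print(f"in-bag fraction: simulated {in_bag_fraction:.4f}, exact {exact:.4f}")
print(f"out-of-bag fraction: {1 - in_bag_fraction:.4f}")
print(f"mean appearances of one point: {counts_of_point_0.mean():.3f}")
```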

Variance Reduction Formula

With B models, each having variance σ² and pairwise correlation ρ:

Var(f̂_bagged) = ρσ² + (1−ρ)σ²/B

As B → ∞:   Var → ρσ²    (the correlation floor)

For unstable base learners, different bootstrap samples produce very different models → ρ is small → large variance reduction. For stable base learners, ρ ≈ 1 → bagging provides no benefit.
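
The formula can be sanity-checked by simulating B equicorrelated "predictions" and measuring the variance of their average. This is a sketch under the formula's own assumptions (equal variances, one shared pairwise correlation); the values of ρ, σ², and B are arbitrary, and the Gaussian construction is just one convenient way to obtain the required correlation structure.

```python
import numpy as np

def bagged_variance(rho, sigma2, B):
    """Variance of the average of B predictions with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

rng = np.random.default_rng(0)
rho, sigma2, B, n_trials = 0.3, 4.0, 25, 100_000

# Build B equicorrelated "predictions": a shared component plus independent parts
shared = rng.standard_normal((n_trials, 1))
private = rng.standard_normal((n_trials, B))
preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)

empirical = preds.mean(axis=1).var()
print(f"formula:   {bagged_variance(rho, sigma2, B):.4f}")
print(f"simulated: {empirical:.4f}")
print(f"floor as B grows: {rho * sigma2:.4f}")
```

Increasing B in this sketch shrinks only the second term; the shared component, which plays the role of ρσ², is untouched by averaging.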

Prediction

Soft voting (averages probabilities; what sklearn's BaggingClassifier does when the base estimator implements predict_proba):

f̂_bagged(x) = argmax_k (1/B) Σ_b P̂_b(y=k | x)

Hard voting (majority of class labels):

f̂_bagged(x) = argmax_k #{b : f̂_b(x) = k}

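Soft voting is usually preferable: it keeps each model's confidence rather than only its top label, though it relies on the base models producing reasonably calibrated probabilities.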
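
The two rules can disagree when one model is confident and the others are lukewarm. A small sketch with made-up probabilities (the three rows are hypothetical predict_proba outputs, not results from any fitted model):

```python
import numpy as np

# Hypothetical predict_proba outputs of B=3 classifiers for one sample, K=2 classes
probas = np.array([
    [0.90, 0.10],   # model 1: confident in class 0
    [0.40, 0.60],   # model 2: weakly prefers class 1
    [0.45, 0.55],   # model 3: weakly prefers class 1
])

# Soft voting: average the probabilities, then take the argmax
soft_vote = int(np.argmax(probas.mean(axis=0)))

# Hard voting: each model casts one vote for its own argmax
votes = np.argmax(probas, axis=1)
hard_vote = int(np.argmax(np.bincount(votes, minlength=probas.shape[1])))

print(soft_vote, hard_vote)  # prints "0 1"
```

Here the averaged probabilities favor class 0 (about 0.58 vs 0.42), while the majority of labels favors class 1.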


5. The BaggingClassifier Algorithm

Input: Data D, base estimator clf, B, max_samples, max_features,
       bootstrap (samples), bootstrap_features (features)

parallel for b = 1 to B:

    # Sample selection
    if bootstrap:
        S_b = draw max_samples from D WITH replacement
    else:
        S_b = draw max_samples from D WITHOUT replacement    (Pasting)

    # Feature selection
    if bootstrap_features:
        F_b = draw max_features features WITH replacement
    else:
        F_b = draw max_features features WITHOUT replacement

    # Train on (S_b restricted to F_b)
    clf_b = clone(clf).fit(S_b[:, F_b], labels_b)
    ensemble.append((clf_b, F_b))

Predict(x):
    probas = mean([clf_b.predict_proba(x[F_b]) for clf_b, F_b in ensemble])
    return argmax(probas)

Complexity: O(B × cost_of_single_clf_fit) — embarrassingly parallel.
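
The pseudocode above can be sketched in plain Python as follows. This is an illustrative reimplementation, not sklearn's actual code: fit_bagging and predict_bagging are hypothetical helper names, the loop is sequential rather than parallel, and float max_samples / max_features are simply truncated to integer counts.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagging(clf, X, y, n_estimators=25, max_samples=1.0, max_features=1.0,
                bootstrap=True, bootstrap_features=False, seed=0):
    """Fit clones of clf on random row/column subsets; returns (model, columns) pairs."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_s, n_f = max(1, int(max_samples * n)), max(1, int(max_features * p))
    ensemble = []
    for _ in range(n_estimators):
        rows = rng.choice(n, size=n_s, replace=bootstrap)           # bagging or pasting
        cols = rng.choice(p, size=n_f, replace=bootstrap_features)  # feature subset
        model = clone(clf).fit(X[rows][:, cols], y[rows])
        ensemble.append((model, cols))
    return ensemble

def predict_bagging(ensemble, X, classes):
    """Soft voting: average predict_proba over the ensemble, aligned to `classes`."""
    total = np.zeros((X.shape[0], len(classes)))
    for model, cols in ensemble:
        # A subsample can miss a class, so place columns by the model's own classes_
        total[:, np.searchsorted(classes, model.classes_)] += model.predict_proba(X[:, cols])
    return classes[np.argmax(total / len(ensemble), axis=1)]

# Synthetic data, assumed for illustration only
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]
ensemble = fit_bagging(DecisionTreeClassifier(), X_train, y_train)
y_pred = predict_bagging(ensemble, X_test, np.unique(y_train))
print(f"test accuracy: {(y_pred == y_test).mean():.3f}")
```

Storing the column indices next to each model mirrors the (clf_b, F_b) pairs in the pseudocode: prediction must slice the same features the model was trained on.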


6. Relationship to Random Forest and Extra-Trees

BaggingClassifier(DecisionTreeClassifier) is not equivalent to Random Forest:

Property              | BaggingClassifier(DecisionTree)             | Random Forest
Feature randomization | Per estimator (if max_features < 1)         | Per split (at every node)
Split finding         | Best split over all the estimator's features | Best split over a random subset (√p by default)
Tree correlation      | Higher (same features for all splits)       | Lower (different features at each split)
Accuracy              | Typically lower                             | Typically higher

Random Forest applies feature randomization at each individual split, a far stronger decorrelation mechanism. BaggingClassifier subsamples features at most once per estimator (and not at all with the default max_features=1.0). RF usually matches or outperforms BaggingClassifier(DecisionTree).

To properly approximate RF within BaggingClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rf_like = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_features='sqrt'),  # Per-split subset inside tree
    n_estimators=500,
    bootstrap=True,
    n_jobs=-1
)
# Still not identical to RF but closer: the DecisionTreeClassifier handles per-split subsets

7. Bagging with Different Base Learners

Decision Trees (Approximate RF)

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),
    n_estimators=200, bootstrap=True, n_jobs=-1
)

Use when you need a tree configuration that RandomForestClassifier does not expose (for example splitter='random' or a custom tree subclass).


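k-Nearest Neighbors

kNN with small k is sensitive to which neighbors land in each sample, so bagging can help, mainly when each estimator sees a subsample (max_samples < 1.0) or a feature subset. Breiman (1996) found nearest-neighbor classifiers comparatively stable under plain bootstrap resampling, so gains are not guaranteed: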

from sklearn.neighbors import KNeighborsClassifier

bag_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)

Bagged kNN smooths the jagged decision boundaries of individual kNN classifiers, often approaching the accuracy of kNN with a much larger k. Prediction is slower than for a single kNN, because every query runs through all n_estimators models.


SVMs

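Useful when a kernel SVM is too slow to train on the full dataset: training cost grows faster than linearly with the number of samples, so several SVMs fit on small subsamples can be cheaper than one SVM fit on everything: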

from sklearn.svm import SVC

bag_svm = BaggingClassifier(
    estimator=SVC(kernel='rbf', C=1.0, probability=True),
    n_estimators=20,       # SVMs are slow — keep small
    max_samples=0.5,       # Smaller bootstraps for speed
    bootstrap=True,
    n_jobs=-1
)

Each SVM trains on ~50% of the data, which is faster and yields different support vectors, hence diversity. In practice this is rarely the right choice over gradient-boosted trees (GBT).


Neural Networks (MLPs)

MLPs are unstable due to random initialization and non-convex optimization:

from sklearn.neural_network import MLPClassifier

bag_mlp = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
    n_estimators=10,       # MLPs are slow
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)

Each MLP converges to a different local minimum — averaging over these reduces variance from random initialization.


Logistic Regression — Rarely Beneficial

Logistic regression is stable → bagging provides minimal benefit:

# Not recommended — LR is stable, bagging wastes compute
from sklearn.linear_model import LogisticRegression
bag_lr = BaggingClassifier(estimator=LogisticRegression(), n_estimators=50)
# Bootstrap solutions are nearly identical → ρ ≈ 0.9+ → Var reduction ≈ 0

8. Sampling Strategies: Pasting, Subspaces, Patches

Method           | bootstrap (samples) | bootstrap_features | max_samples | max_features
Bagging          | True                | False              | 1.0         | 1.0
Pasting          | False               | False              | < 1.0       | 1.0
Random Subspaces | False               | False              | 1.0         | < 1.0
Random Patches   | True                | False              | < 1.0       | < 1.0

Pasting — samples without replacement:

bag_paste = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    max_samples=0.7, bootstrap=False, n_jobs=-1
)

Random Subspaces (Ho 1998) — all samples, random features:

bag_subspace = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    max_samples=1.0, bootstrap=False,
    max_features=0.6, bootstrap_features=False, n_jobs=-1
)

Random Patches — both sample and feature subsampling:

bag_patches = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    max_samples=0.7, bootstrap=True,
    max_features=0.6, bootstrap_features=False, n_jobs=-1
)
# Randomizes both rows and columns: typically the most diverse ensemble; useful for high-dimensional data

9. Out-of-Bag Evaluation

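When bootstrap=True (with the default max_samples=1.0), each estimator leaves out about 36.8% of the training samples. These out-of-bag (OOB) samples give a validation estimate at no extra training cost: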

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=200,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1
)
bag.fit(X_train, y_train)

print(f"OOB accuracy: {bag.oob_score_:.4f}")
# bag.oob_decision_function_: shape (m, K) — per-sample OOB probabilities

10. Bias-Variance Profile

Base Learner           | Single Variance | ρ (bagged) | Bagging Benefit
Unpruned decision tree | Very high       | ~0.2–0.5   | ✅✅ Enormous
kNN (k=5)              | High            | ~0.3–0.6   | ✅ Large
MLP (random init)      | High            | ~0.4–0.6   | ✅ Moderate
SVM (RBF)              | Medium          | ~0.5–0.7   | ✅ Moderate
Logistic Regression    | Low             | ~0.9+      | ❌ Minimal

The ρ values are approximate — they depend heavily on the dataset and hyperparameters.


11. Hyperparameters — Complete Reference

from sklearn.ensemble import BaggingClassifier

BaggingClassifier(
    estimator=None,           # Base learner (default: DecisionTreeClassifier())
    n_estimators=10,          # Number of base estimators — set 100–500
    max_samples=1.0,          # Samples per estimator (int or float fraction)
    max_features=1.0,         # Features per estimator (int or float fraction)
    bootstrap=True,           # Sample WITH replacement
    bootstrap_features=False, # Features WITH replacement
    oob_score=False,          # Requires bootstrap=True
    warm_start=False,         # Add estimators incrementally
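    n_jobs=-1,                # Use all CPU cores (sklearn's default is None)
    random_state=42,          # Fixed seed for reproducibility (sklearn's default is None)
    verbose=0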
)

Priority:

1. estimator:      Choose base learner for your data/problem
2. n_estimators:   100–500 for production; more for unstable base learners
3. max_samples:    1.0 (standard); 0.5–0.8 to speed training or add regularization
4. max_features:   1.0 (standard); < 1.0 for random patches / high-dim data
5. bootstrap:      True for bagging+OOB; False for pasting

12. Assumptions

Assumption               | Notes
IID samples              | Bootstrap theory requires exchangeable samples
Unstable base learner    | Bagging works best for high-variance, low-bias estimators
Sufficient B             | B ≥ 100 for stable OOB estimate and stable ensemble
Base learner correctness | Each base learner must be better than random chance

13. Advantages

✅ Completely General

Wraps any sklearn-compatible classifier. It is sklearn's general-purpose way to apply bagging to kNN, SVM, MLP, custom classifiers, and so on.

✅ Proven Variance Reduction

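For unstable estimators, averaging pushes variance toward the ρσ² floor while leaving bias largely unchanged. The formal guarantee covers idealized aggregation under squared error; in practice bagging can slightly degrade learners that are already stable.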

✅ Free OOB Validation

When bootstrap=True, the OOB score estimates generalization accuracy without a separate held-out set.

✅ All Sampling Strategies in One API

Bagging, Pasting, Random Subspaces, Random Patches — four distinct methods through the same interface.

✅ Fully Parallelizable

All B estimators are independent, so training scales close to linearly with the number of CPU cores.

✅ Simple and Transparent

Easy to understand, easy to debug, easy to inspect individual models.


14. Drawbacks & Limitations

❌ Less Specialized Than Random Forest

For trees, RF's per-split feature randomization produces better decorrelation than BaggingClassifier's per-estimator sampling.

❌ Slow for Expensive Base Learners

Training 100 SVMs or 100 MLPs can be prohibitively expensive; keep n_estimators small or reduce max_samples.

❌ No Built-in Feature Importance

Must aggregate manually from individual estimators.

❌ Memory Intensive

Stores B complete model objects.

❌ Minimal Benefit for Stable Learners

Bagging logistic regression or linear SVM is a waste of compute.


15. Bagging vs. Boosting vs. Stacking

Property          | Bagging             | Boosting      | Stacking
Training order    | Parallel            | Sequential    | Two sequential levels
Error targeted    | Variance            | Bias          | Both
Sample weighting  | Bootstrap (uniform) | Error-driven  | Not applicable
Base learner type | Same model          | Same model    | Different models
Combination       | Average / vote      | Weighted vote | Meta-learner
Noise robustness  | ✅ High             | ⚠️ Variable   | ✅ Moderate
Overfit risk      | Low                 | Medium–High   | Medium

16. Practical Tips & Gotchas

Bagged kNN — Best Non-Tree Application

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

bag_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)
bag_knn.fit(X_train, y_train)
print(f"Accuracy: {bag_knn.score(X_test, y_test):.4f}")

Access Individual Estimators

bag.fit(X_train, y_train)

# Inspect each fitted estimator
for i, (clf, features) in enumerate(zip(bag.estimators_, bag.estimators_features_)):
    preds = clf.predict(X_test[:, features])
    print(f"Model {i}: {(preds == y_test).mean():.4f}")

# Which training samples went to each estimator
for i, indices in enumerate(bag.estimators_samples_):
    print(f"Model {i}: {len(set(indices))} unique samples")

Random Patches for High-Dimensional Data

bag_patches = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=200,
    max_samples=0.6,
    max_features=0.4,
    bootstrap=True,
    bootstrap_features=False,
    n_jobs=-1
)

Manual Feature Importance from Bagged Trees

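import numpy as np

# Only works when the base estimator exposes feature_importances_.
# Each estimator reports importances for its own feature subset, so map them
# back to the original columns before averaging (matters when max_features < 1.0).
per_model = np.zeros((len(bag.estimators_), bag.n_features_in_))
for i, (clf, features) in enumerate(zip(bag.estimators_, bag.estimators_features_)):
    np.add.at(per_model[i], features, clf.feature_importances_)

importances = per_model.mean(axis=0)
std = per_model.std(axis=0)
# Now importances is the mean MDI across all bagged trees, in original column order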

17. When to Use It

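Use BaggingClassifier when:

    The base learner is an unstable non-tree model (kNN with small k, MLP, RBF SVM)
    You need explicit control over the sampling scheme (pasting, random subspaces, random patches)
    You want an OOB estimate for an arbitrary base estimator

Use Random Forest instead when:

    The base learner is a decision tree: per-split feature randomization decorrelates trees better
    You need built-in feature importances

Do not use BaggingClassifier when:

    The base learner is stable (logistic regression, linear SVM): bagging mostly wastes compute
    A single base model is already so expensive that training B of them is impractical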


Summary

┌──────────────────────────────────────────────────────────────────────┐
│               BAGGING CLASSIFIER AT A GLANCE                        │
├──────────────────────────────────────────────────────────────────────┤
│  CORE IDEA    Train B models on B bootstrap samples, average        │
│  PRIMARY      Variance reduction for unstable estimators            │
│  VARIANCE     Var = ρσ² + (1−ρ)σ²/B → ρσ² as B→∞                │
│  BASE LEARNER Any sklearn estimator (kNN, SVM, MLP, Tree, custom)  │
│  SAMPLING     Bagging | Pasting | Random Subspaces | Random Patches │
│  OOB          Free validation when bootstrap=True                   │
│  STRENGTH     Generality, parallelism, principled, proven theory    │
│  WEAKNESS     No built-in feature importance; less than RF for trees│
│  vs RF        RF per-split randomization >> BaggingClassifier+Tree  │
│  BEST FOR     Bagging non-tree estimators; custom sampling control  │
└──────────────────────────────────────────────────────────────────────┘

Bagging is the foundational ensemble idea: train many versions of the same model on different data, then average. Its genius is in recognizing that the bootstrap sample is a natural simulation of "what if I had a different training set?" and that averaging such simulations approximates the expected prediction, reducing variance while leaving bias largely unchanged. Much of what came after (Random Forest, Extra-Trees, gradient boosting's subsampling) builds on this insight.
