BaggingClassifier
1. What Is Bagging?
Bagging (Bootstrap Aggregating), introduced by Leo Breiman in 1996, is the general framework for training multiple instances of the same base learner on different bootstrap samples of the training data and aggregating their predictions.
sklearn's BaggingClassifier is the direct, general-purpose implementation of this framework for classification: it can apply bagging to any sklearn-compatible classifier. Random Forest is a specialized, highly optimized instance of bagging restricted to decision trees, and Extra-Trees is a closely related tree ensemble (in sklearn it does not bootstrap by default); BaggingClassifier generalizes bagging to arbitrary base learners.
| Property | Value |
|---|---|
| Introduced | Leo Breiman, 1996 |
| Task | Classification (BaggingRegressor is the regression counterpart) |
| Base learner | Any sklearn-compatible classifier |
| Key mechanism | Bootstrap samples + prediction aggregation |
| Primary effect | Variance reduction |
| sklearn class | sklearn.ensemble.BaggingClassifier |
2. Historical Context — Breiman 1996
Breiman's 1996 paper "Bagging Predictors" laid out a principled, general ensemble framework with three key contributions:
- Identified instability as the driver of variance — classifiers that change substantially with small data perturbations (trees, neural networks) benefit most from bagging; stable ones (linear models) benefit little
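- Showed variance reduction: for squared error, the ideal aggregated predictor (averaged over training sets drawn from the true distribution) never has higher expected error than the average individual model, and bootstrap aggregation approximates that ideal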
- Bootstrap as the natural perturbation — sampling with replacement simulates having different training datasets from the same data-generating process
Random Forest (Breiman 2001) is bagging's most successful application, but the framework applies to any classifier.
3. The Core Principle — Why Averaging Helps
Unstable vs. Stable Estimators
Bagging is most beneficial for unstable estimators — classifiers whose predictions change substantially when trained on slightly different data:
Unstable (bagging helps greatly):
Unpruned decision trees → small data change → completely different tree structure
k-NN with small k → small data change → different nearest neighbors
Neural networks → random initialization → different local optima
Stable (bagging helps little):
Linear/logistic regression → robust to small data perturbations
Linear SVM → convex optimization → nearly identical solutions
k-NN with large k → large neighborhood averages perturbations
Variance Decomposition
For a predictor f̂(x) trained on dataset D, with f*(x) denoting the true (noise-free) target function, the expected squared test error at x decomposes as:
E_D[(y − f̂(x))²] = Bias² + Variance + Irreducible Noise
Bias² = (E_D[f̂(x)] − f*(x))²
Variance = E_D[(f̂(x) − E_D[f̂(x)])²]
Bagging approximates E_D[f̂(x)] by averaging over bootstrap samples:
f̂_bagged(x) = (1/B) Σ_b f̂_b(x) ≈ E_D[f̂(x)]
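Effect on Bias: Roughly unchanged. Each bootstrap sample is drawn from the empirical distribution of D and each model has the same form as a single model (a model fit to fewer unique points can be slightly more biased).
Effect on Variance: Reduced from σ² (the variance of a single model) toward ρσ², where ρ is the pairwise correlation between models (the correlation floor).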
4. Mathematical Foundation
Bootstrap Statistics
P(sample i in bootstrap D_b) = 1 − (1 − 1/m)^m → 1 − 1/e ≈ 63.2% (m = training-set size; D_b also has m draws)
P(sample i is OOB for estimator b) ≈ 36.8%
E[# appearances of i in D_b] = 1 (count is Binomial(m, 1/m), approximately Poisson(1) for large m)
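The bootstrap figures above are easy to check by simulation. The following is a minimal sketch using NumPy only; the training-set size `m` and the number of trials are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1_000          # training-set size (arbitrary choice for the demo)
n_trials = 2_000   # number of simulated bootstrap samples

in_bag_fractions = []
counts_of_sample_0 = []
for _ in range(n_trials):
    idx = rng.integers(0, m, size=m)                  # draw m indices with replacement
    in_bag_fractions.append(np.unique(idx).size / m)  # share of distinct samples drawn
    counts_of_sample_0.append(np.sum(idx == 0))       # how often sample 0 appears

in_bag = float(np.mean(in_bag_fractions))
mean_count = float(np.mean(counts_of_sample_0))
exact = 1 - (1 - 1 / m) ** m

print(f"simulated in-bag fraction:      {in_bag:.4f}")
print(f"exact 1 - (1 - 1/m)^m:          {exact:.4f}")
print(f"simulated OOB fraction:         {1 - in_bag:.4f}")
print(f"mean appearances of one sample: {mean_count:.3f}")
```

The simulated in-bag fraction should land close to 63.2%, and the average appearance count close to 1.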
Variance Reduction Formula
With B models, each having variance σ² and pairwise correlation ρ:
Var(f̂_bagged) = ρσ² + (1−ρ)σ²/B
As B → ∞: Var → ρσ² (the correlation floor)
For unstable base learners, different bootstrap samples produce very different models → ρ is small → large variance reduction. For stable base learners, ρ ≈ 1 → bagging provides no benefit.
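As a sanity check on the variance formula, the sketch below simulates B equicorrelated predictions and compares the empirical variance of their mean with ρσ² + (1−ρ)σ²/B. The Gaussian construction (a shared component plus an independent component per model) and the chosen values of ρ, σ², and B are assumptions made purely for illustration.

```python
import numpy as np


def bagged_variance(rho, sigma2, B):
    """Variance of the mean of B predictions with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B


rng = np.random.default_rng(0)
rho, sigma2, B = 0.3, 1.0, 25
n_draws = 100_000

# Shared component (variance rho*sigma2) + per-model component (variance (1-rho)*sigma2)
# gives each prediction variance sigma2 and every pair correlation rho.
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_draws, 1))
own = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(n_draws, B))
preds = shared + own

empirical = float(preds.mean(axis=1).var())
theoretical = bagged_variance(rho, sigma2, B)

print(f"empirical variance of the average: {empirical:.4f}")
print(f"formula:                           {theoretical:.4f}")
print(f"floor as B grows (rho * sigma2):   {rho * sigma2:.4f}")
```

Increasing `B` in this sketch moves the variance toward the floor ρσ² but never below it, which is why lowering ρ matters more than adding estimators once B is large.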
Prediction
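Soft voting (used by sklearn's BaggingClassifier whenever the base estimator implements predict_proba; averages probabilities):
f̂_bagged(x) = argmax_k (1/B) Σ_b P̂_b(y=k | x)
Hard voting (majority of class labels; the fallback when predict_proba is unavailable):
f̂_bagged(x) = argmax_k #{b : f̂_b(x) = k}
Soft voting usually does at least as well, because it uses each model's confidence rather than only its top label, provided the probabilities are reasonably calibrated.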
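The two aggregation rules can disagree. Here is a small worked example with made-up probabilities for three models and two classes (the numbers are hypothetical, chosen so that the rules differ):

```python
import numpy as np

# Hypothetical class-probability outputs of B = 3 models for one sample, K = 2 classes.
probas = np.array([
    [0.90, 0.10],   # model 1: confident in class 0
    [0.40, 0.60],   # model 2: weakly prefers class 1
    [0.40, 0.60],   # model 3: weakly prefers class 1
])

# Soft voting: average the probabilities, then take the argmax.
soft_pred = int(np.argmax(probas.mean(axis=0)))

# Hard voting: each model casts one vote for its own argmax.
votes = np.argmax(probas, axis=1)
hard_pred = int(np.argmax(np.bincount(votes, minlength=probas.shape[1])))

print("mean probabilities:", probas.mean(axis=0))
print("soft vote:", soft_pred, "| hard vote:", hard_pred)
```

One confident model outweighs two lukewarm ones under soft voting (class 0), while hard voting counts heads and returns class 1.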
5. The BaggingClassifier Algorithm
Input: Data D, base estimator clf, B, max_samples, max_features,
       bootstrap (samples), bootstrap_features (features)
parallel for b = 1 to B:
    # Sample selection
    if bootstrap:
        S_b = draw max_samples from D WITH replacement
    else:
        S_b = draw max_samples from D WITHOUT replacement (Pasting)
    # Feature selection
    if bootstrap_features:
        F_b = draw max_features features WITH replacement
    else:
        F_b = draw max_features features WITHOUT replacement
    # Train on (S_b restricted to F_b)
    clf_b = clone(clf).fit(S_b[:, F_b], labels_b)
    ensemble.append((clf_b, F_b))
Predict(x):
    probas = mean([clf_b.predict_proba(x[F_b]) for clf_b, F_b in ensemble])
    return argmax(probas)
Complexity: O(B × cost_of_single_clf_fit) — embarrassingly parallel.
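The pseudocode above can be turned into a short runnable sketch. This is an illustration of the idea, not sklearn's actual implementation: the helper names `fit_bagging` and `predict_bagging` are hypothetical, the loop is sequential rather than parallel, and the synthetic dataset is an arbitrary choice.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_bagging(clf, X, y, n_estimators=25, max_samples=1.0, max_features=1.0,
                bootstrap=True, bootstrap_features=False, seed=0):
    """Fit n_estimators clones of clf on random row/column subsets of (X, y)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_rows = max(1, int(max_samples * n))
    n_cols = max(1, int(max_features * p))
    ensemble = []
    for _ in range(n_estimators):
        rows = rng.choice(n, size=n_rows, replace=bootstrap)           # sample selection
        cols = rng.choice(p, size=n_cols, replace=bootstrap_features)  # feature selection
        model = clone(clf).fit(X[rows][:, cols], y[rows])
        ensemble.append((model, cols))
    return ensemble


def predict_bagging(ensemble, X, classes):
    """Soft vote: sum predict_proba over the ensemble, then take the argmax."""
    total = np.zeros((X.shape[0], len(classes)))
    for model, cols in ensemble:
        proba = model.predict_proba(X[:, cols])
        # A bootstrap sample may miss a class, so align columns via classes_.
        col_idx = np.searchsorted(classes, model.classes_)
        total[:, col_idx] += proba
    return classes[np.argmax(total, axis=1)]


X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
classes = np.unique(y_tr)

ensemble = fit_bagging(DecisionTreeClassifier(random_state=0), X_tr, y_tr)
bagged_acc = float(np.mean(predict_bagging(ensemble, X_te, classes) == y_te))
single_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"single tree accuracy:  {single_acc:.3f}")
print(f"bagged trees accuracy: {bagged_acc:.3f}")
```

Because each iteration of the loop is independent of the others, the fitting step could be distributed across workers, which is what the `n_jobs` parameter does in sklearn.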
6. Relationship to Random Forest and Extra-Trees
BaggingClassifier(DecisionTreeClassifier) is not equivalent to Random Forest:
| Property | BaggingClassifier(DecisionTree) | Random Forest |
|---|---|---|
| Feature randomization | Per estimator (if max_features<1) | Per split (at every node) |
| Split finding | Optimal over all features at each node | Optimal over √p subset at each node |
| Tree correlation | Higher (same features all splits) | Lower (different features each split) |
| Accuracy | Typically lower | Typically higher |
Random Forest applies feature randomization at each individual split, a far stronger decorrelation mechanism. BaggingClassifier only subsamples features once per estimator. For tree ensembles, RF usually outperforms BaggingClassifier(DecisionTree).
To properly approximate RF within BaggingClassifier:
BaggingClassifier(
    estimator=DecisionTreeClassifier(max_features='sqrt'),  # Per-split subset inside each tree
    n_estimators=500,
    bootstrap=True,
    n_jobs=-1
)
# Algorithmically close to RF (bootstrap rows + per-split feature subsets inside each tree),
# though the implementations differ, so results will not match exactly
7. Bagging with Different Base Learners
Decision Trees (Approximate RF)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),
    n_estimators=200, bootstrap=True, n_jobs=-1
)
Use when you need custom tree parameters not exposed by RandomForestClassifier.
k-Nearest Neighbors — Highly Recommended
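kNN with small k has high variance, so bagging can help. Note that Breiman (1996) found nearest-neighbour classifiers comparatively stable under full-size bootstrap resampling, so the gains tend to come from subsampling (max_samples < 1.0) or feature subsampling:
from sklearn.neighbors import KNeighborsClassifier
bag_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)
Bagged kNN smooths the jagged decision boundaries of individual kNN classifiers, often matching the accuracy of kNN with a much larger k. The cost is prediction time: every prediction requires a neighbour search in each of the B estimators.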
SVMs
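Useful when the training set is too large for a single SVM to train comfortably: each SVM is fit on a small bootstrap sample, and the ensemble recovers some of the accuracy lost by subsampling:
from sklearn.svm import SVC
bag_svm = BaggingClassifier(
    estimator=SVC(kernel='rbf', C=1.0, probability=True),
    n_estimators=20,   # SVMs are slow, so keep this small
    max_samples=0.5,   # Smaller bootstraps for speed
    bootstrap=True,
    n_jobs=-1
)
With max_samples=0.5 and bootstrap=True, each SVM sees a draw half the size of the data (about 39% unique samples), so it trains faster and ends up with different support vectors, which adds diversity. In practice this is rarely the right choice over gradient-boosted trees (GBT).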
Neural Networks (MLPs)
MLPs are unstable due to random initialization and non-convex optimization:
from sklearn.neural_network import MLPClassifier
bag_mlp = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
    n_estimators=10,   # MLPs are slow
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)
Each MLP converges to a different local minimum — averaging over these reduces variance from random initialization.
Logistic Regression — Rarely Beneficial
Logistic regression is stable → bagging provides minimal benefit:
# Not recommended — LR is stable, bagging wastes compute
from sklearn.linear_model import LogisticRegression
bag_lr = BaggingClassifier(estimator=LogisticRegression(), n_estimators=50)
# Bootstrap solutions are nearly identical → ρ ≈ 0.9+ → Var reduction ≈ 0
8. Sampling Strategies: Pasting, Subspaces, Patches
| Method | bootstrap (samples) | bootstrap_features | max_samples | max_features |
|---|---|---|---|---|
| Bagging | True | False | 1.0 | 1.0 |
| Pasting | False | False | < 1.0 | 1.0 |
| Random Subspaces | False | False | 1.0 | < 1.0 |
| Random Patches | True or False | False | < 1.0 | < 1.0 |
Pasting — samples without replacement:
bag_paste = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    max_samples=0.7, bootstrap=False, n_jobs=-1
)
Random Subspaces (Ho 1998) — all samples, random features:
bag_subspace = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    max_samples=1.0, bootstrap=False,
    max_features=0.6, bootstrap_features=False, n_jobs=-1
)
Random Patches — both sample and feature subsampling:
bag_patches = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    max_samples=0.7, bootstrap=True,
    max_features=0.6, bootstrap_features=False, n_jobs=-1
)
# Typically the most decorrelated of the four strategies; often a good fit for very high-dimensional data
9. Out-of-Bag Evaluation
When bootstrap=True and max_samples=1.0, about 36.8% of samples are out-of-bag (OOB) for each estimator, which gives a validation estimate at no extra training cost:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=200,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1
)
bag.fit(X_train, y_train)
print(f"OOB accuracy: {bag.oob_score_:.4f}")
# bag.oob_decision_function_: shape (m, K), per-sample OOB probabilities
10. Bias-Variance Profile
| Base Learner | Single Variance | ρ (bagged) | Bagging Benefit |
|---|---|---|---|
| Unpruned decision tree | Very high | ~0.2–0.5 | ✅✅ Enormous |
| kNN (k=5) | High | ~0.3–0.6 | ✅ Large |
| MLP (random init) | High | ~0.4–0.6 | ✅ Moderate |
| SVM (RBF) | Medium | ~0.5–0.7 | ✅ Moderate |
| Logistic Regression | Low | ~0.9+ | ❌ Minimal |
The ρ values are approximate — they depend heavily on the dataset and hyperparameters.
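One rough way to estimate ρ for a given base learner and dataset is to train clones on bootstrap samples and measure how correlated their predicted probabilities are on a common evaluation set. The sketch below does this for an unpruned tree and for logistic regression. The helper `bootstrap_prediction_correlation` is hypothetical, the correlation of P(y=1 | x) is only a proxy for the ρ in the variance formula, and the synthetic data is an arbitrary choice.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bootstrap_prediction_correlation(base, X_train, y_train, X_eval, n_models=20, seed=0):
    """Mean pairwise correlation of P(y=1 | x) across bootstrap-trained clones of base."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample of the rows
        model = clone(base).fit(X_train[idx], y_train[idx])
        preds.append(model.predict_proba(X_eval)[:, 1])
    corr = np.corrcoef(np.array(preds))                   # shape (n_models, n_models)
    off_diagonal = corr[~np.eye(n_models, dtype=bool)]    # drop the self-correlations
    return float(off_diagonal.mean())


X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rho_tree = bootstrap_prediction_correlation(DecisionTreeClassifier(), X_train, y_train, X_test)
rho_logreg = bootstrap_prediction_correlation(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test
)

print(f"estimated rho, unpruned tree:       {rho_tree:.2f}")
print(f"estimated rho, logistic regression: {rho_logreg:.2f}")
```

A value near 1 signals a stable learner for which bagging has little variance left to remove; a noticeably lower value suggests bagging is worth trying.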
11. Hyperparameters — Complete Reference
from sklearn.ensemble import BaggingClassifier
BaggingClassifier(
    estimator=None,            # Base learner (default: DecisionTreeClassifier())
    n_estimators=10,           # Number of base estimators (default 10); 100–500 is a common setting
    max_samples=1.0,           # Samples per estimator (int count or float fraction)
    max_features=1.0,          # Features per estimator (int count or float fraction)
    bootstrap=True,            # Sample rows WITH replacement
    bootstrap_features=False,  # Sample features WITH replacement
    oob_score=False,           # Requires bootstrap=True
    warm_start=False,          # Add estimators incrementally across fit calls
    n_jobs=None,               # Default None (a single job); -1 uses all cores
    random_state=None,         # Default None; set an int (e.g. 42) for reproducibility
    verbose=0
)
Priority:
1. estimator: Choose base learner for your data/problem
2. n_estimators: 100–500 for production; more for unstable base learners
3. max_samples: 1.0 (standard); 0.5–0.8 to speed training or add regularization
4. max_features: 1.0 (standard); < 1.0 for random patches / high-dim data
5. bootstrap: True for bagging+OOB; False for pasting
12. Assumptions
| Assumption | Notes |
|---|---|
| IID samples | Bootstrap theory requires exchangeable samples |
| Unstable base learner | Bagging works best for high-variance, low-bias estimators |
| Sufficient B | B ≥ 100 for stable OOB estimate and stable ensemble |
| Base learner correctness | Each base learner must be better than random chance |
13. Advantages
✅ Completely General
Wraps any sklearn-compatible classifier, so bagging can be applied to kNN, SVM, MLP, custom classifiers, etc., not just trees.
✅ Proven Variance Reduction
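For squared error, averaging cannot increase variance, and the reduction is largest for unstable estimators; bias is left roughly unchanged. For 0-1 loss there is no strict guarantee: bagging a poor classifier can make it worse.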
✅ Free OOB Validation
When bootstrap=True, the OOB estimate can stand in for a held-out validation set.
✅ All Sampling Strategies in One API
Bagging, Pasting, Random Subspaces, Random Patches — four distinct methods through the same interface.
✅ Fully Parallelizable
All B estimators are trained independently, so training scales close to linearly with the number of CPU cores.
✅ Simple and Transparent
Easy to understand, easy to debug, easy to inspect individual models.
14. Drawbacks & Limitations
❌ Less Specialized Than Random Forest
For trees, RF's per-split feature randomization produces better decorrelation than BaggingClassifier's per-estimator sampling.
❌ Slow for Expensive Base Learners
100 SVMs or 100 MLPs is prohibitively expensive — keep n_estimators small or use smaller max_samples.
❌ No Built-in Feature Importance
Must aggregate manually from individual estimators.
❌ Memory Intensive
Stores B complete model objects.
❌ Minimal Benefit for Stable Learners
Bagging logistic regression or linear SVM is a waste of compute.
15. Bagging vs. Boosting vs. Stacking
| Property | Bagging | Boosting | Stacking |
|---|---|---|---|
| Training order | Parallel | Sequential | Two sequential levels |
| Error targeted | Variance | Bias | Both |
| Sample weighting | Bootstrap (uniform) | Error-driven | Not applicable |
| Base learner type | Same model | Same model | Different models |
| Combination | Average / vote | Weighted vote | Meta-learner |
| Noise robustness | ✅ High | ⚠️ Variable | ✅ Moderate |
| Overfit risk | Low | Medium–High | Medium |
16. Practical Tips & Gotchas
Bagged kNN — Best Non-Tree Application
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
bag_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1
)
bag_knn.fit(X_train, y_train)
print(f"Accuracy: {bag_knn.score(X_test, y_test):.4f}")
Access Individual Estimators
bag.fit(X_train, y_train)
# Inspect each fitted estimator (X_test is assumed to be a NumPy array)
for i, (clf, features) in enumerate(zip(bag.estimators_, bag.estimators_features_)):
    # Base estimators are fit on integer-encoded labels, so map back through bag.classes_
    preds = bag.classes_[clf.predict(X_test[:, features])]
    print(f"Model {i}: {(preds == y_test).mean():.4f}")
# Which training samples went to each estimator
for i, indices in enumerate(bag.estimators_samples_):
    print(f"Model {i}: {len(set(indices))} unique samples")
Random Patches for High-Dimensional Data
bag_patches = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=200,
    max_samples=0.6,
    max_features=0.4,
    bootstrap=True,
    bootstrap_features=False,
    n_jobs=-1
)
Manual Feature Importance from Bagged Trees
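import numpy as np
# Only works when the base estimator has feature_importances_.
# Each array is indexed by the columns that estimator was trained on, so if
# max_features < 1.0 (or bootstrap_features=True), map positions back through
# bag.estimators_features_ before averaging.
per_tree = np.array([clf.feature_importances_ for clf in bag.estimators_])
importances = per_tree.mean(axis=0)  # mean MDI across all bagged trees
std = per_tree.std(axis=0)           # spread of each importance across trees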
17. When to Use It
Use BaggingClassifier when:
- Your base learner is not a decision tree but you want bagging (kNN, SVM, MLP, custom)
- You need fine-grained sampling control (pasting, subspaces, patches)
- You want to apply bagging to a custom sklearn estimator
- You need to inspect or aggregate individual estimators programmatically
- You want OOB evaluation for a non-tree base learner
Use Random Forest instead when:
- Your base learner is a decision tree: RF's per-split feature randomization usually decorrelates trees better
- You need built-in feature importances efficiently computed
Do not use BaggingClassifier when:
- The base learner is stable (logistic regression, linear SVM)
- Maximum accuracy is needed: gradient-boosted trees (GBT) are often stronger on tabular tasks
Summary
┌──────────────────────────────────────────────────────────────────────┐
│ BAGGING CLASSIFIER AT A GLANCE │
├──────────────────────────────────────────────────────────────────────┤
│ CORE IDEA Train B models on B bootstrap samples, average │
│ PRIMARY Variance reduction for unstable estimators │
│ VARIANCE Var = ρσ² + (1−ρ)σ²/B → ρσ² as B→∞ │
│ BASE LEARNER Any sklearn estimator (kNN, SVM, MLP, Tree, custom) │
│ SAMPLING Bagging | Pasting | Random Subspaces | Random Patches │
│ OOB Free validation when bootstrap=True │
│ STRENGTH Generality, parallelism, principled, proven theory │
│ WEAKNESS No built-in feature importance; less than RF for trees│
│ vs RF RF per-split randomization >> BaggingClassifier+Tree │
│ BEST FOR Bagging non-tree estimators; custom sampling control │
└──────────────────────────────────────────────────────────────────────┘
Bagging is the foundational ensemble idea: train many versions of the same model on different data, then average. Its key insight is that a bootstrap sample is a natural simulation of "what if I had a different training set?", and that averaging such simulations approximates the expected prediction, reducing variance while leaving bias largely unchanged. Random Forest builds directly on this insight, and the optional bootstrapping in Extra-Trees and the row subsampling in stochastic gradient boosting draw on the same idea.