Gradient Boosted Trees (sklearn GBM)

1. What Is sklearn Gradient Boosting?

sklearn provides two gradient boosting implementations:

Both implement the same fundamental algorithm — Forward Stagewise Additive Modeling with differentiable loss functions — but differ dramatically in how they find the best split at each node.

Understanding sklearn's GBM is the best way to understand gradient boosting from first principles before moving to XGBoost, LightGBM, and CatBoost.

Property GradientBoostingClassifier HistGradientBoostingClassifier
Algorithm Friedman GBM (exact) Histogram-based (approximate)
Speed Slow Fast
Large datasets ❌ Not suited ✅ Scales to millions
Missing values ❌ No ✅ Native
Categorical features ❌ Manual encoding ✅ Native (ordinal)
Monotonic constraints ❌ No ✅ Yes
Early stopping ⚠️ Via staged_predict ✅ Built-in
Staged predictions ✅ Yes ❌ No
Multi-class ✅ One tree per class/round ✅ Same

2. Two Implementations: GBC vs. HGBC

The critical split decision:

Dataset size < 10,000 samples:   GradientBoostingClassifier (or either)
Dataset size > 10,000 samples:   HistGradientBoostingClassifier
Missing values present:          HistGradientBoostingClassifier
Need monotonic constraints:      HistGradientBoostingClassifier
Need staged_predict:             GradientBoostingClassifier
Prefer XGBoost/LGB API parity:   HistGradientBoostingClassifier

For new projects on medium-to-large datasets, HGBC is the default choice within sklearn. GBC exists for small datasets and backward compatibility.


3. The Core Algorithm — Friedman's GBM

3.1 Pseudo-Residuals and the Forward Pass

Both implementations share the same mathematical core. Given a loss function L(y, F) and training data {(xᵢ, yᵢ)}:

Initialize the model with the optimal constant prediction:

F₀(x) = argmin_γ Σᵢ L(yᵢ, γ)

For log_loss (binary): F₀ = log(p̄ / (1 − p̄)) where p̄ is the training class rate. For MSE: F₀ = mean(y).

For each boosting round t = 1, ..., T:

  1. Compute pseudo-residuals (negative gradient of loss):
rᵢₜ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)]_{F = Fₜ₋₁}
  1. Fit a regression tree hₜ to {(xᵢ, rᵢₜ)}.

  2. Compute optimal leaf values γⱼₜ for each leaf j (line search):

γⱼₜ = argmin_γ Σ_{xᵢ ∈ leaf j} L(yᵢ, Fₜ₋₁(xᵢ) + γ)
  1. Update:
Fₜ(x) = Fₜ₋₁(x) + α · hₜ(x)

3.2 Leaf Value Computation

For MSE loss, the optimal leaf value is simply the mean of residuals in the leaf — gradient and Hessian are trivial.

For log_loss (binary cross-entropy), the optimal leaf value requires a Newton step:

γⱼ = Σᵢ∈leaf rᵢ / Σᵢ∈leaf |rᵢ|(1 − |rᵢ|)

Where rᵢ = yᵢ − sigmoid(Fₜ₋₁(xᵢ)) are the residuals in probability space.

This is a first-order approximation. XGBoost uses the full second-order (Hessian) correction for a more accurate leaf value — one key reason XGBoost often outperforms sklearn's GBC.


3.3 Shrinkage

Each tree's contribution is scaled by the learning rate α:

Fₜ(x) = Fₜ₋₁(x) + α · hₜ(x)       α ∈ (0, 1]

Smaller α → more trees required for convergence → better generalization (at the cost of training time). The standard empirical finding:

α = 0.1,  n_estimators = 100:   Fast, moderate Accuracy
α = 0.05, n_estimators = 300:   Better
α = 0.01, n_estimators = 1000:  Often best, slow to train

4. Loss Functions in Detail

4.1 log_loss (Deviance)

The default for Classification. Binary cross-entropy:

L(y, F) = log(1 + exp(−2yF))    for y ∈ {−1, +1}

Or equivalently:

L(y, p) = −[y·log(p) + (1−y)·log(1−p)]    where p = sigmoid(F)

Pseudo-residuals:

rᵢ = yᵢ − sigmoid(Fₜ₋₁(xᵢ))   ← prediction errors in probability space

Interpretation: each tree is fit to the difference between true labels and current probability estimates. This is exactly Logistic Regression's gradient.


4.2 exponential

The AdaBoost loss:

L(y, F) = exp(−yF)

Pseudo-residuals:

rᵢ = yᵢ · exp(−yᵢ · Fₜ₋₁(xᵢ))

Setting loss='exponential' in GBC reproduces AdaBoost with Decision Trees as base learners. Rarely used — log_loss is more numerically stable and robust to noise.


4.3 Regression Losses

While GBC/HGBC are classifiers, their regression counterparts (GradientBoostingRegressor, HistGradientBoostingRegressor) expose the full loss menu:

Loss Robust to outliers Use case
squared_error Standard regression
absolute_error ✅ (heavy) When outliers are common
huber ✅ (moderate) Balanced robustness
quantile N/A Prediction intervals

5. Stochastic Gradient Boosting

Friedman (1999) proposed training each tree on a random subsample of the training data (without replacement). This is controlled by subsample:

GradientBoostingClassifier(subsample=0.8)   # Each tree uses 80% of training data

Effects:

GradientBoostingClassifier(subsample=0.8, n_iter_no_change=10)
# n_iter_no_change triggers early stopping using OOB samples when subsample < 1

Similarly, max_features controls how many features are considered at each split — identical to Random Forest's randomization:

GradientBoostingClassifier(max_features='sqrt')   # Random feature subsets per split

6. GradientBoostingClassifier — Internals

6.1 Tree Structure

GBC uses CART regression trees internally (always regression trees — even for Classification). Trees are built with:

The trees are grown using exact split finding — for each candidate feature, every unique threshold is evaluated. This is O(m · p · log m) per tree — slow for large datasets.


6.2 Split Finding

At each node, for each feature f and each threshold t:

Gain(f, t) = Impurity(parent) − [n_left/n · Impurity(left) + n_right/n · Impurity(right)]

For GBM, impurity is measured as the sum of squared pseudo-residuals — the node splits to maximally reduce this sum.

The best (f, t) pair is selected and the node is split. This is exact but scales as O(m · p) per node — prohibitive for large datasets.


6.3 Multi-Class Extension

For K > 2 classes, GBC trains K trees per round — one per class. Each tree fits the gradient of the multinomial log-loss for its class:

rᵢₖₜ = 𝟙[yᵢ = k] − P̂ₜ₋₁(y = k | xᵢ)    for class k

This is the difference between the indicator that the true class is k and the current predicted probability for class k.

Total trees = T × K. For K=10, n_estimators=200 → 2,000 trees. Multi-class GBT is expensive.


7. HistGradientBoostingClassifier — Internals

7.1 Histogram-Based Split Finding

The key innovation: instead of evaluating every unique threshold, HGBC first bins each feature into 255 discrete buckets (a histogram), then finds the best split over buckets.

Original values: [0.13, 0.87, 0.42, 0.19, 0.65, 0.33, ...]  (many unique values)
Binned values:   [12,   215,  89,   34,   163,  72,  ...]    (at most 255 bins)

Split finding then evaluates at most 255 thresholds per feature instead of potentially millions. For 100 features, that's 25,500 candidate splits per node — fast and cache-friendly.

Building the histogram:

This is the same idea that made LightGBM so fast, and HGBC achieves comparable speed within the sklearn API.


7.2 Native Missing Value Handling

HGBC handles NaN values without imputation. For each split:

Evaluate: send all NaN values to the LEFT child
Evaluate: send all NaN values to the RIGHT child
Choose whichever direction minimizes loss

The learned default direction for NaN values is stored per split — a principled, data-driven approach. At prediction time, missing values follow their learned default direction at each node.

# No imputation needed
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()
clf.fit(X_train_with_nans, y_train)   # Works directly

7.3 Monotonic Constraints

HGBC supports monotonic constraints — forcing the model to be monotonically increasing or decreasing with respect to specific features:

# Force model to be increasing in feature 0, decreasing in feature 2, unconstrained elsewhere
clf = HistGradientBoostingClassifier(
    monotonic_cst={0: 1, 2: -1}   # 1 = increasing, -1 = decreasing, 0 = none
)

At each node split, HGBC enforces that all leaf values in the left subtree are ≤ all leaf values in the right subtree (for increasing constraint). This is valuable for:


8. GBC vs. HGBC — Detailed Comparison

Property GradientBoostingClassifier HistGradientBoostingClassifier
Split algorithm Exact (all thresholds) Histogram (255 bins)
Training speed O(m·p·log m) per tree O(m·p) binning + O(255·p) splits
Memory O(m·p) O(255·p) after binning
Dataset size Up to ~10k rows well Millions of rows
Missing values ❌ Requires imputation ✅ Native NaN support
Categorical features ❌ Manual encoding categorical_features param
Monotonic constraints ❌ No ✅ Yes
Early stopping Via n_iter_no_change+subsample Via early_stopping + val set
Staged predictions staged_predict ❌ No
sample_weight ✅ Yes ✅ Yes
warm_Start ✅ Yes ✅ Yes
Interpretability Same (both use Decision Trees) Same
Parallelism Limited (within-tree) Better (histogram construction)

9. Hyperparameters — Deep Dive

9.1 GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

GradientBoostingClassifier(
    loss='log_loss',          # 'log_loss' or 'exponential'
    learning_rate=0.1,        # Shrinkage — most important param
    n_estimators=100,         # Number of trees
    subsample=1.0,            # Fraction of samples per tree (< 1 = stochastic GBM)
    criterion='friedman_mse', # Split quality: 'friedman_mse' or 'squared_error'
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,              # Depth of each tree — default 3 is good
    max_features=None,        # Feature subset per split: None, 'sqrt', 'log2', int
    max_leaf_nodes=None,      # Alternative to max_depth
    min_impurity_decrease=0.0,
    n_iter_no_change=None,    # Early stopping patience
    validation_fraction=0.1,  # Val fraction for early stopping
    tol=1e-4,                 # Min improvement threshold for early stopping
    ccp_alpha=0.0,            # Post-pruning complexity parameter
    warm_Start=False,         # Reuse previous fit (add more trees)
    random_state=42,
    verbose=0
)

Priority order for tuning:

1. learning_rate + n_estimators    (together — smaller LR needs more trees)
2. max_depth or max_leaf_nodes     (tree complexity)
3. subsample                       (regularization + speed)
4. min_samples_leaf                (regularization)
5. max_features                    (additional randomization)

9.2 HistGradientBoostingClassifier

from sklearn.ensemble import HistGradientBoostingClassifier

HistGradientBoostingClassifier(
    loss='log_loss',             # Only 'log_loss' for binary; 'auto' for multi-class
    learning_rate=0.1,
    max_iter=100,                # n_estimators equivalent
    max_leaf_nodes=31,           # Primary complexity control (not max_depth!)
    max_depth=None,              # Optional additional depth limit
    min_samples_leaf=20,         # Default is 20 (not 1 like GBC)
    l2_regularization=0.0,       # L2 penalty on leaf values
    max_bins=255,                # Histogram bins (reduce for speed/regularization)
    categorical_features=None,   # List of categorical column indices
    monotonic_cst=None,          # Monotonic constraints dict
    interaction_cst=None,        # Restrict feature interactions
    warm_Start=False,
    early_stopping='auto',       # 'auto', True, False
    scoring='loss',              # Metric for early stopping
    validation_fraction=0.1,
    n_iter_no_change=10,
    tol=1e-7,
    random_state=42,
    verbose=0
)

Key difference from GBC: max_leaf_nodes=31 (not max_depth=3) is the primary complexity control. 31 leaves ≈ depth-5 tree.


10. Staged Prediction and Learning Curves

GBC's staged_predict / staged_predict_proba allows you to evaluate the model at each boosting round without retraining — invaluable for finding optimal n_estimators:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
import numpy as np

clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                  subsample=0.8, random_state=42)
clf.fit(X_train, y_train)

# Compute loss at each stage
train_scores = []
val_scores   = []

for proba in clf.staged_predict_proba(X_train):
    train_scores.append(log_loss(y_train, proba))

for proba in clf.staged_predict_proba(X_val):
    val_scores.append(log_loss(y_val, proba))

plt.plot(train_scores, label='Train')
plt.plot(val_scores,   label='Validation')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.legend()
plt.title('GBM Learning Curve — Find Optimal n_estimators')

optimal_n = np.argmin(val_scores) + 1
print(f"Optimal n_estimators: {optimal_n}")

HGBC early stopping (simpler):

clf = HistGradientBoostingClassifier(
    max_iter=1000,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    verbose=1
)
clf.fit(X_train, y_train)
print(f"Stopped at iteration: {clf.n_iter_}")

11. Feature Importance & Interpretability

# Impurity-based importance (both implementations)
importances = clf.feature_importances_

# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance
result = permutation_importance(clf, X_val, y_val, n_repeats=20, n_jobs=-1)

# Partial dependence plots (built into sklearn)
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(clf, X_train, features=[0, 1, (0, 1)])

# SHAP (best option)
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)

Note: SHAP's TreeExplainer supports sklearn GBM but may be slower than for XGBoost/LightGBM (less optimized integration). Use HGBC or XGBoost for production SHAP pipelines.


12. The Bias-Variance Profile

Learning rate:  Small α → low variance, higher bias (needs more trees to converge)
max_depth:      Deeper → lower bias, higher variance per tree
n_estimators:   More trees → lower bias (without shrinkage, overfits)
subsample:      < 1.0 → reduces variance (stochastic averaging)
min_samples_leaf: Higher → higher bias, lower variance (smoothing)

The optimal configuration typically has:

Low learning rate (0.01–0.05) + many trees (500–2000) + moderate depth (3–6)
+ subsample (0.7–0.9) + min_samples_leaf tuned to dataset size

13. Assumptions

Assumption Notes
Differentiable loss Required for gradient computation
No feature scaling required Tree splits are scale-invariant
IID samples Boosting assumes fixed distribution; concept drift degrades performance
No distributional assumption Non-parametric; no normality required
No extrapolation Cannot predict beyond training feature ranges

14. Advantages

✅ Reference Implementation

sklearn GBM is the most readable, debuggable gradient boosting implementation. Ideal for learning the algorithm's internals — every step is transparent.

staged_predict for Diagnostics

Unique to GBC — plot learning curves, detect overfitting, find optimal rounds — all from a single fit.

✅ HGBC Competitive with XGBoost/LightGBM on Small-Medium Data

On datasets up to ~500k rows, HGBC often matches XGBoost and LightGBM in Accuracy while staying within the sklearn API.

✅ No Extra Dependencies

Part of sklearn — no additional installation. Critical for deployments where package management is constrained.

✅ Full sklearn Pipeline Integration

Both implementations work seamlessly with Pipeline, GridSearchCV, cross_val_score — the entire sklearn ecosystem.

✅ HGBC Native Categorical & Missing Value Support

Eliminates the most tedious preprocessing steps for tabular data.

✅ Monotonic Constraints (HGBC)

Rare feature among sklearn estimators — essential for regulated domains (finance, medicine).


15. Drawbacks & Limitations

❌ GBC Is Slow on Large Datasets

O(m · p · log m) per tree — impractical beyond 50,000 rows. XGBoost and LightGBM are 10–100x faster.

❌ No GPU Support

Neither GBC nor HGBC supports GPU training. For large datasets needing GPU acceleration, use XGBoost or LightGBM.

❌ No Native Sparse Matrix Support (GBC)

GBC requires dense arrays. For sparse text data, use linear models or XGBoost.

❌ Limited Customization

Compared to XGBoost/LightGBM, sklearn's API has fewer knobs — no custom objectives, no custom metrics, limited callback support.

❌ No Second-Order Approximation (GBC)

GBC uses first-order pseudo-residuals for tree fitting. XGBoost's second-order (Hessian) leaf values are more accurate and better regularized.

❌ HGBC Lacks staged_predict

The speed gain of histogram methods comes at the cost of GBC's staged prediction capability.


16. sklearn GBM vs. XGBoost vs. LightGBM vs. CatBoost

Property sklearn GBC sklearn HGBC XGBoost LightGBM CatBoost
Speed (large data) ❌ Very slow ✅ Fast ✅ Fast ✅✅ Fastest ✅ Fast
GPU support
Missing values ❌ GBC only
Categorical ⚠️ Ordinal ⚠️ Ordinal ✅ Native
Monotonic cst.
2nd order ❌ (GBC) ✅ (HGBC) ❌ (1st order)
Custom loss
staged_predict ✅ GBC only ✅ (iter range)
Extra install ❌ No ❌ No ✅ pip install ✅ pip install ✅ pip install
sklearn Pipeline ✅ (wrapper) ✅ (wrapper) ✅ (wrapper)
Best for Small data, learning Medium data General purpose Large data Categoricals

17. Practical Tips & Gotchas

Choosing Between GBC and HGBC

n_samples = X_train.shape[0]

if n_samples < 10_000:
    # GBC is fine; use staged_predict for diagnostics
    from sklearn.ensemble import GradientBoostingClassifier as GBM
elif n_samples < 1_000_000:
    # HGBC is the right choice
    from sklearn.ensemble import HistGradientBoostingClassifier as GBM
else:
    # Use LightGBM or [[XGBoost]] instead
    import LightGBM as lgb

Canonical GBC Setup

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
import numpy as np

clf = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    max_features='sqrt',
    min_samples_leaf=10,
    random_state=42
)
clf.fit(X_train, y_train)

# Find optimal rounds from staged learning curve
val_losses = [log_loss(y_val, p) for p in clf.staged_predict_proba(X_val)]
best_n = np.argmin(val_losses) + 1

# Retrain with optimal n_estimators
final = GradientBoostingClassifier(
    n_estimators=best_n,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    max_features='sqrt',
    min_samples_leaf=10,
    random_state=42
)
final.fit(X_train, y_train)

Canonical HGBC Setup

from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(
    max_iter=1000,
    learning_rate=0.05,
    max_leaf_nodes=63,
    min_samples_leaf=20,
    l2_regularization=0.1,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    random_state=42
)
clf.fit(X_train, y_train)
print(f"Best iteration: {clf.n_iter_}")

Native Categorical Features (HGBC)

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Mark which columns are categorical
categorical_mask = np.zeros(X_train.shape[1], dtype=bool)
categorical_mask[[2, 5, 7]] = True   # Columns 2, 5, 7 are categorical

clf = HistGradientBoostingClassifier(
    categorical_features=categorical_mask
)
# Categorical columns must be integer-encoded (not one-hot)

Warm [[Start]] — Add Trees Without Retraining

clf = GradientBoostingClassifier(n_estimators=100, warm_[[Start]]=True)
clf.fit(X_train, y_train)

# Add 100 more trees without [[Start]]ing over
clf.n_estimators = 200
clf.fit(X_train, y_train)

18. When to Use It

Use GradientBoostingClassifier when:

Use HistGradientBoostingClassifier when:

Prefer CatBoost when:


Summary

┌─────────────────────────────────────────────────────────────────────┐
│          SKLEARN GRADIENT BOOSTING AT A GLANCE                      │
├─────────────────────────────────────────────────────────────────────┤
│  GBC CORE     Exact split finding, staged_predict, Friedman GBM    │
│  HGBC CORE    Histogram bins, native NaN/categorical, monotonic     │
│  TRAINING     F_t = F_{t-1} + α·h_t  (fit tree to pseudo-residuals)│
│  DEFAULT      GBC: max_depth=3, LR=0.1 | HGBC: max_leaf=31, LR=0.1│
│  BEST ASSET   GBC: staged_predict | HGBC: speed + missing values   │
│  LIMITATION   GBC: slow on large data | Both: no GPU               │
│  vs [[XGBoost]]   Less flexible, slower, but zero extra dependencies   │
│  BEST FOR     sklearn pipelines, medium data, monotonic constraints │
└─────────────────────────────────────────────────────────────────────┘

sklearn's gradient boosting implementations are where most practitioners first encounter the algorithm, and for good reason: GBC's code is readable, its staged predictions are diagnostic gold, and HGBC brings the algorithm firmly into the modern era without leaving the sklearn ecosystem. They are not the fastest or most flexible gradient boosting tools — but they are the best [[Start]]ing point, and for many production workloads, they are entirely sufficient.

Powered by Forestry.md