Gradient Boosted Trees (sklearn GBM)

1. What Is sklearn Gradient Boosting?

sklearn provides two gradient boosting implementations:

GradientBoostingClassifier (GBC) — the original Friedman (2001) algorithm, implemented in pure Python/Cython. Slower but fully transparent and debuggable.
HistGradientBoostingClassifier (HGBC) — a modern histogram-based implementation inspired by LightGBM, orders of magnitude faster on large datasets.

Both implement the same fundamental algorithm — Forward Stagewise Additive Modeling with differentiable loss functions — but differ dramatically in how they find the best split at each node.

Understanding sklearn's GBM is the best way to understand gradient boosting from first principles before moving to XGBoost, LightGBM, and CatBoost.

Property	GradientBoostingClassifier	HistGradientBoostingClassifier
Algorithm	Friedman GBM (exact)	Histogram-based (approximate)
Speed	Slow	Fast
Large datasets	❌ Not suited	✅ Scales to millions
Missing values	❌ No	✅ Native
Categorical features	❌ Manual encoding	✅ Native (ordinal)
Monotonic constraints	❌ No	✅ Yes
Early stopping	⚠️ Via staged_predict	✅ Built-in
Staged predictions	✅ Yes	❌ No
Multi-class	✅ One tree per class/round	✅ Same

2. Two Implementations: GBC vs. HGBC

The critical split decision:

Dataset size < 10,000 samples:   GradientBoostingClassifier (or either)
Dataset size > 10,000 samples:   HistGradientBoostingClassifier
Missing values present:          HistGradientBoostingClassifier
Need monotonic constraints:      HistGradientBoostingClassifier
Need staged_predict:             GradientBoostingClassifier
Prefer XGBoost/LGB API parity:   HistGradientBoostingClassifier

For new projects on medium-to-large datasets, HGBC is the default choice within sklearn. GBC exists for small datasets and backward compatibility.

3. The Core Algorithm — Friedman's GBM

3.1 Pseudo-Residuals and the Forward Pass

Both implementations share the same mathematical core. Given a loss function L(y, F) and training data {(xᵢ, yᵢ)}:

Initialize the model with the optimal constant prediction:

F₀(x) = argmin_γ Σᵢ L(yᵢ, γ)

For log_loss (binary): F₀ = log(p̄ / (1 − p̄)) where p̄ is the training class rate. For MSE: F₀ = mean(y).

For each boosting round t = 1, ..., T:

Compute pseudo-residuals (negative gradient of loss):

rᵢₜ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)]_{F = Fₜ₋₁}

Fit a regression tree hₜ to {(xᵢ, rᵢₜ)}.
Compute optimal leaf values γⱼₜ for each leaf j (line search):

γⱼₜ = argmin_γ Σ_{xᵢ ∈ leaf j} L(yᵢ, Fₜ₋₁(xᵢ) + γ)

Update:

Fₜ(x) = Fₜ₋₁(x) + α · hₜ(x)

3.2 Leaf Value Computation

For MSE loss, the optimal leaf value is simply the mean of residuals in the leaf — gradient and Hessian are trivial.

For log_loss (binary cross-entropy), the optimal leaf value requires a Newton step:

γⱼ = Σᵢ∈leaf rᵢ / Σᵢ∈leaf |rᵢ|(1 − |rᵢ|)

Where rᵢ = yᵢ − sigmoid(Fₜ₋₁(xᵢ)) are the residuals in probability space.

This is a first-order approximation. XGBoost uses the full second-order (Hessian) correction for a more accurate leaf value — one key reason XGBoost often outperforms sklearn's GBC.

3.3 Shrinkage

Each tree's contribution is scaled by the learning rate α:

Fₜ(x) = Fₜ₋₁(x) + α · hₜ(x)       α ∈ (0, 1]

Smaller α → more trees required for convergence → better generalization (at the cost of training time). The standard empirical finding:

α = 0.1,  n_estimators = 100:   Fast, moderate Accuracy
α = 0.05, n_estimators = 300:   Better
α = 0.01, n_estimators = 1000:  Often best, slow to train

4. Loss Functions in Detail

4.1 log_loss (Deviance)

The default for Classification. Binary cross-entropy:

L(y, F) = log(1 + exp(−2yF))    for y ∈ {−1, +1}

Or equivalently:

L(y, p) = −[y·log(p) + (1−y)·log(1−p)]    where p = sigmoid(F)

Pseudo-residuals:

rᵢ = yᵢ − sigmoid(Fₜ₋₁(xᵢ))   ← prediction errors in probability space

Interpretation: each tree is fit to the difference between true labels and current probability estimates. This is exactly Logistic Regression's gradient.

4.2 exponential

The AdaBoost loss:

L(y, F) = exp(−yF)

Pseudo-residuals:

rᵢ = yᵢ · exp(−yᵢ · Fₜ₋₁(xᵢ))

Setting loss='exponential' in GBC reproduces AdaBoost with Decision Trees as base learners. Rarely used — log_loss is more numerically stable and robust to noise.

4.3 Regression Losses

While GBC/HGBC are classifiers, their regression counterparts (GradientBoostingRegressor, HistGradientBoostingRegressor) expose the full loss menu:

Loss	Robust to outliers	Use case
`squared_error`	❌	Standard regression
`absolute_error`	✅ (heavy)	When outliers are common
`huber`	✅ (moderate)	Balanced robustness
`quantile`	N/A	Prediction intervals

5. Stochastic Gradient Boosting

Friedman (1999) proposed training each tree on a random subsample of the training data (without replacement). This is controlled by subsample:

GradientBoostingClassifier(subsample=0.8)   # Each tree uses 80% of training data

Effects:

Variance reduction — different trees see different data → more decorrelated
Speed — fewer samples per tree
Regularization — harder to overfit any single training sample
Out-of-bag estimate — held-out 20% can be used for free validation

GradientBoostingClassifier(subsample=0.8, n_iter_no_change=10)
# n_iter_no_change triggers early stopping using OOB samples when subsample < 1

Similarly, max_features controls how many features are considered at each split — identical to Random Forest's randomization:

GradientBoostingClassifier(max_features='sqrt')   # Random feature subsets per split

6. GradientBoostingClassifier — Internals

6.1 Tree Structure

GBC uses CART regression trees internally (always regression trees — even for Classification). Trees are built with:

max_depth: default 3 — shallow trees are the norm
min_samples_split, min_samples_leaf: pruning
max_leaf_nodes: alternative to max_depth (limits total leaves directly)

The trees are grown using exact split finding — for each candidate feature, every unique threshold is evaluated. This is O(m · p · log m) per tree — slow for large datasets.

6.2 Split Finding

At each node, for each feature f and each threshold t:

Gain(f, t) = Impurity(parent) − [n_left/n · Impurity(left) + n_right/n · Impurity(right)]

For GBM, impurity is measured as the sum of squared pseudo-residuals — the node splits to maximally reduce this sum.

The best (f, t) pair is selected and the node is split. This is exact but scales as O(m · p) per node — prohibitive for large datasets.

6.3 Multi-Class Extension

For K > 2 classes, GBC trains K trees per round — one per class. Each tree fits the gradient of the multinomial log-loss for its class:

rᵢₖₜ = 𝟙[yᵢ = k] − P̂ₜ₋₁(y = k | xᵢ)    for class k

This is the difference between the indicator that the true class is k and the current predicted probability for class k.

Total trees = T × K. For K=10, n_estimators=200 → 2,000 trees. Multi-class GBT is expensive.

7. HistGradientBoostingClassifier — Internals

7.1 Histogram-Based Split Finding

The key innovation: instead of evaluating every unique threshold, HGBC first bins each feature into 255 discrete buckets (a histogram), then finds the best split over buckets.

Original values: [0.13, 0.87, 0.42, 0.19, 0.65, 0.33, ...]  (many unique values)
Binned values:   [12,   215,  89,   34,   163,  72,  ...]    (at most 255 bins)

Split finding then evaluates at most 255 thresholds per feature instead of potentially millions. For 100 features, that's 25,500 candidate splits per node — fast and cache-friendly.

Building the histogram:

Binning is computed once at the Start of training — O(m · p)
Each split evaluates histogram sums — O(255 · p) per node
Node histograms can be computed from parent's histogram minus sibling's histogram — O(255) subtraction

This is the same idea that made LightGBM so fast, and HGBC achieves comparable speed within the sklearn API.

7.2 Native Missing Value Handling

HGBC handles NaN values without imputation. For each split:

Evaluate: send all NaN values to the LEFT child
Evaluate: send all NaN values to the RIGHT child
Choose whichever direction minimizes loss

The learned default direction for NaN values is stored per split — a principled, data-driven approach. At prediction time, missing values follow their learned default direction at each node.

# No imputation needed
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()
clf.fit(X_train_with_nans, y_train)   # Works directly

7.3 Monotonic Constraints

HGBC supports monotonic constraints — forcing the model to be monotonically increasing or decreasing with respect to specific features:

# Force model to be increasing in feature 0, decreasing in feature 2, unconstrained elsewhere
clf = HistGradientBoostingClassifier(
    monotonic_cst={0: 1, 2: -1}   # 1 = increasing, -1 = decreasing, 0 = none
)

At each node split, HGBC enforces that all leaf values in the left subtree are ≤ all leaf values in the right subtree (for increasing constraint). This is valuable for:

Credit scoring (more income → lower risk, monotonically)
Pricing models (higher quantity → lower unit price)
Medical dosing (more drug → more effect, expected monotonically)

8. GBC vs. HGBC — Detailed Comparison

Property	GradientBoostingClassifier	HistGradientBoostingClassifier
Split algorithm	Exact (all thresholds)	Histogram (255 bins)
Training speed	O(m·p·log m) per tree	O(m·p) binning + O(255·p) splits
Memory	O(m·p)	O(255·p) after binning
Dataset size	Up to ~10k rows well	Millions of rows
Missing values	❌ Requires imputation	✅ Native NaN support
Categorical features	❌ Manual encoding	✅ `categorical_features` param
Monotonic constraints	❌ No	✅ Yes
Early stopping	Via `n_iter_no_change`+subsample	Via `early_stopping` + val set
Staged predictions	✅ `staged_predict`	❌ No
sample_weight	✅ Yes	✅ Yes
warm_Start	✅ Yes	✅ Yes
Interpretability	Same (both use Decision Trees)	Same
Parallelism	Limited (within-tree)	Better (histogram construction)

9. Hyperparameters — Deep Dive

9.1 GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

GradientBoostingClassifier(
    loss='log_loss',          # 'log_loss' or 'exponential'
    learning_rate=0.1,        # Shrinkage — most important param
    n_estimators=100,         # Number of trees
    subsample=1.0,            # Fraction of samples per tree (< 1 = stochastic GBM)
    criterion='friedman_mse', # Split quality: 'friedman_mse' or 'squared_error'
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,              # Depth of each tree — default 3 is good
    max_features=None,        # Feature subset per split: None, 'sqrt', 'log2', int
    max_leaf_nodes=None,      # Alternative to max_depth
    min_impurity_decrease=0.0,
    n_iter_no_change=None,    # Early stopping patience
    validation_fraction=0.1,  # Val fraction for early stopping
    tol=1e-4,                 # Min improvement threshold for early stopping
    ccp_alpha=0.0,            # Post-pruning complexity parameter
    warm_Start=False,         # Reuse previous fit (add more trees)
    random_state=42,
    verbose=0
)

Priority order for tuning:

1. learning_rate + n_estimators    (together — smaller LR needs more trees)
2. max_depth or max_leaf_nodes     (tree complexity)
3. subsample                       (regularization + speed)
4. min_samples_leaf                (regularization)
5. max_features                    (additional randomization)

9.2 HistGradientBoostingClassifier

from sklearn.ensemble import HistGradientBoostingClassifier

HistGradientBoostingClassifier(
    loss='log_loss',             # Only 'log_loss' for binary; 'auto' for multi-class
    learning_rate=0.1,
    max_iter=100,                # n_estimators equivalent
    max_leaf_nodes=31,           # Primary complexity control (not max_depth!)
    max_depth=None,              # Optional additional depth limit
    min_samples_leaf=20,         # Default is 20 (not 1 like GBC)
    l2_regularization=0.0,       # L2 penalty on leaf values
    max_bins=255,                # Histogram bins (reduce for speed/regularization)
    categorical_features=None,   # List of categorical column indices
    monotonic_cst=None,          # Monotonic constraints dict
    interaction_cst=None,        # Restrict feature interactions
    warm_Start=False,
    early_stopping='auto',       # 'auto', True, False
    scoring='loss',              # Metric for early stopping
    validation_fraction=0.1,
    n_iter_no_change=10,
    tol=1e-7,
    random_state=42,
    verbose=0
)

Key difference from GBC: max_leaf_nodes=31 (not max_depth=3) is the primary complexity control. 31 leaves ≈ depth-5 tree.

10. Staged Prediction and Learning Curves

GBC's staged_predict / staged_predict_proba allows you to evaluate the model at each boosting round without retraining — invaluable for finding optimal n_estimators:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
import numpy as np

clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                  subsample=0.8, random_state=42)
clf.fit(X_train, y_train)

# Compute loss at each stage
train_scores = []
val_scores   = []

for proba in clf.staged_predict_proba(X_train):
    train_scores.append(log_loss(y_train, proba))

for proba in clf.staged_predict_proba(X_val):
    val_scores.append(log_loss(y_val, proba))

plt.plot(train_scores, label='Train')
plt.plot(val_scores,   label='Validation')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.legend()
plt.title('GBM Learning Curve — Find Optimal n_estimators')

optimal_n = np.argmin(val_scores) + 1
print(f"Optimal n_estimators: {optimal_n}")

HGBC early stopping (simpler):

clf = HistGradientBoostingClassifier(
    max_iter=1000,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    verbose=1
)
clf.fit(X_train, y_train)
print(f"Stopped at iteration: {clf.n_iter_}")

11. Feature Importance & Interpretability

# Impurity-based importance (both implementations)
importances = clf.feature_importances_

# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance
result = permutation_importance(clf, X_val, y_val, n_repeats=20, n_jobs=-1)

# Partial dependence plots (built into sklearn)
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(clf, X_train, features=[0, 1, (0, 1)])

# SHAP (best option)
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)

Note: SHAP's TreeExplainer supports sklearn GBM but may be slower than for XGBoost/LightGBM (less optimized integration). Use HGBC or XGBoost for production SHAP pipelines.

12. The Bias-Variance Profile

Learning rate:  Small α → low variance, higher bias (needs more trees to converge)
max_depth:      Deeper → lower bias, higher variance per tree
n_estimators:   More trees → lower bias (without shrinkage, overfits)
subsample:      < 1.0 → reduces variance (stochastic averaging)
min_samples_leaf: Higher → higher bias, lower variance (smoothing)

The optimal configuration typically has:

Low learning rate (0.01–0.05) + many trees (500–2000) + moderate depth (3–6)
+ subsample (0.7–0.9) + min_samples_leaf tuned to dataset size

13. Assumptions

Assumption	Notes
Differentiable loss	Required for gradient computation
No feature scaling required	Tree splits are scale-invariant
IID samples	Boosting assumes fixed distribution; concept drift degrades performance
No distributional assumption	Non-parametric; no normality required
No extrapolation	Cannot predict beyond training feature ranges

14. Advantages

✅ Reference Implementation

sklearn GBM is the most readable, debuggable gradient boosting implementation. Ideal for learning the algorithm's internals — every step is transparent.

✅ `staged_predict` for Diagnostics

Unique to GBC — plot learning curves, detect overfitting, find optimal rounds — all from a single fit.

✅ HGBC Competitive with XGBoost/LightGBM on Small-Medium Data

On datasets up to ~500k rows, HGBC often matches XGBoost and LightGBM in Accuracy while staying within the sklearn API.

✅ No Extra Dependencies

Part of sklearn — no additional installation. Critical for deployments where package management is constrained.

✅ Full sklearn Pipeline Integration

Both implementations work seamlessly with Pipeline, GridSearchCV, cross_val_score — the entire sklearn ecosystem.

✅ HGBC Native Categorical & Missing Value Support

Eliminates the most tedious preprocessing steps for tabular data.

✅ Monotonic Constraints (HGBC)

Rare feature among sklearn estimators — essential for regulated domains (finance, medicine).

15. Drawbacks & Limitations

❌ GBC Is Slow on Large Datasets

O(m · p · log m) per tree — impractical beyond 50,000 rows. XGBoost and LightGBM are 10–100x faster.

❌ No GPU Support

Neither GBC nor HGBC supports GPU training. For large datasets needing GPU acceleration, use XGBoost or LightGBM.

❌ No Native Sparse Matrix Support (GBC)

GBC requires dense arrays. For sparse text data, use linear models or XGBoost.

❌ Limited Customization

Compared to XGBoost/LightGBM, sklearn's API has fewer knobs — no custom objectives, no custom metrics, limited callback support.

❌ No Second-Order Approximation (GBC)

GBC uses first-order pseudo-residuals for tree fitting. XGBoost's second-order (Hessian) leaf values are more accurate and better regularized.

❌ HGBC Lacks `staged_predict`

The speed gain of histogram methods comes at the cost of GBC's staged prediction capability.

16. sklearn GBM vs. XGBoost vs. LightGBM vs. CatBoost

Property	sklearn GBC	sklearn HGBC	XGBoost	LightGBM	CatBoost
Speed (large data)	❌ Very slow	✅ Fast	✅ Fast	✅✅ Fastest	✅ Fast
GPU support	❌	❌	✅	✅	✅
Missing values	❌ GBC only	✅	✅	✅	✅
Categorical	❌	⚠️ Ordinal	❌	⚠️ Ordinal	✅ Native
Monotonic cst.	❌	✅	✅	✅	✅
2nd order	❌ (GBC)	✅ (HGBC)	✅	✅	❌ (1st order)
Custom loss	❌	❌	✅	✅	✅
staged_predict	✅ GBC only	❌	❌	❌	✅ (iter range)
Extra install	❌ No	❌ No	✅ pip install	✅ pip install	✅ pip install
sklearn Pipeline	✅	✅	✅ (wrapper)	✅ (wrapper)	✅ (wrapper)
Best for	Small data, learning	Medium data	General purpose	Large data	Categoricals

17. Practical Tips & Gotchas

Choosing Between GBC and HGBC

n_samples = X_train.shape[0]

if n_samples < 10_000:
    # GBC is fine; use staged_predict for diagnostics
    from sklearn.ensemble import GradientBoostingClassifier as GBM
elif n_samples < 1_000_000:
    # HGBC is the right choice
    from sklearn.ensemble import HistGradientBoostingClassifier as GBM
else:
    # Use LightGBM or [[XGBoost]] instead
    import LightGBM as lgb

Canonical GBC Setup

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
import numpy as np

clf = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    max_features='sqrt',
    min_samples_leaf=10,
    random_state=42
)
clf.fit(X_train, y_train)

# Find optimal rounds from staged learning curve
val_losses = [log_loss(y_val, p) for p in clf.staged_predict_proba(X_val)]
best_n = np.argmin(val_losses) + 1

# Retrain with optimal n_estimators
final = GradientBoostingClassifier(
    n_estimators=best_n,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    max_features='sqrt',
    min_samples_leaf=10,
    random_state=42
)
final.fit(X_train, y_train)

Canonical HGBC Setup

from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(
    max_iter=1000,
    learning_rate=0.05,
    max_leaf_nodes=63,
    min_samples_leaf=20,
    l2_regularization=0.1,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    random_state=42
)
clf.fit(X_train, y_train)
print(f"Best iteration: {clf.n_iter_}")

Native Categorical Features (HGBC)

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Mark which columns are categorical
categorical_mask = np.zeros(X_train.shape[1], dtype=bool)
categorical_mask[[2, 5, 7]] = True   # Columns 2, 5, 7 are categorical

clf = HistGradientBoostingClassifier(
    categorical_features=categorical_mask
)
# Categorical columns must be integer-encoded (not one-hot)

Warm [[Start]] — Add Trees Without Retraining

clf = GradientBoostingClassifier(n_estimators=100, warm_[[Start]]=True)
clf.fit(X_train, y_train)

# Add 100 more trees without [[Start]]ing over
clf.n_estimators = 200
clf.fit(X_train, y_train)

18. When to Use It

Use GradientBoostingClassifier when:

Dataset is small (< 10k rows) and you want the sklearn ecosystem
You need staged_predict for learning curve diagnostics
You want the most transparent gradient boosting for understanding or teaching
No extra dependencies are acceptable

Use HistGradientBoostingClassifier when:

Dataset is medium (10k–1M rows) within the sklearn ecosystem
Missing values are present and you don't want to impute
Monotonic constraints are required (finance, medicine)
Categorical features without one-hot encoding headache
You need [[LightGBM]]-quality speed with sklearn API

Prefer CatBoost when:

Dataset is very large (> 1M rows)
GPU training is needed
Custom loss functions are required
Maximum [[Accuracy]] via extensive hyperparameter tuning

Summary

┌─────────────────────────────────────────────────────────────────────┐
│          SKLEARN GRADIENT BOOSTING AT A GLANCE                      │
├─────────────────────────────────────────────────────────────────────┤
│  GBC CORE     Exact split finding, staged_predict, Friedman GBM    │
│  HGBC CORE    Histogram bins, native NaN/categorical, monotonic     │
│  TRAINING     F_t = F_{t-1} + α·h_t  (fit tree to pseudo-residuals)│
│  DEFAULT      GBC: max_depth=3, LR=0.1 | HGBC: max_leaf=31, LR=0.1│
│  BEST ASSET   GBC: staged_predict | HGBC: speed + missing values   │
│  LIMITATION   GBC: slow on large data | Both: no GPU               │
│  vs [[XGBoost]]   Less flexible, slower, but zero extra dependencies   │
│  BEST FOR     sklearn pipelines, medium data, monotonic constraints │
└─────────────────────────────────────────────────────────────────────┘

sklearn's gradient boosting implementations are where most practitioners first encounter the algorithm, and for good reason: GBC's code is readable, its staged predictions are diagnostic gold, and HGBC brings the algorithm firmly into the modern era without leaving the sklearn ecosystem. They are not the fastest or most flexible gradient boosting tools — but they are the best [[Start]]ing point, and for many production workloads, they are entirely sufficient.