XGBoost

eXtreme Gradient Boosting

🔗 Related boosting algorithms:

LightGBM — Microsoft's fast gradient boosting

CatBoost — Yandex's categorical-focused boosting

Gradient Boosted Trees (sklearn GBM) — Scikit-learn implementation

AdaBoost — Classic adaptive boosting

Random Forest — Bagging alternative

HistGradientBoostingClassifier — Histogram-based sklearn variant

"The algorithm that won everything, until it didn't — and then kept winning anyway."

1. What Is XGBoost?

XGBoost (eXtreme Gradient Boosting) is an optimized, scalable gradient boosting library created by Tianqi Chen, first released in 2014 and described by Chen and Carlos Guestrin (University of Washington) in their 2016 KDD paper. It extends Friedman's original gradient boosting framework with three major contributions:

Mathematical: Second-order Taylor expansion of the loss → more accurate leaf values and a closed-form split gain formula
Algorithmic: Weighted quantile sketch for approximate split finding on large data; sparsity-aware split finding for missing values
Systems: Column blocks for cache-efficient access; out-of-core computation for datasets larger than RAM; multi-threading across features

XGBoost dominated competitive machine learning on tabular data from roughly 2016 to 2018 and remains one of the most widely deployed ML libraries in production systems. It appears in a large share of published Kaggle winning solutions for structured/tabular competitions.

2. Historical Context and Impact

Before XGBoost, gradient boosting was available in libraries such as sklearn's GradientBoostingClassifier: correct but slow, which limited practical use to smaller datasets.

XGBoost changed this with a combination of mathematical refinement and engineering excellence:

2014: Chen releases first XGBoost implementation
2016: SIGKDD paper published — "XGBoost: A Scalable Tree Boosting System"
2015: 17 of the 29 challenge-winning solutions published on Kaggle's blog that year used XGBoost (figure reported in the XGBoost paper)
2017: LightGBM released (Microsoft) — faster on very large datasets
2017: CatBoost released (Yandex) — better categorical handling
2019: XGBoost adds GPU histogram (tree_method='gpu_hist') closing the LightGBM speed gap
2023: XGBoost 2.0 released, with a unified device parameter (device='cuda'), hist as the default tree method, and gpu_hist deprecated

The paper is one of the most cited in applied ML — over 30,000 citations. "Have you tried XGBoost?" became the standard first question for any tabular data problem.

3. Core Mathematical Innovation — Second-Order Approximation

This is XGBoost's most important contribution. Standard GBM fits trees to first-order pseudo-residuals. XGBoost uses a second-order Taylor expansion for a fundamentally more principled objective.

3.1 The Objective Function

At boosting round t, XGBoost minimizes:

Obj^(t) = Σᵢ L(yᵢ, ŷᵢ^(t)) + Σₖ₌₁ᵗ Ω(fₖ)

Where:

L(yᵢ, ŷᵢ^(t)) = differentiable loss on training data
Ω(fₜ) = γT + ½λΣⱼwⱼ² = regularization on tree complexity (T = number of leaves, wⱼ = leaf values)

The regularization term γT penalizes the number of leaves (encourages simpler trees), while ½λΣwⱼ² penalizes large leaf values (L2 shrinkage).

3.2 Taylor Expansion of the Loss

Since we're adding a new tree fₜ to the existing model ŷᵢ^(t-1):

ŷᵢ^(t) = ŷᵢ^(t-1) + fₜ(xᵢ)

Expand the loss around the current predictions using a second-order Taylor expansion:

L(yᵢ, ŷᵢ^(t)) ≈ L(yᵢ, ŷᵢ^(t-1)) + gᵢ·fₜ(xᵢ) + ½hᵢ·fₜ(xᵢ)²

Where:

gᵢ = ∂L(yᵢ, ŷᵢ^(t-1)) / ∂ŷᵢ^(t-1)     (first-order gradient)
hᵢ = ∂²L(yᵢ, ŷᵢ^(t-1)) / ∂(ŷᵢ^(t-1))² (second-order gradient / Hessian)

For log loss (binary classification), with ŷ the raw score (log-odds):

p = sigmoid(ŷ)
g = p − y          (prediction error in probability space)
h = p(1 − p)       (variance of Bernoulli — weight of this sample)

The Hessian h = p(1−p) is the variance of the Bernoulli distribution — samples near the decision boundary (p ≈ 0.5) have high Hessian and get upweighted in split finding. This gives XGBoost a natural focus on uncertain examples.
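
As a rough numerical check of the expansion above, the sketch below computes g and h for log loss and compares the exact loss after a small step with its first- and second-order approximations. The helper names (`log_loss`, `grad_hess`) and the step size are illustrative choices, not part of XGBoost's API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, z):
    """Binary log loss as a function of the raw score z (log-odds)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_hess(y, z):
    """First and second derivatives of log loss with respect to the raw score."""
    p = sigmoid(z)
    return p - y, p * (1 - p)

# One positive sample, current raw score 0 (p = 0.5), hypothetical tree output 0.3
y, z_prev, step = 1, 0.0, 0.3
g, h = grad_hess(y, z_prev)

exact = log_loss(y, z_prev + step)
first_order = log_loss(y, z_prev) + g * step
second_order = first_order + 0.5 * h * step ** 2

print(f"g = {g:.3f}, h = {h:.3f}")
print(f"exact = {exact:.5f}, first-order = {first_order:.5f}, second-order = {second_order:.5f}")
```

For this step the second-order estimate lands much closer to the exact loss than the first-order one, which is the motivation for using the Hessian.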

3.3 The Simplified Objective

Dropping constants (terms independent of fₜ), the objective to minimize at round t is:

Obj^(t) ≈ Σᵢ [gᵢ·fₜ(xᵢ) + ½hᵢ·fₜ(xᵢ)²] + Ω(fₜ)

For a tree with T leaves, where sample set of leaf j is Iⱼ:

Obj^(t) = Σⱼ₌₁ᵀ [(Σᵢ∈Iⱼ gᵢ)·wⱼ + ½(Σᵢ∈Iⱼ hᵢ + λ)·wⱼ²] + γT

This is a sum of independent quadratics in each leaf value wⱼ — each can be optimized independently.

3.4 Optimal Leaf Values

For each leaf j, the objective is quadratic in wⱼ. Setting derivative to zero:

∂Obj/∂wⱼ = Gⱼ + (Hⱼ + λ)·wⱼ = 0

→  w*ⱼ = −Gⱼ / (Hⱼ + λ)

Where Gⱼ = Σᵢ∈Iⱼ gᵢ and Hⱼ = Σᵢ∈Iⱼ hᵢ are the sum of gradients and Hessians in leaf j.

Interpretation:

Gⱼ is the total residual signal in the leaf
Hⱼ + λ is the total curvature in the leaf plus the L2 penalty; the larger it is, the smaller the step taken
Larger λ → smaller leaf values → more conservative updates → less overfitting

Substituting back:

Obj*(leaf j) = −½ · Gⱼ² / (Hⱼ + λ)

The optimal objective value for leaf j is −½ Gⱼ²/(Hⱼ + λ) — the "score" of a leaf. The more signal concentrated in a leaf, the lower (better) the objective.
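
A small worked example may help here. The sketch below evaluates w* = −G/(H + λ) and the leaf score for three made-up samples, and checks that the closed-form weight really minimizes the per-leaf quadratic. The function names and the gradient values are illustrative assumptions.

```python
def leaf_weight(G, H, lam):
    """Closed-form optimal leaf value w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Objective value at the optimum: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * G * G / (H + lam)

def leaf_objective(w, G, H, lam):
    """Per-leaf quadratic G*w + 0.5*(H + lambda)*w^2 (the gamma term is omitted)."""
    return G * w + 0.5 * (H + lam) * w * w

# Made-up gradient statistics for three samples that fall in one leaf
g = [-0.5, -0.4, 0.2]
h = [0.25, 0.24, 0.16]
G, H = sum(g), sum(h)

weights = {}
for lam in (0.0, 1.0, 10.0):
    w = leaf_weight(G, H, lam)
    weights[lam] = w
    print(f"lambda = {lam:>4}: w* = {w:.4f}, score = {leaf_score(G, H, lam):.4f}")
```

Increasing λ shrinks the leaf value toward zero, matching the interpretation given above.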

3.5 The Split Gain Formula

The gain of splitting leaf j into left and right children:

Gain = ½ · [GL²/(HL + λ) + GR²/(HR + λ) − G²/(H + λ)] − γ

Where GL, GR, HL, HR are gradient/Hessian sums in left/right children and G = GL + GR, H = HL + HR.

This formula is XGBoost's most important algorithmic contribution:

Computed without fitting a tree — pure arithmetic on gradient/Hessian sums
Includes γ: minimum gain required to justify a split — acts as pruning
Includes λ: L2 regularization on leaf values — shrinks toward zero
Computable in O(1) per candidate split, given running sums of gradients and Hessians over the sorted feature values

For each candidate split, XGBoost evaluates this formula and picks the split with maximum gain. If no candidate split has positive gain, the node becomes a leaf.
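
The gain formula can be written directly as a few lines of arithmetic. This is a minimal sketch with an illustrative function name (`split_gain`) and made-up gradient sums, not the library's internal code.

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain = 0.5 * [GL^2/(HL+lam) + GR^2/(HR+lam) - G^2/(H+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Children with opposite gradient signs: the split separates the signal well
print(split_gain(GL=-2.0, HL=1.0, GR=2.0, HR=1.0, lam=1.0, gamma=0.0))

# The same split no longer pays for itself once gamma exceeds the raw gain
print(split_gain(GL=-2.0, HL=1.0, GR=2.0, HR=1.0, lam=1.0, gamma=3.0))

# Children with the same gradient direction: splitting adds nothing
print(split_gain(GL=1.0, HL=1.0, GR=1.0, HR=1.0, lam=1.0, gamma=0.0))
```

The third call returns a negative gain even with γ = 0, because λ penalizes the two smaller children more than the single parent.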

4. Regularization in XGBoost

XGBoost has more explicit regularization than any standard GBM:

Parameter	Effect	Default
`gamma` (γ)	Minimum gain to split a node — larger = more pruning	0
`lambda` (λ)	L2 regularization on leaf values — reduces overfitting	1.0
`alpha` (α)	L1 regularization on leaf values — sparsifies leaf values	0
`max_depth`	Maximum tree depth	6
`min_child_weight`	Minimum Hessian sum required in a child leaf — prevents tiny splits	1
`subsample`	Row sampling per tree (stochastic boosting)	1.0
`colsample_bytree`	Column sampling per tree	1.0
`colsample_bylevel`	Column sampling per tree level	1.0
`colsample_bynode`	Column sampling per split node	1.0
`learning_rate` (η)	Shrinkage factor	0.3

min_child_weight requires the sum of Hessians in each child to be ≥ min_child_weight (LightGBM exposes the same idea as min_sum_hessian_in_leaf). For squared error, hᵢ = 1, so it is exactly a minimum sample count; for log loss, hᵢ = p(1−p), so it acts as a minimum "effective sample count" in which confidently predicted samples count for little. It is one of the most effective regularization parameters.
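
To make the "effective sample count" reading concrete, the sketch below sums log-loss Hessians for two hypothetical leaves of ten samples each. The helper names and probabilities are illustrative assumptions.

```python
def hessian_sum(probs):
    """Sum of log-loss Hessians p * (1 - p) over the samples in a leaf."""
    return sum(p * (1.0 - p) for p in probs)

def child_is_valid(probs, min_child_weight=1.0):
    """A child is kept only if its Hessian sum reaches min_child_weight."""
    return hessian_sum(probs) >= min_child_weight

uncertain_leaf = [0.5] * 10     # ten samples the model is unsure about
confident_leaf = [0.99] * 10    # ten samples the model already predicts confidently

print(hessian_sum(uncertain_leaf), child_is_valid(uncertain_leaf))
print(hessian_sum(confident_leaf), child_is_valid(confident_leaf))
```

Both leaves hold ten samples, but only the uncertain one clears the default threshold of 1.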

5. Tree Growing Strategies

5.1 Exact Greedy Algorithm

For small-to-medium datasets, XGBoost evaluates every possible split on every feature:

For each node at the current tree level:
    For each feature f:
        Sort instances by feature value
        For each candidate threshold:
            Compute Gain(f, threshold) using the split gain formula
    Split on the best (f, threshold) found across all features

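Complexity per tree: O(d · ‖x‖₀ · log m) when each node re-sorts its data, where d = maximum depth, ‖x‖₀ = number of non-missing entries (at most m · p for m samples and p features). With column blocks sorted once up front, this drops to O(d · ‖x‖₀) per tree plus a one-time O(‖x‖₀ · log m) sort.

On dense data the unsorted bound matches the O(m · p · log m) per-level cost of sklearn GBC; XGBoost's column block structure removes the repeated sorting and makes the scan cache-friendly, so it is significantly faster in practice.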
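
The inner loops of the exact greedy search can be sketched for a single feature as follows. This is a simplified illustration: the function name, the midpoint threshold, and the toy data are assumptions, and the real implementation handles missing values, min_child_weight, and many features at once.

```python
def best_split_for_feature(x, g, h, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order, keeping running left-side sums."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    parent_score = G * G / (H + lam)

    GL = HL = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]
        HL += h[i]
        if x[i] == x[nxt]:
            continue  # no threshold can separate two equal values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent_score) - gamma
        if gain > best_gain:
            # Assumption: place the threshold midway between neighbouring values
            best_gain, best_threshold = gain, 0.5 * (x[i] + x[nxt])
    return best_gain, best_threshold

x = [4.0, 1.0, 3.0, 2.0]
g = [1.0, -1.0, 1.0, -1.0]   # negative gradients for small x, positive for large x
h = [1.0, 1.0, 1.0, 1.0]
gain, threshold = best_split_for_feature(x, g, h)
print(gain, threshold)
```

Each candidate costs O(1) once the data is sorted, because only the running sums GL and HL change between neighbouring thresholds.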

5.2 Approximate Algorithm (Weighted Quantile Sketch)

For large datasets, XGBoost computes approximate quantiles of each feature and evaluates splits only at these quantile boundaries.

The key insight: use weighted quantiles where the weight of each sample is its Hessian hᵢ:

Define rank function: r(z) = (1/Σhᵢ) · Σ_{xᵢ < z} hᵢ

Candidate splits: {s₁, s₂, …, s_l} with |r(sⱼ) − r(sⱼ₊₁)| < ε, s₁ = min xᵢ, s_l = max xᵢ; ε controls approximation fineness (roughly 1/ε candidates)

Samples with high Hessian (uncertain predictions) contribute more to the quantile computation — they deserve more split candidates in their region. This is the weighted quantile sketch.

Compared to the simpler unweighted binning used by LightGBM, XGBoost's weighted sketch places more candidates where the loss curvature is concentrated, at the cost of a more complex implementation.
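
The idea of Hessian-weighted candidates can be illustrated with a simplified, exact (non-streaming) version. The real sketch is a mergeable approximate data structure; the function below is only an in-memory illustration that assumes distinct feature values, and its name is hypothetical.

```python
def weighted_quantile_candidates(x, h, eps):
    """Pick split candidates so that consecutive ones are about eps apart in Hessian-weighted rank."""
    pairs = sorted(zip(x, h))
    total = sum(h)
    candidates = [pairs[0][0]]       # always include the minimum
    cumulative = 0.0                 # Hessian mass strictly below the current value
    last_rank = 0.0
    for value, weight in pairs:
        rank = cumulative / total    # r(value) as defined above
        if rank - last_rank >= eps:
            candidates.append(value)
            last_rank = rank
        cumulative += weight
    if candidates[-1] != pairs[-1][0]:
        candidates.append(pairs[-1][0])   # always include the maximum
    return candidates

x = [float(v) for v in range(1, 11)]
h = [0.01] * 5 + [0.25] * 5          # the upper half of the range carries most Hessian mass
candidates = weighted_quantile_candidates(x, h, eps=0.2)
print(candidates)
```

Most candidates fall in the upper half of the range, where the Hessian mass sits, even though the samples themselves are evenly spread.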

Two modes:

tree_method='approx': re-runs the sketch before each tree, using the current Hessians as weights
tree_method='hist': computes the quantile bins once before training (like LightGBM), using only user-provided sample weights; faster

5.3 Sparsity-Aware Split Finding

When features are sparse (many missing values, or zeros that a sparse matrix does not store), XGBoost learns a default direction for each split, i.e. which child receives the instances whose value is missing:

For each feature f and candidate split t:
    Case A: Send all missing values to RIGHT child → compute Gain
    Case B: Send all missing values to LEFT child → compute Gain
    Choose the direction with higher Gain

The learned default direction is stored per node. At prediction time, missing values follow their learned direction — no imputation required.

Why this is powerful: In sparse text or click data, 99% of features are zero for any given sample. Traditional split finding would be O(m) per feature; XGBoost's sparse-aware algorithm skips zeros and runs in O(nnz) — proportional to non-zero entries only.
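
The two-case comparison above can be sketched for a single, fixed threshold. Only the non-missing entries are visited; the statistics of the missing group are recovered by subtracting from the node totals. Names and data are illustrative assumptions.

```python
import math

def split_gain(GL, HL, GR, HR, lam):
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))

def learn_default_direction(x, g, h, threshold, lam=1.0):
    """Try sending missing values right, then left; keep the better direction."""
    G_total, H_total = sum(g), sum(h)   # node totals, known before the scan

    # Simulates sparse storage: only present entries are enumerated
    present = [(xi, gi, hi) for xi, gi, hi in zip(x, g, h) if not math.isnan(xi)]
    GL = sum(gi for xi, gi, _ in present if xi < threshold)
    HL = sum(hi for xi, _, hi in present if xi < threshold)
    G_miss = G_total - sum(gi for _, gi, _ in present)
    H_miss = H_total - sum(hi for _, _, hi in present)

    gain_right = split_gain(GL, HL, G_total - GL, H_total - HL, lam)
    gain_left = split_gain(GL + G_miss, HL + H_miss,
                           G_total - GL - G_miss, H_total - HL - H_miss, lam)
    if gain_left > gain_right:
        return "left", gain_left
    return "right", gain_right

nan = float("nan")
x = [1.0, 2.0, nan, nan, 5.0, 6.0]
g = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0]   # missing rows behave like the small-x rows
h = [1.0] * 6
direction, gain = learn_default_direction(x, g, h, threshold=3.5)
print(direction, round(gain, 4))
```

Here the missing rows share the gradient sign of the left group, so sending them left yields the higher gain.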

6. System Engineering Innovations

The XGBoost paper attributed roughly equal importance to algorithmic and systems innovations. The systems work made the algorithm practical at scale.

6.1 Column Block and Cache Access

XGBoost stores data in column blocks: for each feature, the values are kept in sorted order together with the row index of each entry, stored contiguously in memory (a compressed-column layout).

Column block for feature j:
   sorted values:   [0.01, 0.03, 0.07, 0.12, 0.18, ...]
   sample indices:  [45,   12,   93,   7,    31,   ...]
   gradients:       g₄₅, g₁₂,  g₉₃,  g₇,   g₃₁, ...   (looked up by sample index each round)
   hessians:        h₄₅, h₁₂,  h₉₃,  h₇,   h₃₁, ...   (looked up by sample index each round)

With this layout, split evaluation is a single linear scan over the sorted values that accumulates gradient/Hessian sums as it goes. The feature values are read sequentially, but the gradient statistics must be fetched by row index, which is a non-contiguous access. The paper addresses this with cache-aware prefetching: each thread gathers the statistics into a small buffer before accumulating them, which avoids most cache misses.

The column blocks are computed once at the start of training and reused across all trees. This amortizes the O(m · p · log m) sorting cost over all T trees.
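
The amortization argument can be shown with a toy version of the layout: the sort order per feature is built once, then reused in every boosting round with fresh gradients. This is a minimal sketch with hypothetical helper names; the real blocks are compressed, multi-threaded C++ structures.

```python
def build_column_blocks(rows):
    """For each feature, store the row indices sorted by that feature's value."""
    n_features = len(rows[0])
    blocks = []
    for j in range(n_features):
        order = sorted(range(len(rows)), key=lambda i: rows[i][j])
        blocks.append(order)
    return blocks

def prefix_stats(block, grad, hess):
    """Running (G_left, H_left) after each position of the sorted scan."""
    stats = []
    GL = HL = 0.0
    for row in block:            # gradient statistics are gathered by row index
        GL += grad[row]
        HL += hess[row]
        stats.append((GL, HL))
    return stats

rows = [[3.0, 0.1], [1.0, 0.9], [2.0, 0.5]]
blocks = build_column_blocks(rows)       # sorting happens once, before training

hess = [1.0, 1.0, 1.0]
for grad in ([-1.0, 0.5, 0.5], [0.2, -0.4, 0.2]):   # two boosting rounds, new gradients
    print(prefix_stats(blocks[0], grad, hess))
```

Only the gradients change between rounds, so the sorted index arrays never need to be rebuilt.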

6.2 Out-of-Core Computation

For datasets larger than RAM, XGBoost partitions data into blocks stored on disk. A background thread pre-fetches the next block while the current block is being processed:

Disk → Block buffer (background thread) → GPU/CPU computation

Two techniques reduce disk I/O further: block compression (columns are compressed on disk and decompressed on the fly by a separate thread) and block sharding (data is spread across multiple disks to raise read throughput). This lets XGBoost train on datasets that do not fit in memory, something sklearn's GBM could not do.

6.3 Parallelism

XGBoost parallelizes within each tree (feature parallelism), not across trees (which is inherently sequential in boosting):

Within-tree parallelism:
    Each CPU thread processes a different feature's column block simultaneously
    → Split gain computation for all features runs in parallel
    → Speed ≈ linear in number of CPU cores for split finding

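For GPU training (tree_method='gpu_hist'; in XGBoost 2.0+ use tree_method='hist' with device='cuda'):

Gradient histograms are built on the GPU, with many threads adding samples' contributions to the bins in parallel
Can give 10–100x speedups over CPU on large datasets, depending on hardware and data shape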

7. Handling Missing Values

XGBoost handles missing values through its sparsity-aware split finding (Section 5.3). In addition:

missing parameter: Specify what value represents "missing" (default: NaN). Any value (e.g., -999, 0) can be treated as missing.
Default direction learning: At each node, XGBoost learns whether missing values should go left or right for maximum gain — this is stored and used at prediction time.
No imputation needed: Pass NaN values directly; XGBoost handles them internally.

import xgboost as xgb
import numpy as np

# NaN values in X_train are handled natively
clf = xgb.XGBClassifier()
clf.fit(X_train, y_train)   # Works with NaN

# Custom missing value marker
clf = xgb.XGBClassifier(missing=-999)
clf.fit(X_train_with_minus999, y_train)

8. Monotonic Constraints and Interaction Constraints

Monotonic Constraints

Force the model to be monotonically increasing (+1) or decreasing (-1) with respect to specific features:

clf = xgb.XGBClassifier(
    monotone_constraints="(1, 0, -1, 0)"  # Feature 0 increasing, feature 2 decreasing
)
# Or as dict (XGBoost 1.6+)
clf = xgb.XGBClassifier(
    monotone_constraints={"age": 1, "debt_ratio": -1}
)

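Implementation: when evaluating a candidate split on a constrained feature, XGBoost compares the leaf values the two children would receive. For an increasing constraint, a split whose right child value would be lower than the left child's is rejected, and the next best candidate is considered. The children's values are also bounded so that splits deeper in the tree cannot break the ordering.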
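
The per-split check can be sketched using the closed-form leaf value from Section 3.4. This is a simplified illustration with a hypothetical function name; it covers only the comparison of the two children and not the propagation of bounds to deeper nodes.

```python
def leaf_weight(G, H, lam=1.0):
    return -G / (H + lam)

def split_allowed(GL, HL, GR, HR, constraint, lam=1.0):
    """constraint: +1 (increasing), -1 (decreasing), 0 (unconstrained)."""
    w_left = leaf_weight(GL, HL, lam)
    w_right = leaf_weight(GR, HR, lam)
    if constraint > 0:
        return w_left <= w_right
    if constraint < 0:
        return w_left >= w_right
    return True

# Left child would predict +1.0, right child -1.0: a decreasing relationship
print(split_allowed(GL=-2.0, HL=1.0, GR=2.0, HR=1.0, constraint=+1))
print(split_allowed(GL=-2.0, HL=1.0, GR=2.0, HR=1.0, constraint=-1))
```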

Interaction Constraints

Restrict which features may interact, i.e. appear together along the same root-to-leaf path:

# Feature group 0: [0, 1, 2]  Feature group 1: [3, 4, 5]
# A branch that splits on a feature from one group may only split further on features from that group
clf = xgb.XGBClassifier(
    interaction_constraints="[[0,1,2],[3,4,5]]"
)

Useful for:

Fairness constraints (prevent model from mixing protected and non-protected features)
Domain-specific independence requirements
Debugging which feature groups matter
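
One way to picture the constraint is as a filter on the features a node may split on, given the features already used on its path. The sketch below is an interpretation under stated assumptions (any feature is allowed at the root; a feature is allowed deeper down if some group contains it together with every feature on the path); the function name is hypothetical.

```python
def allowed_features(path_features, groups, n_features):
    """Features a node may split on, given the features used by its ancestors."""
    if not path_features:
        return set(range(n_features))      # assumption: the root is unrestricted
    allowed = set()
    for group in groups:
        if set(path_features) <= set(group):
            allowed |= set(group)
    return allowed

groups = [[0, 1, 2], [3, 4, 5]]
print(allowed_features([], groups, n_features=6))
print(allowed_features([0], groups, n_features=6))
print(allowed_features([3, 5], groups, n_features=6))
```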

9. XGBoost for Multi-Class and Ranking

Multi-Class

# Softmax for multi-class probabilities
clf = xgb.XGBClassifier(
    objective='multi:softmax',   # Returns class labels
    num_class=5
)
# Or
clf = xgb.XGBClassifier(
    objective='multi:softprob',  # Returns class probabilities
    num_class=5
)

Like sklearn GBM, XGBoost trains K trees per round for K-class problems. The gradients are computed from the multinomial log-loss.
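
For intuition, the per-class gradients of the multinomial log loss can be computed by hand. The sketch below uses the common diagonal Hessian approximation p_k(1 − p_k); the library's internal scaling of the Hessian may differ, so treat the exact values as illustrative.

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_grad_hess(scores, label):
    """Per-class gradient p_k - y_k and diagonal Hessian p_k * (1 - p_k)."""
    p = softmax(scores)
    grad = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    hess = [pk * (1.0 - pk) for pk in p]
    return grad, hess

# One sample, three classes, all raw scores equal, true class = 1
grad, hess = softmax_grad_hess([0.0, 0.0, 0.0], label=1)
print(grad)
print(hess)
```

Each of the K trees in a round is fit to one entry of these per-class gradient/Hessian vectors.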

Ranking (LambdaMART)

dtrain = xgb.DMatrix(X_train, label=y_relevance, qid=query_ids)

params = {
    'objective': 'rank:pairwise',   # LambdaRank
    'eval_metric': 'ndcg',
    'lambdarank_num_pair_per_sample': 8
}
model = xgb.train(params, dtrain)

XGBoost implements LambdaMART — one of the strongest learning-to-rank algorithms. Used in search engines and recommendation systems.

10. Hyperparameters — Complete Reference

10.1 General Parameters

Parameter	Description	Default
`booster`	`'gbtree'`, `'gblinear'`, `'dart'`	`gbtree`
`nthread`	Number of parallel threads	max
`verbosity`	0 (silent) to 3 (debug)	1
`seed`	Random seed	0

10.2 Booster Parameters (gbtree)

Parameter	Description	Default	Notes
`learning_rate` (eta)	Shrinkage — most important param	0.3	Typical: 0.01–0.1
`n_estimators`	Number of boosting rounds	100	Use early stopping
`max_depth`	Maximum tree depth	6	Typical: 3–8
`min_child_weight`	Min Hessian sum per leaf	1	Key regularizer for noise
`gamma`	Min gain for a split	0	0–20; prunes low-gain splits
`subsample`	Row sampling fraction per tree	1.0	0.5–0.9 typical
`colsample_bytree`	Feature fraction per tree	1.0	0.5–0.9 typical
`colsample_bylevel`	Feature fraction per depth level	1.0	Additional randomization
`colsample_bynode`	Feature fraction per split node	1.0	Most granular; like RF
`reg_alpha`	L1 regularization on leaf weights	0	For sparse feature importance
`reg_lambda`	L2 regularization on leaf weights	1.0	Most important L2 regularizer
`max_delta_step`	Max absolute leaf value (helps class imbalance)	0	Set 1–10 for severe imbalance
`tree_method`	`'exact'`, `'approx'`, `'hist'` (`'gpu_hist'` is deprecated since 2.0 in favor of `device='cuda'`)	`'auto'` (same as `'hist'` since 2.0)	`'hist'` for large data and GPU
`scale_pos_weight`	Positive class weight for imbalance	1	Set to neg/pos ratio
`grow_policy`	`'depthwise'` or `'lossguide'`	`depthwise`	`lossguide` = leaf-wise (LGB)
`max_leaves`	Max leaves (only for `lossguide`)	0	Like LightGBM's num_leaves

10.3 Learning Task Parameters

Parameter	Description	Default
`objective`	Loss function (see table below)	`reg:squarederror`
`eval_metric`	Metric for evaluation/early stopping	auto
`base_score`	Initial prediction (global bias)	0.5 in older releases; estimated from the training data for common objectives in recent releases
`seed`	Random seed	0

Common objectives:

Objective	Task
`binary:logistic`	Binary classification (probs)
`binary:logitraw`	Binary classification (log-odds)
`multi:softmax`	Multi-class (class labels)
`multi:softprob`	Multi-class (probabilities)
`reg:squarederror`	Regression (MSE)
`reg:absoluteerror`	Regression (MAE)
`reg:pseudohubererror`	Regression (Huber)
`reg:quantileerror`	Quantile regression
`rank:pairwise`	Ranking (LambdaRank)
`rank:ndcg`	Ranking (LambdaNDCG)
`survival:cox`	Survival analysis
Custom function	Any twice-differentiable loss

11. Feature Importance Types

XGBoost provides five built-in importance types (weight, gain, and cover, plus summed variants of the last two):

# Weight — number of times a feature is used in a split
clf.get_booster().get_score(importance_type='weight')

# Gain — average training loss reduction when feature is used in splits
clf.get_booster().get_score(importance_type='gain')   # Most informative

# Cover: average coverage (sum of Hessians of the samples reaching the node) of splits using this feature
clf.get_booster().get_score(importance_type='cover')

# Total gain / total cover (sum instead of average)
clf.get_booster().get_score(importance_type='total_gain')
clf.get_booster().get_score(importance_type='total_cover')

Recommendation: Use gain as the default. weight is biased toward features with many possible split values (continuous features). For production interpretability, SHAP values are generally preferred over all built-in metrics because they attribute each individual prediction consistently.
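
How the five types relate can be seen on a toy list of splits. The records below are made up, and the `importance` helper is a hypothetical re-implementation of the aggregation, not a call into XGBoost.

```python
from collections import defaultdict

# Hypothetical split records collected from all trees: (feature, gain, cover)
splits = [("age", 12.0, 400.0), ("age", 3.0, 150.0), ("income", 20.0, 250.0)]

def importance(splits, importance_type):
    count = defaultdict(int)
    gain = defaultdict(float)
    cover = defaultdict(float)
    for feature, split_gain, split_cover in splits:
        count[feature] += 1
        gain[feature] += split_gain
        cover[feature] += split_cover
    if importance_type == "weight":
        return dict(count)
    if importance_type == "total_gain":
        return dict(gain)
    if importance_type == "total_cover":
        return dict(cover)
    if importance_type == "gain":
        return {f: gain[f] / count[f] for f in count}
    if importance_type == "cover":
        return {f: cover[f] / count[f] for f in count}
    raise ValueError(f"unknown importance type: {importance_type}")

for kind in ("weight", "gain", "total_gain", "cover", "total_cover"):
    print(kind, importance(splits, kind))
```

In this toy case "age" ranks first by weight while "income" ranks first by gain, which is why the choice of type matters.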

import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)
shap.plots.waterfall(explainer(X_test)[0])   # Explanation for the first test row

12. Dart Booster

DART (Dropouts meet Multiple Additive Regression Trees) — XGBoost's variant that applies dropout (from neural networks) to gradient boosting:

clf = xgb.XGBClassifier(
    booster='dart',
    rate_drop=0.1,     # Fraction of trees to drop per round
    skip_drop=0.5,     # Probability of skipping dropout for a round
    sample_type='uniform',   # Or 'weighted'
    normalize_type='tree'    # Or 'forest'
)

Mechanism: During each boosting round, randomly drop a subset of existing trees and train the new tree on the residuals of the remaining trees. The dropped trees are then re-added with rescaled weights.

Effect: Prevents any single tree from being over-relied upon — a form of ensemble regularization. Often achieves better generalization than gbtree on noisy datasets.
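
The drop-and-rescale step can be sketched as below. The scale factors follow the description of normalize_type in the XGBoost parameter documentation as I understand it (k is the number of dropped trees); the function names and the random drop selection are simplified assumptions.

```python
import random

def select_dropped_trees(n_trees, rate_drop, rng):
    """Each existing tree is dropped independently with probability rate_drop."""
    return [i for i in range(n_trees) if rng.random() < rate_drop]

def dart_scale_factors(k, learning_rate, normalize_type="tree"):
    """Return (weight of the new tree, rescaling factor for the k dropped trees)."""
    if k < 1:
        raise ValueError("k must be at least 1 when dropout is applied")
    if normalize_type == "tree":
        return 1.0 / (k + learning_rate), k / (k + learning_rate)
    if normalize_type == "forest":
        return 1.0 / (1.0 + learning_rate), 1.0 / (1.0 + learning_rate)
    raise ValueError(f"unknown normalize_type: {normalize_type}")

rng = random.Random(0)
dropped = select_dropped_trees(n_trees=50, rate_drop=0.1, rng=rng)
print(f"dropped {len(dropped)} of 50 trees")
print(dart_scale_factors(k=3, learning_rate=0.1, normalize_type="tree"))
print(dart_scale_factors(k=3, learning_rate=0.1, normalize_type="forest"))
```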

Caveats:

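Early stopping is unreliable with DART: later rounds rescale earlier trees, so truncating at the best iteration does not recover the model that produced the best score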
Prediction is slower (must handle dropped trees)
Rarely a large improvement over well-tuned gbtree + subsample

13. Linear Booster

XGBoost can use a linear model as the base learner instead of trees:

clf = xgb.XGBClassifier(
    booster='gblinear',
    reg_alpha=0.1,   # L1 (lasso)
    reg_lambda=1.0,  # L2 (ridge)
    updater='shotgun'  # Or 'coord_descent'
)

When it's useful: High-dimensional sparse data (text classification) where linear models are appropriate and tree-based splits provide no advantage. Effectively implements regularized logistic regression via boosting. Rarely competitive with LightGBM or sklearn's SGD classifiers in this regime.

14. The Bias-Variance Profile

XGBoost's second-order approximation and explicit regularization (γ, λ, α, min_child_weight) give it finer-grained bias-variance control than sklearn GBC:

Low learning_rate + low n_estimators  → high bias (underfits)
Low learning_rate + high n_estimators → low bias, needs regularization to control variance
High gamma                            → high bias (aggressive pruning)
High lambda / min_child_weight        → reduced variance (smoother leaf values)
Low subsample / colsample             → more variance reduction (stochastic boosting)

Empirically:

Typical well-performing ranges (starting points for tuning, not guarantees):
  learning_rate: 0.01–0.05
  n_estimators:  500–3000 (found via early stopping)
  max_depth:     4–8
  min_child_weight: 1–10 (tune this — it's often the most impactful after LR)
  subsample:     0.7–0.9
  colsample_bytree: 0.5–0.8
  reg_lambda:    0.5–5.0

15. Assumptions

Assumption	Notes
Twice-differentiable loss	Required for gradient AND Hessian
IID samples	Standard supervised learning assumption
No feature scaling needed	Tree splits are scale-invariant
No distributional assumption	Non-parametric — no normality or linearity required
No extrapolation	Tree-based — flat outside training range
Moderate noise tolerance	Better than AdaBoost: log-loss gradients are bounded (|g| ≤ 1), so mislabeled samples cannot dominate the way exponential-loss weights can

16. Advantages

✅ Best-in-Class Accuracy (Tabular Data)

Consistently achieves top performance on tabular ML benchmarks. The standard to beat.

✅ Second-Order Approximation

More accurate leaf values and split decisions than first-order GBM. Principled regularization via the split gain formula.

✅ Flexible Loss Functions

Any twice-differentiable loss — including fully custom Python objectives.

✅ GPU Training

With a CUDA device (device='cuda' in XGBoost 2.0+, tree_method='gpu_hist' in older releases), training can be 10–100x faster on large datasets.

✅ Native Missing Value Handling

Sparsity-aware split finding — no imputation, learns optimal default directions.

✅ Rich Regularization

γ (pruning), λ (L2), α (L1), min_child_weight, subsample, colsample — multiple orthogonal regularization axes.

✅ Multiple Feature Importance Types

Weight, gain, cover — plus full SHAP support via shap.TreeExplainer.

✅ Extensive Ecosystem

sklearn API via XGBClassifier, DMatrix native API, Spark/Dask/Ray integration, ONNX export, cuML compatibility.

✅ Early Stopping

Built-in with eval_set and early_stopping_rounds.

✅ Monotonic and Interaction Constraints

For regulated or domain-constrained models.

17. Drawbacks & Limitations

❌ Slower Than LightGBM on Large Datasets

LightGBM's leaf-wise growth and GOSS sampling are often faster than XGBoost's default depth-wise approach at scale (> 500k rows). XGBoost's GPU histogram training closes this gap on GPU.

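❌ Limited Native Categorical Support

Native categorical handling (enable_categorical=True with pandas category columns and a histogram-based tree method) arrived late and was long flagged experimental, so many pipelines still one-hot or ordinal encode categoricals manually. CatBoost handles them natively and often outperforms when categoricals dominate.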

❌ Many Hyperparameters to Tune

More than sklearn GBM, though the defaults are reasonable. Tuning XGBoost well requires understanding the interaction between eta, max_depth, min_child_weight, gamma, and the regularization parameters.

❌ Sequential Training

Like all boosting — each tree depends on the previous. Cannot parallelize across trees. Internal feature parallelism helps but doesn't scale as linearly as Random Forest.

❌ No Extrapolation

Flat predictions outside the training range, a limitation inherited from decision trees.

❌ Memory for Column Blocks

Storing sorted column blocks requires O(m · p) additional memory — can be 2–3x the raw data size.

18. XGBoost vs. LightGBM vs. CatBoost

Property	XGBoost	LightGBM	CatBoost
Speed (CPU)	Fast	✅✅ Fastest	Fast
Speed (GPU)	✅ Fast (gpu_hist)	✅ Fast	✅ Fast
Memory	High (column blocks)	Low (histograms)	Moderate
Tree growth	Depth-wise (default)	Leaf-wise	Oblivious (symmetric)
2nd order (Hessian)	✅ Yes	✅ Yes	⚠️ Leaf values only (Newton step)
Categorical features	⚠️ Opt-in (`enable_categorical`)	✅ Native (histogram-based partitions)	✅✅ Native ordered
Missing values	✅ Native (sparse)	✅ Native	✅ Native
Monotonic constraints	✅ Yes	✅ Yes	✅ Yes
Custom loss	✅ Yes (grad+hess)	✅ Yes	✅ Yes
Regularization	✅ Rich (γ,λ,α,mcw)	✅ Good	✅ Good
SHAP support	✅ Best (TreeExplainer)	✅ Very good	✅ Good
sklearn API	✅ XGBClassifier	✅ LGBMClassifier	✅ CatBoostClassifier
Production maturity	✅✅ Very high	✅ High	✅ High
Best for	General, imbalanced	Very large data	Categorical-heavy

19. Practical Tips & Gotchas

Canonical Fast Setup (sklearn API)

import xgboost as xgb

clf = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=5,
    subsample=0.8,
    colsample_bytree=0.7,
    reg_alpha=0.1,
    reg_lambda=2.0,
    scale_pos_weight=1,          # Adjust for class imbalance
    tree_method='hist',          # Fast for medium+ datasets
    eval_metric='logloss',
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42
)

clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=100
)
print(f"Best round: {clf.best_iteration}, Best score: {clf.best_score}")

Native DMatrix API (Faster for Large Data)

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval   = xgb.DMatrix(X_val,   label=y_val, feature_names=feature_names)  # names must match dtrain
dtest  = xgb.DMatrix(X_test,  feature_names=feature_names)

params = {
    'objective':        'binary:logistic',
    'eval_metric':      'auc',
    'eta':              0.05,
    'max_depth':        6,
    'min_child_weight': 5,
    'subsample':        0.8,
    'colsample_bytree': 0.7,
    'reg_lambda':       2.0,
    'reg_alpha':        0.1,
    'tree_method':      'hist',
    'seed':             42
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
    verbose_eval=100
)

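# Predict with the trees up to the best early-stopping round
preds = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))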

Custom Objective Function

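import numpy as np

def focal_loss_objective(y_pred, dtrain):
    """Focal loss for binary labels: downweights easy examples for class imbalance.

    y_pred is treated as raw margins (log-odds).
    """
    y_true = dtrain.get_label()
    gamma = 2.0
    alpha = 0.25

    p = 1 / (1 + np.exp(-y_pred))
    # pt = probability assigned to the true class, clipped to keep log() finite
    pt = np.clip(np.where(y_true == 1, p, 1 - p), 1e-7, 1 - 1e-7)
    at = np.where(y_true == 1, alpha, 1 - alpha)
    sign = 2 * y_true - 1            # d(pt)/d(margin) = sign * pt * (1 - pt)
    log_pt = np.log(pt)

    # Gradient of -at * (1 - pt)**gamma * log(pt) with respect to the margin
    u = gamma * pt * log_pt + pt - 1
    grad = sign * at * (1 - pt)**gamma * u

    # Exact second derivative. It turns negative for small pt (focal loss is
    # not convex in the margin), so clip it to keep leaf values well defined.
    hess = at * pt * (1 - pt)**gamma * ((1 - pt) * (gamma * log_pt + gamma + 1) - gamma * u)
    hess = np.maximum(hess, 1e-6)

    return grad, hess

model = xgb.train(params, dtrain, obj=focal_loss_objective)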

Class Imbalance

# Method 1: scale_pos_weight (simplest)
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
clf = xgb.XGBClassifier(scale_pos_weight=neg_count/pos_count)

# Method 2: max_delta_step (sometimes helps with severe imbalance)
clf = xgb.XGBClassifier(max_delta_step=1)

# Method 3: Adjust decision threshold post-hoc
from sklearn.metrics import precision_recall_curve
probs = clf.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, probs)
# Pick threshold that maximizes F1
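
The threshold-selection step left open above could look like the following. To stay self-contained, this sketch sweeps thresholds in plain Python rather than reusing the sklearn arrays; the function name and the validation data are made up.

```python
def best_f1_threshold(y_true, probs):
    """Try every observed probability as a threshold (predict 1 if p >= t)."""
    best_threshold, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, probs) if p < t and y == 1)
        if tp == 0:
            continue                      # F1 is zero or undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1

# Made-up validation labels and predicted probabilities
y_val_demo = [0, 0, 0, 1, 0, 1, 1]
probs_demo = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80]
threshold, f1 = best_f1_threshold(y_val_demo, probs_demo)
print(threshold, round(f1, 3))
```

The threshold should be chosen on validation data and then held fixed for test-time evaluation.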

GPU Training

clf = xgb.XGBClassifier(
    tree_method='hist',       # Histogram algorithm
    device='cuda',            # XGBoost 2.0+; older releases used tree_method='gpu_hist' instead
    n_estimators=2000,
    early_stopping_rounds=50  # Requires eval_set in fit()
)

20. When to Use It

Use XGBoost when:

You need state-of-the-art accuracy on tabular data with a robust, battle-tested library
You have custom loss functions or novel objectives (XGBoost's custom obj API is mature)
Class imbalance is present — scale_pos_weight and max_delta_step are well-tested
You need SHAP explanations at scale — TreeExplainer support is excellent
Ranking tasks — LambdaMART is production-quality
Dataset is medium to large (10k–50M rows)
GPU training is available and the dataset warrants it
You need fine-grained regularization control (γ, λ, α, min_child_weight)
You need production stability — XGBoost has the largest deployment history

Consider LightGBM instead when:

Dataset is very large (> 5M rows) and CPU training is needed
Memory is constrained
Training speed is the bottleneck

Consider CatBoost instead when:

Categorical features dominate and automatic encoding matters
You want less hyperparameter tuning (CatBoost defaults are very strong)

Summary

┌─────────────────────────────────────────────────────────────────────┐
│                    XGBOOST AT A GLANCE                              │
├─────────────────────────────────────────────────────────────────────┤
│  CORE MATH    2nd-order Taylor expansion → closed-form split gain   │
│  SPLIT GAIN   ½[GL²/(HL+λ) + GR²/(HR+λ) − G²/(H+λ)] − γ          │
│  LEAF VALUE   w* = −G / (H + λ)                                     │
│  REGULARIZE   γ (pruning), λ (L2), α (L1), min_child_weight        │
│  MISSING      Sparsity-aware: learns default direction per split    │
│  GPU          device='cuda' (formerly gpu_hist): 10–100x speedup    │
│  BEST PARAMS  LR=0.01–0.05 + n_est via ES + max_depth=4–8          │
│  STRENGTH     Accuracy, flexibility, regularization, SHAP, ranking  │
│  WEAKNESS     Slower than LGB at scale, weaker categorical support  │
│  BEST FOR     General-purpose tabular, custom objectives, ranking   │
└─────────────────────────────────────────────────────────────────────┘

XGBoost is the algorithm that taught the ML community that engineering and mathematics are not separate concerns — they compound. The second-order Taylor expansion is mathematically elegant; the column block is a systems insight; the weighted quantile sketch bridges both. The result was an algorithm that was simultaneously more principled and faster than its predecessors — proving that theoretical depth and engineering pragmatism reinforce each other. Every competitive ML practitioner needs to understand it at this level.