XGBoost

eXtreme Gradient Boosting

"The algorithm that won everything, until it didn't, and then kept winning anyway."


1. What Is XGBoost?

XGBoost (eXtreme Gradient Boosting) is an optimized, scalable gradient boosting library, first released by Tianqi Chen in 2014 and described by Chen and Carlos Guestrin (University of Washington) in their 2016 paper. It implements the gradient boosting framework with three major contributions over Friedman's original:

  1. Mathematical: Second-order Taylor expansion of the loss → more accurate leaf values and a closed-form split gain formula

  2. Algorithmic: Weighted quantile sketch for approximate split finding on large data; sparsity-aware split finding for missing values

  3. Systems: Column blocks for cache-efficient access; out-of-core computation for datasets larger than RAM; multi-threading across features

XGBoost dominated competitive machine learning from 2016 to 2018 and remains one of the most widely deployed ML algorithms in production systems. It is among the libraries most frequently credited in winning Kaggle solutions for structured/tabular competitions.

2. Historical Context and Impact

Before XGBoost, gradient boosting was available in sklearn's GradientBoostingClassifier: correct but slow, which limited practical use to smaller datasets.

XGBoost changed this with a combination of mathematical refinement and engineering excellence:

2014: Chen releases first XGBoost implementation
2015: 17 of the 29 winning solutions published on Kaggle's blog that year used XGBoost (as reported in the paper)
2016: SIGKDD paper published: "XGBoost: A Scalable Tree Boosting System"
2017: LightGBM paper published (Microsoft); faster on very large datasets
2017: CatBoost released (Yandex); better categorical handling
2019: XGBoost adds GPU histogram (tree_method='gpu_hist'), closing the LightGBM speed gap
2023: XGBoost 2.0 released: rewritten device layer, Apple M1 support, improved API

The paper is one of the most cited in applied ML, with over 30,000 citations. "Have you tried XGBoost?" became the standard first question for any tabular data problem.


3. Core Mathematical Innovation: Second-Order Approximation

This is XGBoost's most important contribution. Standard GBM fits trees to first-order pseudo-residuals. XGBoost uses a second-order Taylor expansion for a fundamentally more principled objective.

3.1 The Objective Function

At boosting round t, XGBoost minimizes:

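Obj^(t) = Σᵢ L(yᵢ, ŷᵢ^(t)) + Σₜ Ω(fₜ)

Where L is the training loss, ŷᵢ^(t) is the prediction for sample i after round t, and Ω(f) = γT + ½λΣwⱼ² is the complexity penalty of a tree f with T leaves and leaf values wⱼ.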

The regularization term γT penalizes the number of leaves (encourages simpler trees), while ½λΣwⱼ² penalizes large leaf values (L2 shrinkage).


3.2 Taylor Expansion of the Loss

Since we're adding a new tree fₜ to the existing model ŷᵢ^(t-1):

ŷᵢ^(t) = ŷᵢ^(t-1) + fₜ(xᵢ)

Expand the loss around the current predictions using a second-order Taylor expansion:

L(yᵢ, ŷᵢ^(t)) ≈ L(yᵢ, ŷᵢ^(t-1)) + gᵢ·fₜ(xᵢ) + ½hᵢ·fₜ(xᵢ)²

Where:

gᵢ = ∂L(yᵢ, ŷᵢ^(t-1)) / ∂ŷᵢ^(t-1)     (first-order gradient)
hᵢ = ∂²L(yᵢ, ŷᵢ^(t-1)) / ∂(ŷᵢ^(t-1))² (second-order gradient / Hessian)

For log loss (binary classification):

p = sigmoid(ŷ)
g = p − y          (prediction error in probability space)
h = p(1 − p)       (variance of Bernoulli: weight of this sample)

The Hessian h = p(1−p) is the variance of the Bernoulli distribution, so samples near the decision boundary (p ≈ 0.5) have a high Hessian and get upweighted in split finding. This gives XGBoost a natural focus on uncertain examples.
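The two derivatives above are easy to check numerically. The following is a minimal sketch in plain Python; the function name `logistic_grad_hess` is illustrative and not part of the XGBoost API.

```python
import math

def logistic_grad_hess(raw_score, label):
    """Return (g, h) of binary log loss with respect to the raw score (log-odds)."""
    p = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid
    g = p - label                           # first derivative
    h = p * (1.0 - p)                       # second derivative, largest at p = 0.5
    return g, h

# A positive example scored from "confidently wrong" to "confidently right".
for score in (-4.0, -1.0, 0.0, 1.0, 4.0):
    g, h = logistic_grad_hess(score, label=1)
    print(f"score={score:+.1f}  g={g:+.4f}  h={h:.4f}")
```

The printed Hessian peaks at a raw score of 0 (p = 0.5) and shrinks toward both extremes, which is the weighting effect described above.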


3.3 The Simplified Objective

Dropping constants (terms independent of fₜ), the objective to minimize at round t is:

Obj^(t) ≈ Σᵢ [gᵢ·fₜ(xᵢ) + ½hᵢ·fₜ(xᵢ)²] + Ω(fₜ)

For a tree with T leaves, where the sample set of leaf j is Iⱼ:

Obj^(t) = Σⱼ₌₁ᵀ [(Σᵢ∈Iⱼ gᵢ)·wⱼ + ½(Σᵢ∈Iⱼ hᵢ + λ)·wⱼ²] + γT

This is a sum of independent quadratics in the leaf values wⱼ, so each can be optimized independently.


3.4 Optimal Leaf Values

For each leaf j, the objective is quadratic in wⱼ. Setting the derivative to zero:

∂Obj/∂wⱼ = Gⱼ + (Hⱼ + λ)·wⱼ = 0

→  w*ⱼ = −Gⱼ / (Hⱼ + λ)

Where Gⱼ = Σᵢ∈Iⱼ gᵢ and Hⱼ = Σᵢ∈Iⱼ hᵢ are the sums of gradients and Hessians in leaf j.

Interpretation: w*ⱼ is a Newton step. The leaf moves against the summed gradient, scaled by the summed curvature, and λ shrinks the value toward zero.

Substituting back:

Obj*(leaf j) = −½ · Gⱼ² / (Hⱼ + λ)

The optimal objective value for leaf j is −½ Gⱼ²/(Hⱼ + λ), the "score" of a leaf. The more signal concentrated in a leaf, the lower (better) the objective.
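As a worked check of the closed form, the sketch below computes the optimal value and score of one toy leaf and confirms that nudging the value in either direction makes the per-leaf objective worse. The helper names and the toy gradients are assumptions made for illustration.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf value w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Objective value reached at w*: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * G * G / (H + lam)

def leaf_objective(w, G, H, lam):
    """Per-leaf quadratic objective: G*w + 0.5*(H + lambda)*w^2."""
    return G * w + 0.5 * (H + lam) * w * w

# Toy leaf: per-sample gradients and Hessians (illustrative numbers).
grads = [-0.5, -0.4, -0.6, 0.1]
hess = [0.25, 0.24, 0.24, 0.09]
G, H, lam = sum(grads), sum(hess), 1.0

w_star = leaf_weight(G, H, lam)
print("w* =", round(w_star, 4), " score =", round(leaf_score(G, H, lam), 4))
print("objective at w*      :", round(leaf_objective(w_star, G, H, lam), 4))
print("objective at w* + 0.1:", round(leaf_objective(w_star + 0.1, G, H, lam), 4))
```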


3.5 The Split Gain Formula

The gain of splitting leaf j into left and right children:

Gain = ½ · [GL²/(HL + λ) + GR²/(HR + λ) − G²/(H + λ)] − γ

Where GL, GR, HL, HR are gradient/Hessian sums in left/right children and G = GL + GR, H = HL + HR.

This formula is XGBoost's most important algorithmic contribution: it scores any candidate split in closed form from the gradient and Hessian sums alone, and γ charges a fixed cost for every added leaf.

For each candidate split, XGBoost evaluates this formula and picks the split with maximum gain. If Gain < 0 for all splits, the node becomes a leaf.
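The arithmetic of the gain formula can be sketched in a few lines. The function below is an illustrative helper, not XGBoost's internal code; the numbers are chosen so the result is easy to verify by hand.

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain = 0.5 * [GL^2/(HL+lam) + GR^2/(HR+lam) - G^2/(H+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Left child pulls predictions one way, right child the other:
# each child scores 4 / (1 + 1) = 2, the parent scores 0, so gain = 0.5 * 4 = 2.
print(split_gain(GL=-2.0, HL=1.0, GR=2.0, HR=1.0))             # 2.0
# The same split no longer pays for itself once gamma exceeds 2.
print(split_gain(GL=-2.0, HL=1.0, GR=2.0, HR=1.0, gamma=3.0))  # -1.0
```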


4. Regularization in XGBoost

XGBoost has more explicit regularization than any standard GBM:

| Parameter | Effect | Default |
|---|---|---|
| gamma (γ) | Minimum gain to split a node; larger = more pruning | 0 |
| lambda (λ) | L2 regularization on leaf values; reduces overfitting | 1.0 |
| alpha (α) | L1 regularization on leaf values; sparsifies leaf values | 0 |
| max_depth | Maximum tree depth | 6 |
| min_child_weight | Minimum Hessian sum required in a child leaf; prevents tiny splits | 1 |
| subsample | Row sampling per tree (stochastic boosting) | 1.0 |
| colsample_bytree | Column sampling per tree | 1.0 |
| colsample_bylevel | Column sampling per tree level | 1.0 |
| colsample_bynode | Column sampling per split node | 1.0 |
| learning_rate (η) | Shrinkage factor | 0.3 |

min_child_weight requires the sum of Hessians in a child leaf to be ≥ min_child_weight. For squared error every Hessian is 1, so this is exactly a minimum sample count; for log loss, where the Hessian is p(1−p), it approximates a minimum "effective sample count" in each leaf. It's one of the most effective regularization parameters.
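To make the "effective sample count" reading concrete, here is a small sketch for log loss. The helper `child_allowed` is hypothetical; it only mirrors the Hessian-sum rule described above.

```python
def child_allowed(probabilities, min_child_weight):
    """Return (allowed, hessian_sum) for a child whose samples have these predicted probabilities."""
    hessian_sum = sum(p * (1.0 - p) for p in probabilities)
    return hessian_sum >= min_child_weight, hessian_sum

# Five uncertain samples carry more Hessian weight than five confident ones.
print(child_allowed([0.5] * 5, min_child_weight=1.0))   # (True, 1.25)
print(child_allowed([0.99] * 5, min_child_weight=1.0))  # rejected: Hessian sum is about 0.05
```

Both children hold five rows, but only the uncertain one passes, so the threshold bites hardest on leaves that isolate already well-fit samples.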


5. Tree Growing Strategies

5.1 Exact Greedy Algorithm

For small-to-medium datasets, XGBoost evaluates every possible split on every feature:

For each tree level:
    For each feature f:
        Sort instances by feature value
        For each candidate threshold:
            Compute Gain(f, threshold) using split gain formula
        Return best (f, threshold)

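Complexity per tree: O(d · p · m log m) where p = features, d = depth, m = samples, if the data is re-sorted at each level.

This is the same asymptotic cost as sklearn's GradientBoostingClassifier, but XGBoost's column block structure (Section 6.1) sorts each feature only once and makes the scan cache-friendly, so it is significantly faster in practice.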
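The inner two loops of the pseudocode can be sketched for a single feature as follows. This is a simplified illustration under stated assumptions (one feature, no missing values, midpoints as thresholds); `best_split_exact` is not an XGBoost function.

```python
def best_split_exact(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order and return (threshold, gain) of the best split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    parent = G * G / (H + lam)
    best_threshold, best_gain = None, 0.0
    GL = HL = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += grads[i]  # running left-side sums
        HL += hess[i]
        if values[i] == values[nxt]:
            continue    # no threshold fits between equal values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
        if gain > best_gain:
            best_threshold, best_gain = 0.5 * (values[i] + values[nxt]), gain
    return best_threshold, best_gain

# Squared-error toy data: low feature values are under-predicted (g < 0), high ones over-predicted.
values = [0.1, 0.9, 0.2, 0.8]
grads = [-1.0, 1.0, -1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
threshold, gain = best_split_exact(values, grads, hess)
print(threshold, round(gain, 4))
```

The scan picks the midpoint between 0.2 and 0.8, the only threshold that separates the two gradient signs.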


5.2 Approximate Algorithm (Weighted Quantile Sketch)

For large datasets, XGBoost computes approximate quantiles of each feature and evaluates splits only at these quantile boundaries.

The key insight: use weighted quantiles where the weight of each sample is its Hessian hᵢ:

Define rank function: r(z) = (1/Σhᵢ) · Σ_{xᵢ < z} hᵢ

Candidate splits: {z₁, z₂, ...} chosen so that |r(zₖ) − r(zₖ₊₁)| < ε for neighboring candidates, where ε controls approximation fineness (roughly 1/ε candidates per feature)

Samples with high Hessian (uncertain predictions) contribute more to the quantile computation, so their region receives more split candidates. This is the weighted quantile sketch.

Compared to LightGBM's simple equal-frequency binning, XGBoost's weighted sketch is better motivated in theory (it captures uncertainty structure) but more complex to implement.
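The effect of Hessian weighting on candidate placement can be illustrated with an exact, in-memory version of the idea. This is a deliberately simplified sketch: the real weighted quantile sketch is a mergeable streaming summary with error guarantees, which this toy function does not attempt to reproduce.

```python
def weighted_quantile_candidates(values, hess, eps):
    """Pick candidates spaced roughly eps apart in Hessian-weighted rank (exact, not streaming)."""
    pairs = sorted(zip(values, hess))
    total = sum(h for _, h in pairs)
    candidates, cumulative, next_rank = [], 0.0, eps
    for value, h in pairs:
        cumulative += h
        rank = cumulative / total  # weighted fraction of data at or below `value`
        if rank >= next_rank:
            candidates.append(value)
            while next_rank <= rank:
                next_rank += eps
    return candidates

values = [1, 2, 3, 4, 5, 6, 7, 8]
uniform = weighted_quantile_candidates(values, [1.0] * 8, eps=0.25)
# Rows 5..8 are uncertain (h close to 0.25); rows 1..4 are confidently predicted (h close to 0).
weighted = weighted_quantile_candidates(values, [0.01] * 4 + [0.24] * 4, eps=0.25)
print("uniform :", uniform)
print("weighted:", weighted)
```

With equal weights the candidates are spread evenly; with Hessian weights they all land among the uncertain rows.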

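Two modes: global (candidates are proposed once per tree and reused at every level) and local (candidates are re-proposed after each split).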


5.3 Sparsity-Aware Split Finding

When features are sparse (many zeros or NaN values), XGBoost learns a default direction, that is, which child node receives missing/zero values:

For each feature f and candidate split t:
    Case A: Send all missing values to RIGHT child → compute Gain
    Case B: Send all missing values to LEFT child → compute Gain
    Choose the direction with higher Gain

The learned default direction is stored per node. At prediction time, missing values follow their learned direction, so no imputation is required.

Why this is powerful: In sparse text or click data, 99% of features are zero for any given sample. Traditional split finding would be O(m) per feature; XGBoost's sparsity-aware algorithm skips the missing entries and runs in O(nnz), proportional to non-zero entries only.
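The two-case comparison can be sketched for a single candidate threshold. The function below is an assumption-laden illustration (missing values encoded as `None`, one fixed threshold); note that it only iterates over present entries and derives the missing-value statistics by subtraction from the totals.

```python
def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Return (direction, gain) for routing missing values (None) at one candidate split."""
    def score(G, H):
        return G * G / (H + lam)

    G_all, H_all = sum(grads), sum(hess)
    GL = HL = GR = HR = 0.0
    for v, g, h in zip(values, grads, hess):
        if v is None:
            continue  # missing entries are skipped during the scan
        if v < threshold:
            GL += g
            HL += h
        else:
            GR += g
            HR += h
    # Statistics of the missing rows follow from the totals.
    G_miss, H_miss = G_all - GL - GR, H_all - HL - HR
    parent = score(G_all, H_all)
    gain_left = 0.5 * (score(GL + G_miss, HL + H_miss) + score(GR, HR) - parent)
    gain_right = 0.5 * (score(GL, HL) + score(GR + G_miss, HR + H_miss) - parent)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)

# The two missing rows have the same gradient sign as the low-valued rows.
values = [1.0, 2.0, None, None, 8.0, 9.0]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
hess = [1.0] * 6
direction, gain = best_default_direction(values, grads, hess, threshold=5.0)
print(direction, round(gain, 3))
```

Because the missing rows behave like the low-valued rows, sending them left yields the larger gain.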


6. System Engineering Innovations

The XGBoost paper attributed roughly equal importance to algorithmic and systems innovations. The systems work made the algorithm practical at scale.

6.1 Column Block and Cache Access

XGBoost stores data in column blocks: each feature's values (along with gradient and Hessian) in sorted order, stored contiguously in memory.

Column block for feature j:
   sorted values:   [0.01, 0.03, 0.07, 0.12, 0.18, ...]
   sample indices:  [45,   12,   93,   7,    31,   ...]
   gradients:       [g₄₅, g₁₂,  g₉₃,  g₇,   g₃₁, ...]
   hessians:        [h₄₅, h₁₂,  h₉₃,  h₇,   h₃₁, ...]

This layout means split evaluation (accumulating gradient/Hessian sums as you scan the sorted values) is a sequential memory scan, maximally cache-friendly. Accessing random rows (as in the original GBM) causes frequent cache misses; scanning sorted columns avoids them.

The column blocks are computed once at the start of training and reused across all trees. This amortizes the O(m · p · log m) sorting cost over all trees.
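The "sort once, rescan every round" idea can be sketched as follows. The two helpers are illustrative names; a real column block is a compressed in-memory structure rather than a pair of Python lists.

```python
def build_column_block(column):
    """Sort one feature once; return (sorted_values, row_indices)."""
    row_indices = sorted(range(len(column)), key=lambda i: column[i])
    sorted_values = [column[i] for i in row_indices]
    return sorted_values, row_indices

def prefix_gradient_sums(row_indices, grads, hess):
    """Left-side (G, H) sums at every scan position, from one sequential pass."""
    sums, GL, HL = [], 0.0, 0.0
    for i in row_indices:
        GL += grads[i]
        HL += hess[i]
        sums.append((GL, HL))
    return sums

column = [0.12, 0.03, 0.01, 0.07]
block = build_column_block(column)  # built once, before the first tree

# Each boosting round brings new gradients but reuses the same sorted order.
round_grads = [0.5, -0.5, -0.5, 0.5]
round_hess = [0.25, 0.25, 0.25, 0.25]
sums = prefix_gradient_sums(block[1], round_grads, round_hess)
print(block)
print(sums)
```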


6.2 Out-of-Core Computation

For datasets larger than RAM, XGBoost partitions data into blocks stored on disk. A background thread pre-fetches the next block while the current block is being processed:

Disk → Block buffer (background thread) → GPU/CPU computation

With block compression (columns are compressed and row indices are stored as small integer offsets within each block), disk I/O is reduced further. This allows XGBoost to train on datasets that don't fit in memory, a capability that sklearn's GBM lacked entirely.
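The prefetching pattern itself is generic and can be sketched with the standard library. This is not XGBoost's implementation; `prefetching_reader` and `fake_load` are hypothetical names, and `fake_load` stands in for reading and decompressing a block from disk.

```python
import queue
import threading

def prefetching_reader(load_block, block_ids, buffer_size=2):
    """Yield blocks in order while a background thread loads ahead of the consumer."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def worker():
        try:
            for block_id in block_ids:
                buffer.put(load_block(block_id))  # waits while the buffer is full
        finally:
            buffer.put(sentinel)                  # always signal the end of the stream

    threading.Thread(target=worker, daemon=True).start()
    while True:
        block = buffer.get()
        if block is sentinel:
            return
        yield block

def fake_load(block_id):
    """Hypothetical stand-in for a disk read."""
    return [block_id * 10 + k for k in range(3)]

# The consumer processes one block while the next ones are already being loaded.
total = sum(sum(block) for block in prefetching_reader(fake_load, range(4)))
print(total)  # 192
```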


6.3 Parallelism

XGBoost parallelizes within each tree (feature parallelism), not across trees (which is inherently sequential in boosting):

Within-tree parallelism:
    Each CPU thread processes a different feature's column block simultaneously
    → Split gain computation for all features runs in parallel
    → Speed ≈ linear in number of CPU cores for split finding

For GPU training (tree_method='gpu_hist', or tree_method='hist' with device='cuda' in XGBoost 2.0+):

GPU thread = one sample's contribution to a histogram bin
All samples' histogram contributions computed simultaneously on GPU
Can enable 10–100x speedup over CPU for large datasets
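The structure of feature-parallel histogram building can be sketched with a thread pool. This shows only the shape of the computation: in CPython the interpreter lock prevents a real speedup for pure-Python loops, whereas XGBoost does this work in native code. All names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def build_histogram(bin_ids, grads, hess, n_bins):
    """Accumulate gradient and Hessian sums per bin for one pre-binned feature."""
    hist = [[0.0, 0.0] for _ in range(n_bins)]
    for b, g, h in zip(bin_ids, grads, hess):
        hist[b][0] += g
        hist[b][1] += h
    return hist

def build_all_histograms(binned_columns, grads, hess, n_bins, n_threads=4):
    """One task per feature; features are independent, so no locking is needed."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(
            lambda column: build_histogram(column, grads, hess, n_bins),
            binned_columns,
        ))

# Two features, four rows, each feature already discretized into two bins.
binned_columns = [[0, 0, 1, 1], [1, 0, 1, 0]]
grads = [1.0, 2.0, 3.0, 4.0]
hess = [1.0, 1.0, 1.0, 1.0]
hists = build_all_histograms(binned_columns, grads, hess, n_bins=2)
print(hists)
```

Once the histograms exist, split finding only needs to scan the bins rather than the rows.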

7. Handling Missing Values

XGBoost handles missing values through its sparsity-aware split finding (Section 5.3). In addition:

import xgboost as xgb
import numpy as np

# NaN values in X_train are handled natively
clf = xgb.XGBClassifier()
clf.fit(X_train, y_train)   # Works with NaN

# Custom missing value marker
clf = xgb.XGBClassifier(missing=-999)
clf.fit(X_train_with_minus999, y_train)

8. Monotonic Constraints and Interaction Constraints

Monotonic Constraints

Force the model to be monotonically increasing (+1) or decreasing (-1) with respect to specific features:

clf = xgb.XGBClassifier(
    monotone_constraints="(1, 0, -1, 0)"  # Feature 0 increasing, feature 2 decreasing
)
# Or as dict (XGBoost 1.6+)
clf = xgb.XGBClassifier(
    monotone_constraints={"age": 1, "debt_ratio": -1}
)

Implementation: After each split, XGBoost checks if the constraint is satisfied. If the right child's leaf value is not ≥ the left child's (for an increasing constraint), the split is rejected and the next best is tried.
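The local check described above can be sketched from the child statistics. This covers only the comparison of the two child values; how the library additionally bounds values deeper in the tree is not modeled here, and the function names are assumptions.

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf value w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_respects_constraint(GL, HL, GR, HR, constraint, lam=1.0):
    """constraint: +1 increasing, -1 decreasing, 0 unconstrained."""
    w_left, w_right = leaf_weight(GL, HL, lam), leaf_weight(GR, HR, lam)
    if constraint > 0:
        return w_left <= w_right
    if constraint < 0:
        return w_left >= w_right
    return True

# Left child wants a lower prediction (w = -0.5), right child a higher one (w = +0.5).
print(split_respects_constraint(2.0, 3.0, -2.0, 3.0, constraint=+1))  # True
print(split_respects_constraint(2.0, 3.0, -2.0, 3.0, constraint=-1))  # False
```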

Interaction Constraints

Restrict which features can interact, i.e. appear together on the same branch of a tree. Features from different groups are never combined along one decision path:

# Feature group 0: [0, 1, 2]  Feature group 1: [3, 4, 5]
# A branch can only combine features from one group, not across groups
clf = xgb.XGBClassifier(
    interaction_constraints="[[0,1,2],[3,4,5]]"
)

Useful for:


9. XGBoost for Multi-Class and Ranking

Multi-Class

# Softmax for multi-class probabilities
clf = xgb.XGBClassifier(
    objective='multi:softmax',   # Returns class labels
    num_class=5
)
# Or
clf = xgb.XGBClassifier(
    objective='multi:softprob',  # Returns class probabilities
    num_class=5
)

Like sklearn GBM, XGBoost trains K trees per round for K-class problems. The gradients are computed from the multinomial log-loss.
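The per-class gradients that drive those K trees can be sketched directly from the softmax. This is a plain-Python illustration using the diagonal of the Hessian, p_k(1 − p_k); a library's built-in objective may rescale or floor that term, so treat the exact Hessian values as an assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one sample's K raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_grad_hess(scores, label):
    """Per-class gradient (p_k - 1[k == label]) and diagonal Hessian p_k * (1 - p_k)."""
    p = softmax(scores)
    grad = [p_k - (1.0 if k == label else 0.0) for k, p_k in enumerate(p)]
    hess = [p_k * (1.0 - p_k) for p_k in p]
    return grad, hess

# Three classes, no information yet: every class has probability 1/3.
grad, hess = softmax_grad_hess([0.0, 0.0, 0.0], label=0)
print([round(g, 4) for g in grad])  # the true class gets a negative gradient
print([round(h, 4) for h in hess])
```

Tree k in a round is fit to the k-th component of these vectors, so the true class is pushed up and the others are pushed down.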

Ranking (LambdaMART)

dtrain = xgb.DMatrix(X_train, label=y_relevance, qid=query_ids)

params = {
    'objective': 'rank:pairwise',   # LambdaRank
    'eval_metric': 'ndcg',
    'lambdarank_num_pair_per_sample': 8
}
model = xgb.train(params, dtrain)

XGBoost implements LambdaMART, one of the strongest learning-to-rank algorithms. It is used in search engines and recommendation systems.


10. Hyperparameters: Complete Reference

10.1 General Parameters

| Parameter | Description | Default |
|---|---|---|
| booster | 'gbtree', 'gblinear', 'dart' | gbtree |
| nthread | Number of parallel threads | max |
| verbosity | 0 (silent) to 3 (debug) | 1 |
| seed | Random seed | 0 |

10.2 Booster Parameters (gbtree)

| Parameter | Description | Default | Notes |
|---|---|---|---|
| learning_rate (eta) | Shrinkage; most important param | 0.3 | Typical: 0.01–0.1 |
| n_estimators | Number of boosting rounds | 100 | Use early stopping |
| max_depth | Maximum tree depth | 6 | Typical: 3–8 |
| min_child_weight | Min Hessian sum per leaf | 1 | Key regularizer for noise |
| gamma | Min gain for a split | 0 | 0–20; prunes low-gain splits |
| subsample | Row sampling fraction per tree | 1.0 | 0.5–0.9 typical |
| colsample_bytree | Feature fraction per tree | 1.0 | 0.5–0.9 typical |
| colsample_bylevel | Feature fraction per depth level | 1.0 | Additional randomization |
| colsample_bynode | Feature fraction per split node | 1.0 | Most granular; like RF |
| reg_alpha | L1 regularization on leaf weights | 0 | For sparse feature importance |
| reg_lambda | L2 regularization on leaf weights | 1.0 | Most important L2 regularizer |
| max_delta_step | Max absolute leaf value (helps class imbalance) | 0 | Set 1–10 for severe imbalance |
| tree_method | 'exact', 'approx', 'hist' ('gpu_hist' before 2.0) | 'auto' ('hist' from 2.0) | 'hist' for large data; GPU via device='cuda' in 2.0+ |
| scale_pos_weight | Positive class weight for imbalance | 1 | Set to neg/pos ratio |
| grow_policy | 'depthwise' or 'lossguide' | depthwise | lossguide = leaf-wise (as in LightGBM) |
| max_leaves | Max leaves (only for lossguide) | 0 | Like LightGBM's num_leaves |

10.3 Learning Task Parameters

| Parameter | Description | Default |
|---|---|---|
| objective | Loss function (see table below) | reg:squarederror |
| eval_metric | Metric for evaluation/early stopping | auto |
| base_score | Initial prediction (global bias) | 0.5 |
| seed | Random seed | 0 |

Common objectives:

| Objective | Task |
|---|---|
| binary:logistic | Binary classification (probs) |
| binary:logitraw | Binary classification (log-odds) |
| multi:softmax | Multi-class (class labels) |
| multi:softprob | Multi-class (probabilities) |
| reg:squarederror | Regression (MSE) |
| reg:absoluteerror | Regression (MAE) |
| reg:pseudohubererror | Regression (Huber) |
| reg:quantileerror | Quantile regression |
| rank:pairwise | Ranking (LambdaRank) |
| rank:ndcg | Ranking (LambdaNDCG) |
| survival:cox | Survival analysis |
| Custom function | Any twice-differentiable loss |

11. Feature Importance Types

XGBoost provides five built-in importance types (three base metrics, plus "total" variants of gain and cover):

# Weight: number of times a feature is used in a split
clf.get_booster().get_score(importance_type='weight')

# Gain: average training loss reduction when feature is used in splits
clf.get_booster().get_score(importance_type='gain')   # Most informative

# Cover: average number of samples (measured by Hessian sum) in splits using this feature
clf.get_booster().get_score(importance_type='cover')

# Total gain / total cover (sum instead of average)
clf.get_booster().get_score(importance_type='total_gain')
clf.get_booster().get_score(importance_type='total_cover')

Recommendation: Use gain as the default. weight is biased toward features with many possible split values (continuous features). For production interpretability, prefer SHAP values over all built-in metrics.
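How the importance types relate to each other can be sketched from a list of split records. The record format `(feature, gain, cover)` and the function `importance` are assumptions made for this illustration; they do not mirror XGBoost's internal tree representation.

```python
from collections import defaultdict

def importance(splits, importance_type="gain"):
    """Aggregate (feature, gain, cover) split records into one importance score per feature."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover
    if importance_type == "weight":
        return dict(count)
    if importance_type == "total_gain":
        return dict(total_gain)
    if importance_type == "total_cover":
        return dict(total_cover)
    if importance_type == "gain":
        return {f: total_gain[f] / count[f] for f in count}
    if importance_type == "cover":
        return {f: total_cover[f] / count[f] for f in count}
    raise ValueError(f"unknown importance_type: {importance_type}")

# "age" is split on twice (one strong split, one weak); "income" once.
splits = [("age", 10.0, 100.0), ("age", 2.0, 20.0), ("income", 9.0, 80.0)]
for kind in ("weight", "gain", "total_gain"):
    print(kind, importance(splits, kind))
```

In this toy example "age" ranks first by weight and total gain, while "income" ranks first by average gain, which is why the choice of importance type can change the ranking.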

import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)
shap.waterfall_plot(explainer(X_test)[0])

12. Dart Booster

DART (Dropouts meet Multiple Additive Regression Trees) is XGBoost's variant that applies dropout (from neural networks) to gradient boosting:

clf = xgb.XGBClassifier(
    booster='dart',
    rate_drop=0.1,     # Fraction of trees to drop per round
    skip_drop=0.5,     # Probability of skipping dropout for a round
    sample_type='uniform',   # Or 'weighted'
    normalize_type='tree'    # Or 'forest'
)

Mechanism: During each boosting round, randomly drop a subset of existing trees and train the new tree on the residuals of the remaining trees. The dropped trees are then re-added with rescaled weights.

Effect: Prevents any single tree from being over-relied upon, a form of ensemble regularization. It can generalize better than gbtree on noisy datasets.

Caveats:


13. Linear Booster

XGBoost can use a linear model as the base learner instead of trees:

clf = xgb.XGBClassifier(
    booster='gblinear',
    reg_alpha=0.1,   # L1 (lasso)
    reg_lambda=1.0,  # L2 (ridge)
    updater='shotgun'  # Or 'coord_descent'
)

When it's useful: High-dimensional sparse data (text classification) where linear models are appropriate and tree-based splits provide no advantage. Effectively implements regularized logistic regression via boosting. Rarely competitive with LightGBM or sklearn's SGD classifiers in this regime.


14. The Bias-Variance Profile

XGBoost's second-order approximation and explicit regularization (γ, λ, α, min_child_weight) give it finer-grained bias-variance control than sklearn's GradientBoostingClassifier:

High learning_rate + low n_estimators → high bias (underfits)
Low learning_rate + high n_estimators → low bias, needs regularization to control variance
High gamma                            → high bias (aggressive pruning)
High lambda / min_child_weight        → reduced variance (smoother leaf values)
Low subsample / colsample             → more variance reduction (stochastic boosting)

Empirically, configurations that perform well tend to fall in these ranges:

  learning_rate: 0.01–0.05
  n_estimators:  500–3000 (found via early stopping)
  max_depth:     4–8
  min_child_weight: 1–10 (tune this; it's often the most impactful after learning_rate)
  subsample:     0.7–0.9
  colsample_bytree: 0.5–0.8
  reg_lambda:    0.5–5.0

15. Assumptions

| Assumption | Notes |
|---|---|
| Twice-differentiable loss | Required for gradient AND Hessian |
| IID samples | Standard supervised learning assumption |
| No feature scaling needed | Tree splits are scale-invariant |
| No distributional assumption | Non-parametric; no normality or linearity required |
| No extrapolation | Tree-based; flat outside training range |
| Moderate noise tolerance | Better than AdaBoost; λ and min_child_weight limit how far a few noisy samples can move a leaf value |

16. Advantages

✅ Best-in-Class Accuracy (Tabular Data)

Consistently achieves top performance on tabular ML benchmarks. The standard to beat.

✅ Second-Order Approximation

More accurate leaf values and split decisions than first-order GBM. Principled regularization via the split gain formula.

✅ Flexible Loss Functions

Any twice-differentiable loss, including fully custom Python objectives.

✅ GPU Training

tree_method='gpu_hist' (tree_method='hist' with device='cuda' in XGBoost 2.0+) can provide 10–100x speedup on large datasets.

✅ Native Missing Value Handling

Sparsity-aware split finding: no imputation, learns optimal default directions.

✅ Rich Regularization

γ (pruning), λ (L2), α (L1), min_child_weight, subsample, colsample: multiple orthogonal regularization axes.

✅ Multiple Feature Importance Types

Weight, gain, cover, plus full SHAP support via shap.TreeExplainer.

✅ Extensive Ecosystem

sklearn API via XGBClassifier, DMatrix native API, Spark/Dask/Ray integration, ONNX export, cuML compatibility.

✅ Early Stopping

Built-in with eval_set and early_stopping_rounds.

✅ Monotonic and Interaction Constraints

For regulated or domain-constrained models.


17. Drawbacks & Limitations

❌ Slower Than LightGBM on Large Datasets

LightGBM's leaf-wise growth and GOSS sampling are faster than XGBoost's depth-wise approach at scale (> 500k rows). XGBoost's gpu_hist closes this gap on GPU.

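❌ Limited Native Categorical Support

Historically, categoricals had to be one-hot or ordinal encoded manually. Newer releases accept categorical columns via enable_categorical=True, but the feature was long marked experimental. CatBoost handles categoricals natively and usually outperforms when they dominate.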

❌ Many Hyperparameters to Tune

More than sklearn GBM, though the defaults are reasonable. Tuning XGBoost well requires understanding the interaction between eta, max_depth, min_child_weight, gamma, and the regularization parameters.

❌ Sequential Training

Like all boosting, each tree depends on the previous one, so training cannot be parallelized across trees. Internal feature parallelism helps but doesn't scale as linearly as Random Forest.

❌ No Extrapolation

Flat predictions outside the training range, a limitation inherited from decision trees.

❌ Memory for Column Blocks

Storing sorted column blocks requires O(m · p) additional memory, which can be 2–3x the raw data size.


18. XGBoost vs. LightGBM vs. CatBoost

| Property | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed (CPU) | Fast | ✅✅ Fastest | Fast |
| Speed (GPU) | ✅ Fast (gpu_hist) | ✅ Fast | ✅ Fast |
| Memory | High (column blocks) | Low (histograms) | Moderate |
| Tree growth | Depth-wise (default) | Leaf-wise | Oblivious (symmetric) |
| 2nd order (Hessian) | ✅ Yes | ✅ Yes | ⚠️ Partial (Newton leaf estimation) |
| Categorical features | ⚠️ Experimental (enable_categorical) | ✅ Native (categorical_feature) | ✅✅ Native ordered |
| Missing values | ✅ Native (sparse) | ✅ Native | ✅ Native |
| Monotonic constraints | ✅ Yes | ✅ Yes | ✅ Yes |
| Custom loss | ✅ Yes (grad+hess) | ✅ Yes | ✅ Yes |
| Regularization | ✅ Rich (γ,λ,α,mcw) | ✅ Good | ✅ Good |
| SHAP support | ✅ Best (TreeExplainer) | ✅ Very good | ✅ Good |
| sklearn API | ✅ XGBClassifier | ✅ LGBMClassifier | ✅ CatBoostClassifier |
| Production maturity | ✅✅ Very high | ✅ High | ✅ High |
| Best for | General, imbalanced | Very large data | Categorical-heavy |

19. Practical Tips & Gotchas

Canonical Fast Setup (sklearn API)

import xgboost as xgb

clf = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=5,
    subsample=0.8,
    colsample_bytree=0.7,
    reg_alpha=0.1,
    reg_lambda=2.0,
    scale_pos_weight=1,          # Adjust for class imbalance
    tree_method='hist',          # Fast for medium+ datasets
    eval_metric='logloss',
    early_stopping_rounds=50,    # Constructor argument in XGBoost 1.6+
    n_jobs=-1,
    random_state=42
)

clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=100
)
print(f"Best round: {clf.best_iteration}, Best score: {clf.best_score}")

Native DMatrix API (Faster for Large Data)

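# Give every DMatrix the same feature_names so evaluation and prediction see matching columns
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval   = xgb.DMatrix(X_val,   label=y_val,   feature_names=feature_names)
dtest  = xgb.DMatrix(X_test,  feature_names=feature_names)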

params = {
    'objective':        'binary:logistic',
    'eval_metric':      'auc',
    'eta':              0.05,
    'max_depth':        6,
    'min_child_weight': 5,
    'subsample':        0.8,
    'colsample_bytree': 0.7,
    'reg_lambda':       2.0,
    'reg_alpha':        0.1,
    'tree_method':      'hist',
    'seed':             42
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
    verbose_eval=100
)

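# Predict with the trees up to the best early-stopping round
preds = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))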

Custom Objective Function

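def focal_loss_objective(y_pred, dtrain):
    """Focal loss: downweights easy examples for class imbalance.

    y_pred holds raw margins (log-odds); returns per-sample (grad, hess).
    """
    y_true = dtrain.get_label()
    gamma = 2.0
    alpha = 0.25
    eps = 1e-7

    p = 1 / (1 + np.exp(-y_pred))
    pt = np.clip(np.where(y_true == 1, p, 1 - p), eps, 1 - eps)
    at = np.where(y_true == 1, alpha, 1 - alpha)
    # d(pt)/d(margin) = sign * pt * (1 - pt): +1 for positives, -1 for negatives
    sign = np.where(y_true == 1, 1.0, -1.0)

    # Bracket term shared by the gradient and the Hessian
    u = gamma * pt * np.log(pt) + pt - 1

    # Gradient with respect to the raw margin
    grad = sign * at * (1 - pt)**gamma * u

    # Hessian with respect to the raw margin. Focal loss is not convex
    # everywhere, so floor the value to keep leaf weights well defined.
    hess = at * pt * (1 - pt)**gamma * (
        (1 - pt) * (gamma * np.log(pt) + gamma + 1) - gamma * u
    )
    hess = np.maximum(hess, eps)

    return grad, hess

model = xgb.train(params, dtrain, obj=focal_loss_objective)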

Class Imbalance

# Method 1: scale_pos_weight (simplest)
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
clf = xgb.XGBClassifier(scale_pos_weight=neg_count/pos_count)

# Method 2: max_delta_step (sometimes helps with severe imbalance)
clf = xgb.XGBClassifier(max_delta_step=1)

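# Method 3: Adjust decision threshold post-hoc
from sklearn.metrics import precision_recall_curve
probs = clf.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, probs)
# Pick threshold that maximizes F1 (the final precision/recall pair has no threshold)
f1 = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-12)
best_threshold = thresholds[f1.argmax()]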

GPU Training

clf = xgb.XGBClassifier(
    tree_method='hist',       # Histogram algorithm
    device='cuda',            # XGBoost 2.0+ syntax; older versions use tree_method='gpu_hist' instead
    n_estimators=2000,
    early_stopping_rounds=50
)

20. When to Use It

Use XGBoost when:

Consider LightGBM instead when:

Consider CatBoost instead when:


Summary

┌──────────────────────────────────────────────────────────────────────┐
│                         XGBOOST AT A GLANCE                          │
├──────────────────────────────────────────────────────────────────────┤
│  CORE MATH    2nd-order Taylor expansion → closed-form split gain    │
│  SPLIT GAIN   ½[GL²/(HL+λ) + GR²/(HR+λ) − G²/(H+λ)] − γ              │
│  LEAF VALUE   w* = −G / (H + λ)                                      │
│  REGULARIZE   γ (pruning), λ (L2), α (L1), min_child_weight          │
│  MISSING      Sparsity-aware: learns default direction per split     │
│  GPU          tree_method='gpu_hist': 10–100x speedup                │
│  BEST PARAMS  LR=0.01–0.05 + n_est via ES + max_depth=4–8            │
│  STRENGTH     Accuracy, flexibility, regularization, SHAP, ranking   │
│  WEAKNESS     Slower than LGB at scale, weaker categorical handling  │
│  BEST FOR     General-purpose tabular, custom objectives, ranking    │
└──────────────────────────────────────────────────────────────────────┘

XGBoost is the algorithm that taught the ML community that engineering and mathematics are not separate concerns: they compound. The second-order Taylor expansion is mathematically elegant; the column block is a systems insight; the weighted quantile sketch bridges both. The result was an algorithm that was simultaneously more principled and faster than its predecessors, showing that theoretical depth and engineering pragmatism reinforce each other. Every competitive ML practitioner needs to understand it at this level.
