LightGBM

1. What Is LightGBM?

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework developed by Microsoft Research and presented at NeurIPS 2017. It optimizes the same gradient boosting objective as XGBoost but typically trains much faster, particularly on large datasets, through three algorithmic innovations: histogram-based split finding, leaf-wise tree growth, and data and feature reduction via Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

LightGBM has become the dominant gradient boosting library for large tabular datasets:

Fastest CPU training among major GBT implementations on most large datasets
Lowest memory footprint (histogram bins vs. sorted column blocks)
Best scalability — used at Microsoft, Alibaba, and other large-scale production systems
Frequently the top-performing algorithm in Kaggle competitions since 2018

2. The Three Core Innovations

LightGBM introduces three independent innovations, each solving a different bottleneck:

Problem 1: Split finding is O(m·p) per tree — slow for large m and p
Solution:  Histogram-based split finding — O(B·p) where B=255 << m

Problem 2: Level-wise growth wastes computation on low-gain leaves
Solution:  Leaf-wise growth — always split the highest-gain leaf regardless of depth

Problem 3: All m samples used for each tree — redundant for well-classified examples
Solution:  GOSS — keep all large-gradient samples, subsample small-gradient ones

Bonus:     p may be very large with many mutually exclusive features
Solution:  EFB — bundle mutually exclusive features to reduce effective p

Each innovation is independent of the others, and they can be combined freely. Together, they can make LightGBM 5–20x faster than XGBoost on CPU for large datasets while typically achieving comparable or better accuracy.

3. Histogram-Based Split Finding

3.1 Building the Histogram

Instead of sorting continuous feature values and evaluating every unique threshold (O(m·log m) per feature), LightGBM first bins each feature into at most max_bin discrete buckets:

Continuous values: [0.13, 0.87, 0.42, 0.19, 0.65, 0.33, ...]  (millions of values)
Bin boundaries:    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  (9 boundaries → 10 bins)
Binned values:     [bin 1, bin 8, bin 4, bin 1, bin 6, bin 3, ...]  (0-based: bin k covers [0.1·k, 0.1·(k+1)))

Binning is computed once at the start of training (O(m·p)), and all trees then reuse the same bins. For each node in each tree, LightGBM builds a gradient histogram over the bins:

For feature f, bin b:
    hist[f][b].grad_sum = Σ_{i: bin(xᵢf) = b} gᵢ    (sum of gradients in this bin)
    hist[f][b].hess_sum = Σ_{i: bin(xᵢf) = b} hᵢ    (sum of Hessians in this bin)

Split finding then scans the 255 bins (not millions of samples):

For each bin b from 1 to B-1:
    GL = Σ_{b'≤b} hist[f][b'].grad_sum
    HL = Σ_{b'≤b} hist[f][b'].hess_sum
    GR = G_total - GL
    HR = H_total - HL
    Gain(b) = ½ · [GL²/(HL+λ) + GR²/(HR+λ) − G_total²/(H_total+λ)] − γ

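Split-finding cost per tree node: O(B·p) = O(255·p) threshold evaluations, independent of m. For p=100 features, that is 25,500 evaluations whether m is 10,000 or 10,000,000. Building the histograms still takes one pass over the samples in the node, but that pass is a cheap accumulation into bins rather than a sort.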
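The binning, histogram, and scan steps above can be sketched in plain Python. This is a minimal illustration for a single feature, not LightGBM's implementation: the function names, the gradient and Hessian values, and the choice λ=1, γ=0 are all assumptions made for the example.

```python
from bisect import bisect_right

def bin_index(x, boundaries):
    """0-based bin of x: the number of boundaries that are <= x."""
    return bisect_right(boundaries, x)

def build_histogram(bins, grads, hess, n_bins):
    """Accumulate gradient and Hessian sums per bin in one pass over the samples."""
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for b, g, h in zip(bins, grads, hess):
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist

def best_split(g_hist, h_hist, lam=1.0, gamma=0.0):
    """Scan the B-1 bin boundaries; the cost depends on B, not on the sample count."""
    g_tot, h_tot = sum(g_hist), sum(h_hist)
    parent = g_tot ** 2 / (h_tot + lam)
    best_gain, best_bin = float("-inf"), None
    gl = hl = 0.0
    for b in range(len(g_hist) - 1):
        gl += g_hist[b]
        hl += h_hist[b]
        gr, hr = g_tot - gl, h_tot - hl
        gain = 0.5 * (gl ** 2 / (hl + lam) + gr ** 2 / (hr + lam) - parent) - gamma
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

boundaries = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
values = [0.13, 0.87, 0.42, 0.19, 0.65, 0.33]
bins = [bin_index(v, boundaries) for v in values]   # [1, 8, 4, 1, 6, 3]
grads = [-1.0, 1.0, -1.0, -1.0, 1.0, -1.0]          # hypothetical gradients
hess = [1.0] * len(values)                          # hypothetical Hessians

g_hist, h_hist = build_histogram(bins, grads, hess, len(boundaries) + 1)
split_bin, gain = best_split(g_hist, h_hist)        # split_bin == 4: samples in bins 0..4 go left
```

With these toy gradients the scan places the split after bin 4, which separates the four negative-gradient samples from the two positive-gradient ones.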

3.2 Histogram Subtraction Trick

When splitting a parent node into left and right children, building both children's histograms from scratch requires O(m_left · p + m_right · p) operations. LightGBM exploits a shortcut:

Parent histogram = Left child histogram + Right child histogram
→ Right child histogram = Parent histogram − Left child histogram

Build the smaller child's histogram from scratch (O(min(m_left, m_right) · p)), then subtract it from the parent histogram (O(B·p)) to get the larger child's histogram. Because the smaller child holds at most half of the parent's samples, this at least halves the histogram construction work.
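A short sketch can confirm the identity. It is an illustration only: the bin assignments, gradients, and split mask below are made up, and only gradient sums are tracked (Hessian sums behave identically).

```python
def build_histogram(bins, grads, n_bins):
    """Sum gradients per bin for the given samples."""
    hist = [0.0] * n_bins
    for b, g in zip(bins, grads):
        hist[b] += g
    return hist

# Hypothetical pre-binned feature, gradients, and split assignment.
bins = [0, 2, 1, 2, 0, 1, 2, 2]
grads = [0.5, -1.0, 0.25, 2.0, -0.5, 1.0, 0.75, -0.25]
go_left = [True, False, True, False, False, False, False, False]

parent = build_histogram(bins, grads, 3)

# Build only the smaller child (2 samples) from scratch.
left = build_histogram(
    [b for b, l in zip(bins, go_left) if l],
    [g for g, l in zip(grads, go_left) if l],
    3,
)

# The larger child costs O(B): one subtraction per bin, no pass over its 6 samples.
right = [p - l for p, l in zip(parent, left)]

# Reference: the larger child built the slow way.
right_direct = build_histogram(
    [b for b, l in zip(bins, go_left) if not l],
    [g for g, l in zip(grads, go_left) if not l],
    3,
)
```

The values here are exactly representable in floating point, so `right` and `right_direct` match exactly; with arbitrary floats they agree up to rounding.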

3.3 Accuracy vs. Speed Tradeoff

More bins give finer split thresholds but slower training (more candidates per feature):

max_bin = 15:    Very fast, lower accuracy (coarse splits)
max_bin = 63:    Fast, good accuracy
max_bin = 255:   Default; excellent accuracy, fast
max_bin = 512:   Slower, marginal accuracy gain for most data

The accuracy loss from binning is typically negligible in practice: the exact position of a threshold between two nearby training values rarely matters for generalization, and coarse thresholds can even act as mild regularization.

4. Leaf-Wise Tree Growth (Best-First)

4.1 Level-Wise vs. Leaf-Wise

Level-wise (XGBoost default, sklearn GBM): Grow all leaves at depth d before going to depth d+1.

Level 1:   Split root node
Level 2:   Split both children (regardless of which has higher gain)
Level 3:   Split all 4 nodes at depth 2
...

Leaf-wise (LightGBM default): Always split the leaf with the highest gain, regardless of depth or tree balance.

Start:      Root node (gain = 100)
Step 1:     Split root → left (gain=80), right (gain=30)
Step 2:     Split LEFT (80 > 30) → left-left (gain=60), left-right (gain=15)
Step 3:     Split LEFT-LEFT (60 > 30 > 15) → ...

This produces asymmetric trees — one branch can be much deeper than another.
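Best-first growth is naturally expressed with a priority queue keyed on split gain. The sketch below is a toy simulation rather than LightGBM's tree learner: the gain table is hypothetical (it extends the example above with made-up gains for deeper nodes), and `grow_leaf_wise` is an illustrative name.

```python
import heapq

# Hypothetical best-split gains of the two leaves created when a node is split.
CHILD_GAINS = {
    "root": {"L": 80.0, "R": 30.0},
    "L": {"LL": 60.0, "LR": 15.0},
    "LL": {"LLL": 5.0, "LLR": 2.0},
    "R": {"RL": 10.0, "RR": 8.0},
}

def grow_leaf_wise(num_leaves):
    """Return the order in which leaves are split under best-first growth."""
    heap = [(-100.0, "root")]        # max-heap via negated gain
    n_leaves, order = 1, []
    while heap and n_leaves < num_leaves:
        _, node = heapq.heappop(heap)
        if node not in CHILD_GAINS:
            continue                 # toy table has no further split for this leaf
        order.append(node)
        n_leaves += 1                # one leaf is replaced by two
        for child, gain in CHILD_GAINS[node].items():
            heapq.heappush(heap, (-gain, child))
    return order

order = grow_leaf_wise(num_leaves=5)   # ['root', 'L', 'LL', 'R']
```

The left branch reaches depth 3 before the right child of the root is split at all, which is the asymmetry described above.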

4.2 Why Leaf-Wise Is More Efficient

For the same number of leaves (same model complexity), leaf-wise growth typically achieves lower training loss than level-wise growth.

Level-wise with depth 4:   16 leaves (2^4)
Leaf-wise with 16 leaves:  Lower loss (same capacity, better allocation)

The gain comes from concentration: leaf-wise always spends its "split budget" on the region where it will reduce loss the most. Level-wise wastes splits on low-gain regions at the same depth as high-gain ones.

Empirically: Leaf-wise typically needs fewer trees to achieve the same loss as level-wise, reducing training time proportionally.

4.3 Overfitting Risk and Control

Leaf-wise trees can grow very deep in one branch, potentially memorizing specific training samples. This is controlled by num_leaves — the single most important hyperparameter in LightGBM:

num_leaves:  Maximum total leaves across the entire tree
             (LightGBM stops adding leaves once this is reached)

Analogy:  num_leaves in LightGBM ≈ max_depth in level-wise trees
          BUT: num_leaves provides finer-grained capacity control

Key constraint: when max_depth is set, keep num_leaves ≤ 2^max_depth; a tree of depth d cannot hold more than 2^d leaves, so a larger num_leaves has no effect.

Typical settings:
  num_leaves = 31:   Conservative, ~depth 5 equivalent, good for small-medium data
  num_leaves = 63:   Moderate, ~depth 6
  num_leaves = 127:  Aggressive, large datasets
  num_leaves = 255:  Deep trees, needs strong regularization

With leaf-wise growth, num_leaves replaces max_depth as the primary complexity control. The most common LightGBM tuning mistake is setting max_depth and forgetting num_leaves, which then stays at its default of 31.
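The interaction between the two limits is simple arithmetic, sketched below. The helper name `effective_leaf_cap` is hypothetical; it only encodes the fact that a binary tree of depth d has at most 2^d leaves and that LightGBM treats max_depth ≤ 0 as "no depth limit".

```python
def effective_leaf_cap(num_leaves, max_depth):
    """Largest leaf count a tree can reach when both limits apply."""
    if max_depth <= 0:               # no depth limit
        return num_leaves
    return min(num_leaves, 2 ** max_depth)

# Setting only max_depth=6 leaves num_leaves at its default of 31,
# so the tree uses fewer than half of the 64 leaves the depth would allow.
cap_depth_only = effective_leaf_cap(num_leaves=31, max_depth=6)     # 31

# Raising num_leaves without a depth limit is what actually adds capacity.
cap_leaves_only = effective_leaf_cap(num_leaves=127, max_depth=-1)  # 127

# A num_leaves above 2^max_depth is silently capped by the depth limit.
cap_conflict = effective_leaf_cap(num_leaves=127, max_depth=5)      # 32
```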

5. Gradient-Based One-Side Sampling (GOSS)

5.1 The Insight

In gradient boosting, the gradient gᵢ of a well-classified sample is small — the model has already learned to predict it correctly and changing the split decision matters little for this sample. The gradient is large for samples the current model gets badly wrong — these drive the learning.

GOSS observation: Samples with large gradients contribute disproportionately to the information gain of each split. Small-gradient samples are "easy" — we can safely ignore most of them without losing much split quality.

5.2 The Algorithm

At each boosting round:

1. Sort all samples by |gradient|
2. Keep the top a fraction (large gradients) → set A
3. From the remaining samples, randomly draw b·m of them (b is a fraction of the full dataset) → set B
4. Amplify B's contribution by weight (1-a)/b to compensate for undersampling
5. Compute gain using A ∪ B (with weights)

Sample count used: a·m + b·m = (a + b)·m  << m for small a, b

For example, a=0.2, b=0.1 (LightGBM's defaults for top_rate and other_rate):

Large-gradient samples:  20% of m (all kept)
Small-gradient samples:  10% of m, drawn from the remaining 80% (upweighted (1-0.2)/0.1 = 8x)
Total used per tree:      30% of m, so ~3.3x fewer samples in histogram construction
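The sampling step can be sketched in a few lines of standard-library Python. This follows the GOSS paper's convention that a and b are both fractions of the full dataset; `goss_sample` and the synthetic Gaussian gradients are assumptions for illustration, not LightGBM's code.

```python
import random

def goss_sample(grads, a, b, seed=0):
    """Return (indices, weights) of the samples used to grow one tree."""
    m = len(grads)
    # Rank samples by absolute gradient, largest first.
    order = sorted(range(m), key=lambda i: abs(grads[i]), reverse=True)
    top_n, rand_n = round(a * m), round(b * m)
    top = order[:top_n]                       # set A: always kept
    rest = order[top_n:]
    sampled = random.Random(seed).sample(rest, rand_n)   # set B
    amplify = (1 - a) / b                     # restores the weight of the small-gradient mass
    indices = top + sampled
    weights = [1.0] * top_n + [amplify] * rand_n
    return indices, weights

rng = random.Random(42)
grads = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # synthetic gradients
idx, w = goss_sample(grads, a=0.2, b=0.1)
# 300 of 1000 samples are used; the weights sum to about 1000,
# so the weighted sample still stands in for the full dataset.
```

The weight (1-a)/b is what keeps the gradient statistics approximately unbiased: each sampled small-gradient row represents (1-a)/b rows of the part that was subsampled.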

5.3 Theoretical Guarantee

The GOSS paper proves an approximation bound:

|Gain_GOSS − Gain_full| ≤ O(1/√m)    with high probability

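The error in the estimated gain decreases as the dataset grows, so GOSS becomes more accurate on larger datasets. Smaller datasets should use full sampling. Note that GOSS is opt-in: the default is data_sample_strategy='bagging', and GOSS is enabled with data_sample_strategy='goss' (boosting_type='goss' in LightGBM versions before 4.0).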

In sklearn: HistGradientBoostingClassifier (HGBC) does not implement GOSS. Among the major gradient boosting libraries, the GOSS algorithm as described here is specific to LightGBM.

6. Exclusive Feature Bundling (EFB)

6.1 The Problem

High-dimensional sparse datasets (text, one-hot encoded categoricals, interaction features) may have p = 50,000+ features. Even with histogram bins, O(B·p) = O(255 · 50,000) per split is slow.

Key observation: In sparse data, many features are mutually exclusive — they are never both non-zero for the same sample. For example, in one-hot encoded data, exactly one category feature is non-zero per sample.

6.2 Finding Exclusive Bundles

EFB frames bundling as a graph coloring problem:

Build a graph where:
    Nodes = features
    Edges = (fᵢ, fⱼ) if fᵢ and fⱼ are sometimes both non-zero

Find a graph coloring (assign each node a color/bundle such that
no two adjacent nodes have the same color)

Each color = one bundle → mutually exclusive features within each bundle

Exact graph coloring is NP-hard, so EFB uses a greedy approximation. A conflict-rate threshold allows features that are "almost" exclusive (non-zero together for less than a small fraction of samples) to be bundled. Older LightGBM releases exposed this threshold as max_conflict_rate; in recent releases the documented control is an on/off switch:

lgb.LGBMClassifier(enable_bundle=True)   # EFB is enabled by default; set False to disable
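The greedy approximation can be sketched as follows. This is a simplified illustration under stated assumptions, not LightGBM's bundling code: `greedy_bundles` is a hypothetical name, conflicts are counted in rows rather than as a rate, and features are visited in order of non-zero count, a cheap stand-in for ordering by graph degree.

```python
def greedy_bundles(columns, max_conflicts=0):
    """Put each feature into the first bundle it conflicts with on at most
    max_conflicts rows; open a new bundle if none qualifies."""
    nonzero = [{i for i, v in enumerate(col) if v != 0} for col in columns]
    # Visit features with the most non-zero rows first.
    order = sorted(range(len(columns)), key=lambda f: len(nonzero[f]), reverse=True)
    bundles = []   # list of (feature ids, rows already non-zero in the bundle)
    for f in order:
        for feats, used in bundles:
            if len(used & nonzero[f]) <= max_conflicts:
                feats.append(f)
                used |= nonzero[f]     # in-place update of the bundle's row set
                break
        else:
            bundles.append(([f], set(nonzero[f])))
    return [sorted(feats) for feats, _ in bundles]

# Four toy sparse features over six rows.
columns = [
    [1, 0, 0, 0, 3, 0],   # feature 0
    [0, 4, 0, 2, 0, 0],   # feature 1
    [0, 0, 5, 0, 0, 0],   # feature 2
    [2, 1, 0, 0, 0, 1],   # feature 3 (overlaps features 0 and 1 on one row each)
]

strict = greedy_bundles(columns)                    # [[2, 3], [0, 1]]
relaxed = greedy_bundles(columns, max_conflicts=1)  # [[0, 1, 2, 3]]
```

Allowing a single conflicting row per merge collapses four features into one bundle here, at the cost of a small amount of information loss on the conflicting rows.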

6.3 Merging Features into Bundles

Once bundles are found, EFB merges features by offset addition:

Feature A: [1, 0, 0, 0, 3, 0]   (values 0–3, kept as-is)
Feature B: [0, 4, 0, 2, 0, 0]   (values 0–4, non-zero values offset by 4)
Bundle:    [1, 8, 0, 6, 3, 0]   (B's non-zero values shifted: 4+4=8, 2+4=6; zeros stay 0)

The merged bundle preserves all information — B's values are recoverable by subtracting the offset. LightGBM treats the bundle as a single feature with a wider bin range.
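A small round-trip sketch makes the offset encoding concrete. It assumes non-negative integer features where A's non-zero values are all below the offset; `merge_bundle` and `split_bundle` are illustrative names, not LightGBM functions.

```python
def merge_bundle(feature_a, feature_b, offset):
    """Merge two mutually exclusive features; B's non-zero values are shifted by offset."""
    merged = []
    for a, b in zip(feature_a, feature_b):
        if a != 0 and b != 0:
            raise ValueError("features are not mutually exclusive on this row")
        if a != 0:
            merged.append(a)
        elif b != 0:
            merged.append(b + offset)
        else:
            merged.append(0)
    return merged

def split_bundle(merged, offset):
    """Recover both features: values below offset belong to A, values above it to B."""
    a = [v if v < offset else 0 for v in merged]
    b = [v - offset if v > offset else 0 for v in merged]
    return a, b

feature_a = [1, 0, 0, 0, 3, 0]
feature_b = [0, 4, 0, 2, 0, 0]

bundle = merge_bundle(feature_a, feature_b, offset=4)   # [1, 8, 0, 6, 3, 0]
recovered_a, recovered_b = split_bundle(bundle, offset=4)
```

Because the two value ranges do not overlap in the bundle, any threshold on the merged feature corresponds to a threshold on exactly one of the original features.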

Result: For highly sparse data with many exclusive features, EFB can reduce the effective feature count by 10–100x, providing proportional speedup in split finding.

7. Categorical Feature Handling

LightGBM has a built-in categorical feature handling that avoids one-hot encoding:

Method: For each categorical feature at each split, LightGBM tries to find the optimal binary partition of all categories into two groups:

Categories: {A, B, C, D, E, F}
Find partition: {A, C, E} vs. {B, D, F}  that maximizes gain

The partition is found efficiently by sorting the k categories by their gradient-to-Hessian ratio (sum of gradients divided by sum of Hessians) and evaluating only the k−1 partitions that respect that ordering.

This is much better than one-hot encoding for high-cardinality categoricals:

One-hot encoding:  k new binary features → O(k) split candidates
LightGBM native:   1 feature → O(2^k) possible partitions, approximated in O(k) sorted splits

The sorted partition approach finds the approximately optimal binary split in O(k·log k) — effective for up to a few thousand categories.
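The sorted-partition search can be sketched as below. This is an illustration under assumptions: the per-category gradient and Hessian sums are invented, λ=1 is an arbitrary choice, and LightGBM's real implementation adds smoothing and other safeguards (such as cat_smooth and cat_l2) that are omitted here.

```python
def best_category_partition(cat_stats, lam=1.0):
    """cat_stats maps category -> (gradient_sum, hessian_sum).
    Returns the set of categories sent left and the resulting gain."""
    # Sort categories by gradient/Hessian ratio: O(k log k).
    order = sorted(cat_stats, key=lambda c: cat_stats[c][0] / cat_stats[c][1])
    g_tot = sum(g for g, _ in cat_stats.values())
    h_tot = sum(h for _, h in cat_stats.values())
    parent = g_tot ** 2 / (h_tot + lam)
    best_gain, best_left = float("-inf"), None
    gl = hl = 0.0
    # Evaluate only the k-1 prefixes of the sorted order.
    for i, c in enumerate(order[:-1]):
        gl += cat_stats[c][0]
        hl += cat_stats[c][1]
        gr, hr = g_tot - gl, h_tot - hl
        gain = 0.5 * (gl ** 2 / (hl + lam) + gr ** 2 / (hr + lam) - parent)
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain

# Hypothetical per-category statistics at one node.
stats = {
    "A": (-4.0, 4.0), "B": (3.0, 3.0), "C": (-2.0, 4.0),
    "D": (5.0, 5.0), "E": (-3.0, 3.0), "F": (2.0, 4.0),
}
left, gain = best_category_partition(stats)   # left == {"A", "C", "E"}
```

With these numbers the search recovers the {A, C, E} vs. {B, D, F} partition after checking 5 candidates instead of the 31 distinct two-way partitions of six categories.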

clf = lgb.LGBMClassifier()
clf.fit(
    X_train, y_train,
    categorical_feature=[0, 2, 5],  # Column indices, passed at fit time
)

# Or with Dataset API
train_data = lgb.Dataset(
    X_train, y_train,
    categorical_feature=['col_name_1', 'col_name_2']
)

Important: Categorical columns must not be one-hot encoded. Encode them as non-negative integers, or use the pandas category dtype.

8. Missing Value Handling

LightGBM handles missing values natively — same approach as XGBoost's sparsity-aware algorithm:

NaN values are ignored when building histograms
After finding the optimal split, the algorithm tries sending NaN values to both child nodes
The direction that achieves lower loss is kept as the default direction for that split
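The default-direction choice can be sketched with the same gain expression used for split finding. This is a simplified illustration: `default_direction` is a hypothetical helper, the gradient and Hessian sums are made up, and the parent term is dropped because it is identical for both options.

```python
def leaf_score(g, h, lam=1.0):
    """Score of a leaf with gradient sum g and Hessian sum h."""
    return g * g / (h + lam)

def default_direction(gl, hl, gr, hr, g_miss, h_miss, lam=1.0):
    """Route the missing-value mass left, then right, and keep the better option."""
    score_left = leaf_score(gl + g_miss, hl + h_miss, lam) + leaf_score(gr, hr, lam)
    score_right = leaf_score(gl, hl, lam) + leaf_score(gr + g_miss, hr + h_miss, lam)
    if score_left >= score_right:
        return "left", score_left
    return "right", score_right

# Non-missing samples: left child has negative gradients, right child positive.
# Missing-value samples with negative gradients resemble the left child...
direction_neg, _ = default_direction(-6.0, 5.0, 4.0, 5.0, g_miss=-3.0, h_miss=2.0)
# ...and with positive gradients they resemble the right child.
direction_pos, _ = default_direction(-6.0, 5.0, 4.0, 5.0, g_miss=3.0, h_miss=2.0)
```

Here `direction_neg` is `"left"` and `direction_pos` is `"right"`: missing values are sent to the child whose samples they most resemble in gradient terms.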

# No imputation needed
clf = lgb.LGBMClassifier()
clf.fit(X_with_nans, y)   # Works directly

# Custom missing value marker
import pandas as pd
X_df = pd.DataFrame(X).replace(-999, float('nan'))
clf.fit(X_df, y)

9. Parallelism Strategies

9.1 Feature Parallel

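On a single machine, LightGBM multithreads histogram construction across features (this is the default tree_learner='serial' mode); the distributed feature-parallel learner (tree_learner='feature') applies the same idea across machines. Each worker processes a different subset of the feature histograms: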

Thread 1: Build histogram for features 0–24
Thread 2: Build histogram for features 25–49
Thread 3: Build histogram for features 50–74
Thread 4: Build histogram for features 75–99

All threads synchronize to find the globally best split. Scales with the number of CPU cores.

9.2 Data Parallel

For distributed training across multiple machines. Each machine holds a partition of the data and builds local histograms, which are merged via all-reduce communication:

# Distributed with dask
import lightgbm as lgb
from lightgbm.dask import DaskLGBMClassifier

clf = DaskLGBMClassifier(n_estimators=500)
clf.fit(dask_X, dask_y)
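Data-parallel training works because histograms are additive: summing the local histograms bin by bin gives exactly the histogram of the full dataset. The sketch below simulates two workers in a single process; `all_reduce_sum` is a stand-in for the real collective communication step, and the data is made up.

```python
def local_histogram(bins, grads, n_bins):
    """Gradient histogram over one worker's partition of the rows."""
    hist = [0.0] * n_bins
    for b, g in zip(bins, grads):
        hist[b] += g
    return hist

def all_reduce_sum(histograms):
    """Element-wise sum across workers (simulated in-process)."""
    return [sum(values) for values in zip(*histograms)]

bins = [0, 1, 2, 1, 0, 2]
grads = [0.5, -1.0, 0.25, 2.0, -0.5, 1.0]

# Each simulated worker holds half of the rows.
worker_1 = local_histogram(bins[:3], grads[:3], 3)
worker_2 = local_histogram(bins[3:], grads[3:], 3)

merged = all_reduce_sum([worker_1, worker_2])
single_machine = local_histogram(bins, grads, 3)
```

Only the histograms (O(B·p) numbers per worker) cross the network, never the rows themselves, which is why the communication cost does not grow with m.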

9.3 Voting Parallel

An optimization of data parallel for very large datasets with many features: each machine votes on the K best local splits; the globally best split is selected from the union of votes. Reduces communication from O(p) to O(K) per round.

9.4 GPU Acceleration

clf = lgb.LGBMClassifier(device='gpu', gpu_platform_id=0, gpu_device_id=0)

LightGBM's GPU implementation builds histograms on the GPU and is typically 3–10x faster than CPU for dense data. The gain is less impressive than XGBoost's gpu_hist for very sparse data.

10. Hyperparameters — Complete Reference

Parameter	Default	Description	Priority
`n_estimators`	100	Number of boosting rounds — use early stopping	High
`learning_rate`	0.1	Shrinkage factor — most important parameter	High
`num_leaves`	31	Primary complexity control — NOT max_depth	High
`max_depth`	-1	Maximum depth — use -1 (unlimited) with num_leaves	Medium
`min_child_samples`	20	Min samples in leaf — key regularizer for noise	High
`min_child_weight`	0.001	Min sum of Hessians in leaf	Medium
`subsample`	1.0	Row sampling per tree (bagging strategy)	Medium
`subsample_freq`	0	Perform bagging every k iterations (0 = no bagging)	Medium
`colsample_bytree`	1.0	Feature sampling fraction per tree	Medium
`reg_alpha`	0.0	L1 regularization on leaf weights	Low
`reg_lambda`	0.0	L2 regularization on leaf weights	Low
`min_split_gain`	0.0	Min gain to create a split (γ equivalent)	Low
`max_bin`	255	Number of histogram bins	Low
`min_data_in_bin`	3	Min samples per bin (affects binning)	Low
`cat_smooth`	10	Smoothing for categorical split gain	Low
`cat_l2`	10	L2 for categorical features	Low
`bagging_fraction`	1.0	Same as subsample — bagging variant	Medium
`feature_fraction`	1.0	Same as colsample_bytree	Medium
`early_stopping_round`	None	Stop if no improvement after k rounds	High
`n_jobs`	-1	Use all CPU cores	-
`device`	'cpu'	'gpu' for GPU training	-
`verbose`	1	-1 (silent), 0 (warnings), 1 (info)	-

The three most impactful params:

1. learning_rate + n_estimators (found via early stopping)
2. num_leaves  (primary model complexity)
3. min_child_samples  (primary overfitting control)

11. Callbacks and Training Loop

LightGBM's callback system allows fine-grained control of the training loop:

import lightgbm as lgb
from lightgbm import early_stopping, log_evaluation, record_evaluation

# Callbacks
results = {}
callbacks = [
    early_stopping(stopping_rounds=50),
    log_evaluation(period=100),
    record_evaluation(results)   # Store eval history in dict
]

clf = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    num_leaves=63,
    min_child_samples=20,
    colsample_bytree=0.8,
    subsample=0.8,
    subsample_freq=1,            # Bagging is only applied when subsample_freq > 0
    reg_alpha=0.1,
    reg_lambda=1.0,
    n_jobs=-1,
    random_state=42
)

clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=callbacks
)

print(f"Best iteration: {clf.best_iteration_}")
print(f"Best score: {clf.best_score_}")

Native Dataset API (Faster for Large Data)

import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train,
                          free_raw_data=False,   # Keep raw data for cross-validation
                          categorical_feature=[2, 5])

val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

params = {
    'objective':        'binary',
    'metric':           'auc',
    'learning_rate':    0.05,
    'num_leaves':       63,
    'min_child_samples':20,
    'colsample_bytree': 0.8,
    'subsample':        0.8,
    'subsample_freq':   1,       # Bagging is only applied when subsample_freq > 0
    'reg_alpha':        0.1,
    'reg_lambda':       1.0,
    'verbose':          -1,
    'seed':             42
}

model = lgb.train(
    params,
    train_data,
    num_boost_round=2000,
    valid_sets=[train_data, val_data],
    valid_names=['train', 'val'],
    callbacks=[
        lgb.early_stopping(50),
        lgb.log_evaluation(100)
    ]
)

preds = model.predict(X_test)  # Returns probabilities directly

12. Feature Importance and SHAP

# Built-in importance (split frequency or gain)
clf.feature_importances_   # Split count by default

# Gain-based (usually more informative)
lgb.plot_importance(clf.booster_, importance_type='gain', max_num_features=20)

# SHAP values
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

# Binary classification: older shap versions return a list [neg_class, pos_class];
# newer versions may return a single array, in which case use shap_values directly
shap.summary_plot(shap_values[1], X_test)
shap.waterfall_plot(explainer(X_test)[0])
shap.dependence_plot('feature_name', shap_values[1], X_test)

LightGBM + SHAP: TreeExplainer is fully supported and efficient. For large datasets, use approximate=True or check_additivity=False for faster computation.

13. The Bias-Variance Profile

The leaf-wise growth strategy shifts the bias-variance behavior:

num_leaves → large:    Lower bias, higher variance (deeper trees)
num_leaves → small:    Higher bias, lower variance (shallow trees)
min_child_samples → large: Higher bias, lower variance (larger leaves required)
learning_rate → small: More shrinkage per tree; needs more trees to reach the same bias, often with slightly lower variance

Overfitting signature in LightGBM:

Train AUC: 0.99, Val AUC: 0.85 → Classic overfit
Fix: Reduce num_leaves, increase min_child_samples, add subsample/colsample

Underfitting signature:

Train AUC: 0.75, Val AUC: 0.74 → Model too simple
Fix: Increase num_leaves, increase n_estimators (or raise learning_rate if the number of rounds is capped), reduce min_child_samples

14. Assumptions

Assumption	Notes
Differentiable loss	Gradient and Hessian required
No feature scaling needed	Tree splits are scale-invariant
IID samples	Standard; concept drift degrades performance
No distributional assumption	Non-parametric — no normality or linearity
No extrapolation	Flat predictions outside training range
GOSS accuracy improves with m	GOSS bound tightens with more data; prefer plain bagging (the default) on small data

15. Advantages

✅ Fastest CPU Training Among Major GBT Libraries

GOSS + EFB + histograms + leaf-wise growth combine to give 5–20x speedup over XGBoost on CPU for large datasets.

✅ Lowest Memory Usage

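Features are stored as small integer bin indices (one byte per value with up to 256 bins) instead of floating-point values plus pre-sorted index blocks, and each node needs only O(B·p) histogram entries. The result is dramatically lower memory use, especially for large datasets.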

✅ Excellent Scalability

Handles hundreds of millions of rows. Distributed training via native MPI, Dask, or Spark integration.

✅ Native Categorical Features

Sorted partition approach handles high-cardinality categoricals without one-hot explosion.

✅ Native Missing Values

Learned default directions — no imputation required.

✅ Monotonic Constraints

clf = lgb.LGBMClassifier(monotone_constraints=[1, 0, -1])

✅ Feature Interaction Constraints

clf = lgb.LGBMClassifier(interaction_constraints=[[0,1],[2,3,4]])

✅ SHAP Integration

Full TreeExplainer support — fast exact SHAP values for all tree predictions.

✅ Strong Default Performance

LightGBM's defaults are well-chosen. A properly tuned LightGBM often beats a heavily tuned XGBoost on large data.

✅ Active Development

Microsoft continues actively developing LightGBM — new features, performance improvements, and GPU enhancements appear regularly.

16. Drawbacks & Limitations

❌ Leaf-Wise Overfitting on Small Datasets

Leaf-wise growth is aggressive — with small datasets (< 10,000 rows), it tends to overfit more easily than level-wise growth. Use min_child_samples aggressively or switch to XGBoost/sklearn GBM.

❌ Sensitive to Hyperparameter Tuning

More so than XGBoost for small data. num_leaves can cause catastrophic overfitting if set too high without min_child_samples compensation.

❌ Less Interpretable Importance than XGBoost

LightGBM's feature importance types (split/gain) are slightly less principled than XGBoost's cover-weighted variants. Use SHAP for production interpretability.

❌ GOSS Can Hurt on Small Data

GOSS sampling discards most small-gradient samples. That is fine for large datasets where these are truly "well-learned," but it can remove important examples in small datasets. On small data, keep the default data_sample_strategy='bagging'.

❌ No Native Second-Order Leaf Values (with GOSS)

When GOSS is active, the Hessian estimates are reweighted — introduces approximation error compared to XGBoost's exact second-order computation on all samples.

❌ GPU Less Effective for Sparse Data

XGBoost's gpu_hist is more optimized for sparse data (learned sparse structures). LightGBM's GPU mainly accelerates dense histogram construction.

17. LightGBM vs. XGBoost vs. CatBoost

Property	LightGBM	XGBoost	CatBoost
Speed (CPU, large)	✅✅ Fastest	✅ Fast	✅ Fast
Speed (GPU)	✅ Fast	✅ Fast (gpu_hist)	✅ Fast
Memory	✅✅ Lowest	Moderate	Moderate
Tree growth	Leaf-wise	Depth-wise (default)	Oblivious (symmetric)
2nd order	✅ Yes (approx GOSS)	✅ Yes (exact)	❌ 1st order
Categorical	✅ Sorted partition	❌ Manual	✅✅ Ordered encoding
Missing values	✅ Native	✅ Native	✅ Native
Small data	❌ Careful tuning	✅ Better	✅ Best
Large data	✅✅ Best	✅ Good	✅ Good
Distributed training	✅ Native MPI/Dask	✅ Dask/Ray	✅ Custom
SHAP support	✅ Very good	✅ Best	✅ Good
Hyperparameter tuning	⚠️ Sensitive	Moderate	✅ Forgiving
Best for	Large tabular data	General purpose	Categorical-heavy

18. Practical Tips & Gotchas

Most Common Mistake: Setting max_depth Instead of num_leaves

# WRONG — max_depth alone doesn't control LightGBM well
clf = lgb.LGBMClassifier(max_depth=6)

# RIGHT — num_leaves is the primary control
clf = lgb.LGBMClassifier(num_leaves=63, min_child_samples=20)

Canonical Setup for Large Datasets

import lightgbm as lgb

clf = lgb.LGBMClassifier(
    n_estimators=10000,          # High — early stopping will cut it
    learning_rate=0.05,
    num_leaves=127,
    min_child_samples=50,        # Critical for noisy/small-ish data
    colsample_bytree=0.7,
    subsample=0.8,
    subsample_freq=1,
    reg_alpha=0.1,
    reg_lambda=1.0,
    n_jobs=-1,
    verbose=-1,
    random_state=42
)

clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100, verbose=True),
        lgb.log_evaluation(period=200)
    ]
)

Hyperparameter Optimization with Optuna

import lightgbm as lgb
import optuna

def objective(trial):
    params = {
        'n_estimators':      2000,
        'learning_rate':     trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'num_leaves':        trial.suggest_int('num_leaves', 20, 300),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'colsample_bytree':  trial.suggest_float('colsample_bytree', 0.4, 1.0),
        'subsample':         trial.suggest_float('subsample', 0.4, 1.0),
        'subsample_freq':    1,   # Without this, subsample has no effect
        'reg_alpha':         trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda':        trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'n_jobs': -1, 'verbose': -1, 'random_state': 42
    }
    clf = lgb.LGBMClassifier(**params)
    clf.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            callbacks=[lgb.early_stopping(50, verbose=False)])
    return clf.best_score_['valid_0']['binary_logloss']

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)

Handle Class Imbalance

# Method 1: is_unbalance (auto-balances by adjusting class weights)
clf = lgb.LGBMClassifier(is_unbalance=True)

# Method 2: scale_pos_weight (manual ratio)
scale = (y_train == 0).sum() / (y_train == 1).sum()
clf = lgb.LGBMClassifier(scale_pos_weight=scale)

# Method 3: class_weight (sklearn-compatible)
clf = lgb.LGBMClassifier(class_weight='balanced')

Handle Categorical Features Properly

import pandas as pd

# Must be dtype 'category' in pandas OR specified explicitly
X_train['city'] = X_train['city'].astype('category')

clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train)   # Detects category dtype automatically

# OR: specify manually at fit time
clf.fit(X_train, y_train, categorical_feature=['city', 'country'])

19. When to Use It

Use LightGBM when:

Dataset is large to very large (> 100k rows — this is where LightGBM shines)
Training speed is important — CPU or cost-constrained environments
Memory is limited
Distributed training is needed (Dask, Spark, MPI)
Categorical features are important (better than manual encoding)
You want the fastest path to a competitive model on large data
Hyperparameter tuning at scale with Optuna/Ray Tune

Consider XGBoost instead when:

Dataset is medium-sized (10k–500k rows) where XGBoost and LightGBM are comparable
You need custom loss functions with full second-order accuracy
SHAP explanations at production scale (XGBoost's TreeExplainer is better integrated)
You're doing ranking (XGBoost's LambdaMART is more mature)

Consider CatBoost instead when:

Categorical features dominate and you want the best categorical handling
Minimal tuning is required (CatBoost defaults are strongest out-of-box)
Dataset is small to medium where CatBoost's ordered boosting helps

Summary

┌─────────────────────────────────────────────────────────────────────┐
│                   LIGHTGBM AT A GLANCE                              │
├─────────────────────────────────────────────────────────────────────┤
│  CORE SPEED   Histogram bins + leaf-wise + GOSS + EFB              │
│  KEY PARAM    num_leaves (not max_depth!) + min_child_samples       │
│  GROWTH       Leaf-wise: always split highest-gain leaf             │
│  GOSS         Keep all large-gradient + subsample small-gradient    │
│  EFB          Bundle mutually exclusive features → reduce p         │
│  CATEGORICAL  Sorted partition — no one-hot needed                  │
│  MISSING      Learned default direction per split                   │
│  STRENGTH     Fastest, lowest memory, best for large data           │
│  WEAKNESS     Overfit on small data, GOSS error on small samples    │
│  BEST FOR     Large tabular datasets, speed-constrained training    │
└─────────────────────────────────────────────────────────────────────┘

LightGBM is what happens when you ask "which parts of gradient boosting are truly necessary?" The answer: not all samples (GOSS), not all feature values (histograms), not all features at once (EFB), and not all leaves at each level (leaf-wise). Each optimization attacks a different bottleneck, and together they produce an algorithm that runs on a laptop what previously required a cluster. The insight that small-gradient samples contribute little to split quality is not just empirically useful — it is a deep observation about the structure of gradient boosting's information content. LightGBM doesn't cut corners; it identifies which corners weren't load-bearing in the first place.