CatBoost

1. What Is CatBoost?

CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex and released in 2017. It is the only major GBT library that treats categorical features as first-class citizens — no preprocessing required, no one-hot encoding, no target leakage.

CatBoost's two central mathematical contributions are:

  1. Ordered Target Encoding — a leak-free way to encode categorical features using an online statistics scheme that prevents the classic target leakage problem

  2. Ordered Boosting — a novel boosting procedure that eliminates "prediction shift," a subtle bias in standard gradient boosting

These are built on a third structural decision: using oblivious Decision Trees (symmetric trees where every node at the same depth uses the same split) as base learners — enabling extremely fast prediction and natural regularization.

2. The Central Problem CatBoost Solves

Nearly every real-world tabular dataset contains categorical features:

Standard approaches:

Method Problem
One-hot encoding Explodes feature space for high-cardinality categories (k → k binary features)
Label encoding Imposes arbitrary ordinal structure (A=1, B=2, C=3 is meaningless)
Target encoding Uses mean(y) per category — leaks target into features during training
Frequency encoding Loses relationship between category and target
Embeddings Requires additional neural network; expensive

The target leakage problem is subtle and important. Suppose you encode a category by its training set mean target:

Category "New York" → enc = mean(y | category = "New York") = 0.73

Now when the model trains on a sample from New York, it sees enc=0.73 which was computed using that very sample's y value (and others). The feature has already "seen" the target — the model is using the answer to predict the answer. This leads to overfit encodings and inflated training Accuracy.

CatBoost solves this completely.


3. Ordered Target Encoding — The Core Innovation

3.1 Why Standard Target Encoding Leaks

For a training sample (xᵢ, yᵢ) with a categorical feature c:

Standard target encoding:

enc(c) = (Σⱼ 𝟙[cⱼ=c] · yⱼ) / (Σⱼ 𝟙[cⱼ=c])

This sum includes i itself — the encoding contains information about yᵢ. When the model later tries to predict yᵢ using enc(c), it's partially using yᵢ to predict yᵢ — target leakage.

The effect: Training loss appears lower than it is. The model appears to generalize better than it does. Test performance is disappointing relative to training.


3.2 CatBoost's Ordered Statistics

CatBoost's solution: only use samples that appeared "before" sample i when computing the encoding for sample i.

enc_i(c) = (count_before_i(c) · mean_target_before_i(c) + a · p̄) / (count_before_i(c) + a)

Where:

With smoothing:

enc_i(c) = (n_c_before · ȳ_c_before + a · p̄) / (n_c_before + a)

For the first few samples of category c (n_c_before is small), the encoding shrinks toward the global mean p̄. As more samples of category c are seen, the encoding approaches the empirical mean for that category.


3.3 The Permutation Trick

CatBoost processes training samples in a random permutation σ. Sample i's encoding uses only the subset of samples {σ(1), σ(2), ..., σ(i-1)} that appear before it in the permutation.

Random permutation σ = [sample 47, sample 3, sample 91, sample 12, ...]

For sample 91 (3rd in permutation):
    enc(category) is computed from samples 47 and 3 only
    → sample 91's target is NOT in its own encoding
    → No target leakage

Since the permutation is random, no sample has a systematic advantage or disadvantage from appearing early. Early samples have few "predecessors" and noisy encodings (shrunk toward the prior); late samples have more predecessors and more accurate encodings. The model must learn features that generalize across encoding quality — a natural regularizer.

CatBoost uses multiple permutations (one per tree, by default) to reduce variance:

CatBoostClassifier(leaf_estimation_iterations=1)   # One permutation
# Each tree uses a different random permutation

4. Ordered Boosting

4.1 Prediction Shift in Standard GBM

Standard gradient boosting has a subtle bias called prediction shift. In round t, the pseudo-residuals are:

rᵢₜ = −∂L(yᵢ, Fₜ₋₁(xᵢ)) / ∂Fₜ₋₁(xᵢ)

The model Fₜ₋₁ was trained on the same samples that include xᵢ. This means the residuals rᵢₜ are computed using a model that has already "seen" xᵢ during training — Fₜ₋₁ is biased toward fitting xᵢ well, making rᵢₜ underestimate the true generalization residual.

This is prediction shift: the pseudo-residuals have conditional bias because F_{t-1} overfits the training data.

Consequence: The boosting algorithm gets over-optimistic pseudo-residuals → trees focus too much on already-well-fitted examples → subtle overfitting.


4.2 CatBoost's Ordered Mode

CatBoost's ordered boosting maintains a sequence of models, one for each possible "dataset prefix":

For each training sample i (in permutation order):
    Build Mᵢ = model trained on {samples before i in permutation}
    Compute residual rᵢ = −∂L(yᵢ, Mᵢ(xᵢ)) / ∂Mᵢ(xᵢ)
    → rᵢ is computed by a model that has NOT seen xᵢ → unbiased residual

The new tree is fit to the unbiased residuals {rᵢ}. This prevents prediction shift.

In practice: CatBoost uses a memory-efficient implementation that doesn't literally maintain m separate models. Instead, it maintains a tree structure that can efficiently retrieve the model appropriate for each sample's ordered context.

The result: CatBoost's pseudo-residuals are less biased than standard GBM's → better generalization, especially on small-to-medium datasets where overfitting to training residuals is most problematic.


5. Oblivious Decision Trees

5.1 What Are Oblivious Trees?

A standard decision tree has different splits at each node — the split at the root might use feature A, the left child uses feature B, the right child uses feature C.

An oblivious tree uses the same split at every node at the same depth level:

Standard tree:           Oblivious tree:
         f₁ < 0.5               f₁ < 0.5
        /         \             /         \
   f₂ < 1.0    f₃ < 0.3    f₂ < 1.0    f₂ < 1.0   ← same split at depth 2
   /     \      /     \    /     \      /     \
  L₀    L₁   L₂    L₃  L₀    L₁   L₂    L₃

An oblivious tree of depth d has exactly 2^d leaves. The path from root to any leaf can be described as a binary vector indicating which side of each threshold the sample falls on at each depth level.


5.2 Why They Work Well

Regularization: Oblivious trees are significantly simpler than standard trees of the same depth. By forcing the same split at each level, the model cannot overfit the specific path to any training sample. This is implicit regularization.

Bias-variance: Higher bias than standard trees at the same depth; lower variance. More trees are needed to compensate, but each tree overfits less.

Feature selection across splits: The oblivious constraint means that the chosen split feature at each depth is the one that, applied to all nodes at that depth, produces the greatest gain. This selects features that are globally useful at that abstraction level, not just locally optimal.


5.3 Fast Prediction with Oblivious Trees

Each leaf in an oblivious tree is indexed by a binary vector of depth-d decisions. Prediction is a table lookup:

depth = 4 → each tree is a lookup table of 2^4 = 16 leaf values
For sample x:
    key = [x[f₁] < t₁, x[f₂] < t₂, x[f₃] < t₃, x[f₄] < t₄]  (bit vector)
    value = leaf_table[key]   (O(1) lookup!)

This makes CatBoost's prediction extremely fast — significantly faster than standard asymmetric trees, especially in latency-sensitive production environments.

For GPU inference, the bitwise operations and table lookups are massively parallelizable. CatBoost's GPU prediction speed is among the fastest of any ML algorithm.


6. Categorical Feature Combinations

CatBoost doesn't just encode individual categorical features — it automatically creates feature combinations:

Individual features:
    city = "Paris"
    device = "mobile"

Combination:
    (city, device) = "Paris + mobile"  → target-encoded as a joint category

Combinations can capture complex interactions that individual features miss:

CatBoost creates combinations greedily during tree construction: at each node, it considers adding a categorical feature combined with the split path from the root. The number of combinations is controlled by max_ctr_complexity:

CatBoostClassifier(max_ctr_complexity=2)   # Allow pairwise combinations
CatBoostClassifier(max_ctr_complexity=4)   # Allow 4-way combinations (slow but powerful)

7. Numerical Feature Handling

CatBoost handles numerical features with histogram-based split finding — similar to LightGBM and XGBoost's hist mode:

CatBoost uses random permutations of the training data (same permutations used for ordered statistics) to add variety to the binning process across trees — a mild form of stochastic boosting.

For numerical features, CatBoost is competitive with but not always faster than LightGBM — the oblivious constraint limits the gain achievable per tree, requiring more trees to compensate.


8. Embeddings and Text Features

CatBoost is the only major GBT library with built-in text feature processing and embedding support:

Text Features

clf = CatBoostClassifier(
    text_features=['review_text', 'product_description']
)
clf.fit(X_train, y_train)

Internally, CatBoost tokenizes the text and creates count-based features (bag of words or n-grams), then applies its ordered encoding to the resulting token counts.

Embedding Features

clf = CatBoostClassifier(
    embedding_features=['user_embedding', 'item_embedding']
)

Accepts pre-computed embedding vectors as features, with special distance-based split finding that's aware of the embedding space structure.

These features make CatBoost particularly useful for e-commerce, search ranking, and recommendation systems where text and embedding features coexist with tabular features.


9. Missing Value Handling

CatBoost handles missing values through the oblivious tree structure:

CatBoostClassifier(nan_mode='Min')   # Send NaN to minimum value side
CatBoostClassifier(nan_mode='Max')   # Send NaN to maximum value side
CatBoostClassifier(nan_mode='Forbidden')   # Raise error on NaN

No imputation required for either numeric or categorical NaN values.


10. GPU Training Architecture

CatBoost has one of the most optimized GPU training architectures among GBT libraries:

Oblivious trees + GPU = natural fit:

CatBoost GPU advantages:

from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    task_type='GPU',
    devices='0',   # GPU device ID, or '0:1' for two GPUs
    iterations=1000,
    depth=6,
    learning_rate=0.05
)

CatBoost's GPU is particularly fast for categorical-heavy datasets — the ordered statistics computation benefits enormously from GPU parallelism.


11. Hyperparameters — Complete Reference

Parameter Default Description Priority
iterations 1000 Number of trees (n_estimators) High
learning_rate auto Shrinkage — auto-tuned based on iterations High
depth 6 Tree depth — oblivious trees High
l2_leaf_reg 3.0 L2 regularization on leaf values High
bagging_temperature 1.0 Bayesian bootstrap intensity (0=no sampling, higher=more) Medium
subsample N/A Use with bootstrap_type='Bernoulli' Medium
random_strength 1.0 Randomness for split scoring — adds noise to splits Medium
border_count 254 Histogram bins for numerical features Low
cat_features None List of categorical feature names/indices Required
one_hot_max_size 2 One-hot encode categories with ≤ this many values Medium
max_ctr_complexity 4 Max categorical feature combinations Medium
leaf_estimation_method 'Newton' 'Newton' (2nd order) or 'Gradient' (1st order) Low
leaf_estimation_iterations 1 Steps for leaf value computation Low
boosting_type 'Ordered' 'Ordered' (default, better) or 'Plain' (faster) High
eval_metric auto Metric for early stopping Medium
early_stopping_rounds None Patience for early stopping High
task_type 'CPU' 'CPU' or 'GPU' -
verbose 1 Frequency of evaluation output -
random_seed 0 Random seed -
nan_mode 'Min' How to handle missing values Low
text_features None Column names for text features Optional
embedding_features None Column names for embedding features Optional
monotone_constraints None Dict of feature → Optional
feature_weights None Per-feature sampling weights Low
penalties_coefficient 1.0 Scale for complexity penalties Low

Priority parameters:

1. depth (3–10 — deeper than LightGBM due to oblivious constraint)
2. learning_rate + iterations (early stopping recommended)
3. l2_leaf_reg (main regularizer — higher = smoother)
4. random_strength (noise on splits — prevents overfitting)
5. bagging_temperature (bootstrap intensity)

12. Feature Importance and SHAP

Built-in Importance Types

# PredictionValuesChange — how much predictions change when feature is varied
clf.get_feature_importance(type='PredictionValuesChange')

# LossFunctionChange — how much loss increases if feature is removed
clf.get_feature_importance(type='LossFunctionChange')   # Most reliable

# ShapValues — SHAP for each sample
clf.get_feature_importance(type='ShapValues', data=test_pool)

# Interaction SHAP
clf.get_feature_importance(type='ShapInteractionValues', data=test_pool)

# Internal importance
clf.get_feature_importance(type='InternalFeatureImportance')

CatBoost + SHAP

from catboost import Pool
import shap

test_pool = Pool(X_test, cat_features=['city', 'device'])
shap_values = clf.get_feature_importance(
    type='ShapValues',
    data=test_pool
)

# shap_values shape: (n_samples, n_features + 1) — last column is base value
shap.summary_plot(shap_values[:, :-1], X_test)
shap.waterfall_plot(shap.Explanation(
    values=shap_values[0, :-1],
    base_values=shap_values[0, -1],
    data=X_test.iloc[0]
))

CatBoost computes SHAP values natively via its own efficient implementation — no separate shap library call needed (though shap.TreeExplainer also works).


13. Model Analysis and Visualization Tools

CatBoost provides richer built-in analysis tools than XGBoost or LightGBM:

from catboost import CatBoostClassifier
from catboost.utils import get_roc_curve, get_confusion_matrix
import catboost

# Training with plots (interactive in Jupyter)
clf = CatBoostClassifier(plot=True)
clf.fit(train_pool, eval_set=val_pool)

# ROC curve
curve = get_roc_curve(clf, val_pool)

# Prediction distribution
clf.plot_predictions(val_pool)

# Feature statistics (distributions per class)
clf.plot_features_distribution(val_pool)

# SHAP-based feature contribution plot
clf.plot_feature_importance(val_pool, plot=True)

# Select threshold for binary [[Classification]]
catboost.select_threshold(clf, data=val_pool, FPR=0.05)

These tools make CatBoost particularly useful for exploratory analysis and model auditing in business contexts.


14. The Bias-Variance Profile

CatBoost's oblivious trees introduce a different bias-variance tradeoff than XGBoost/LightGBM:

Oblivious tree at depth d:   Higher bias, lower variance than standard tree at depth d
                              → needs more trees to achieve same bias reduction
                              → but each tree overfits less → better small-data behavior

Ordered boosting further reduces variance by using unbiased pseudo-residuals:

Ordered boosting advantage:   Larger at small n, diminishes as n → ∞
                               At n > 100k, 'Plain' mode is often equally good

This is why CatBoost is often recommended for small-to-medium datasets — where prediction shift and overfitting are most problematic. The ordered mechanisms provide more conservative, robust learning.

Small data (< 10k):   CatBoost often best — ordered boosting helps most
Medium data (10k-1M): Comparable to XGBoost/LightGBM
Large data (> 1M):    LightGBM may be faster, CatBoost competitive

15. Assumptions

Assumption Notes
IID samples Ordered boosting requires random ordering — assumes i.i.d.
No feature scaling required Oblivious trees are scale-invariant
No distributional assumption Non-parametric
No extrapolation Tree-based — flat predictions outside training range
Sufficient cardinality Ordered encoding benefits categories with ≥ a few occurrences
Clean categories Typos or inconsistent naming in categorical columns will create spurious categories

16. Advantages

✅ Best Categorical Feature Handling

No preprocessing required. CatBoost handles high-cardinality categoricals, missing categoricals, and categorical combinations natively — and usually produces better results than manual encoding.

✅ No Target Leakage in Categorical Encoding

Ordered target encoding prevents the subtle leakage that plagues standard target encoding — particularly important for small datasets where overfitting is severe.

✅ Strong Default Performance

CatBoost's defaults are the most carefully chosen of the three major GBT libraries. auto learning rate, depth=6, l2_leaf_reg=3 — these often produce competitive results without tuning.

✅ Fast Prediction

Oblivious trees implement prediction as bit-vector table lookups — O(depth) per sample with extremely low constant factors. CatBoost is often the fastest at inference time.

✅ GPU Training

One of the fastest GPU GBT implementations, especially for categorical-heavy data.

✅ Rich Interpretability Tools

Built-in SHAP, interaction SHAP, LossFunctionChange importance, prediction change tools — more comprehensive than XGBoost/LightGBM.

✅ Text and Embedding Features

Unique among GBT libraries — handles text and pre-computed embeddings natively.

✅ Ordered Boosting for Small Data

Unbiased pseudo-residuals → better generalization on small datasets where prediction shift is a real problem.

✅ Minimal Preprocessing

Handles categoricals, NaN, text, and embeddings natively — reduces data preprocessing burden substantially.


17. Drawbacks & Limitations

❌ Slower on Pure Numerical Datasets

When data has no categorical features, CatBoost's oblivious tree structure provides no advantage and may be slower than LightGBM's leaf-wise histograms. The ordered boosting overhead also slows training.

❌ Memory Intensive for Ordered Boosting

Maintaining ordered statistics requires storing cumulative statistics per permutation — memory increases proportionally with the number of permutations and categorical features.

boosting_type='Ordered' Slow on Large Data

Ordered boosting's O(n log n) complexity is fine for small-medium data but makes CatBoost slower than LightGBM for datasets > 1M rows. Use boosting_type='Plain' for large data.

❌ Depth Limit for Oblivious Trees

A depth-10 oblivious tree has only the same expressive power as a depth-10 standard tree along a single feature path. CatBoost needs higher depth or more trees to match the same capacity as LightGBM's leaf-wise asymmetric trees.

❌ Limited Ecosystem Compared to XGBoost

XGBoost has deeper integration with Dask, Spark, ONNX, and cloud ML frameworks. CatBoost's ecosystem is narrower.

❌ Fewer Custom Loss Options

CatBoost supports custom objectives, but the interface is less flexible than XGBoost's grad/hess system for complex custom losses.


18. CatBoost vs. LightGBM vs. XGBoost

Property CatBoost [[LightGBM]] XGBoost
Categorical features ✅✅ Best (native) ✅ Good (sorted part.) ❌ Manual encoding
Target leakage ✅ None (ordered) ⚠️ Standard encoding ⚠️ Standard encoding
Training speed (CPU) ⚠️ Slower (ordered) ✅✅ Fastest ✅ Fast
Training speed (GPU) ✅ Very fast ✅ Fast ✅ Fast
Prediction speed ✅✅ Fastest (oblivious) ✅ Fast ✅ Fast
Small data ✅✅ Best (ordered) ⚠️ Careful tuning ✅ Good
Large data ✅ Good ✅✅ Best ✅ Good
Memory Moderate-High ✅✅ Lowest Moderate
Default performance ✅✅ Best defaults ✅ Good ✅ Good
Hyperparameter tuning ✅ Forgiving ⚠️ Sensitive Moderate
Text/embedding features ✅✅ Unique
SHAP support ✅ Native + shap ✅ shap ✅ Best (shap)
Interpretability tools ✅✅ Rich built-in ✅ Basic ✅ Good
Ecosystem integration ⚠️ Narrower ✅ Good ✅✅ Widest
Best for Categorical data, small-medium Large data General purpose

19. Practical Tips & Gotchas

Always Specify cat_features

from catboost import CatBoostClassifier, Pool

# Identify categorical columns
cat_features = [col for col in X_train.columns
                if X_train[col].dtype == 'object' or X_train[col].dtype.name == 'category']

# Using Pool (recommended for large datasets)
train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool   = Pool(X_val,   y_val,   cat_features=cat_features)

clf = CatBoostClassifier(
    iterations=1000,
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=3.0,
    early_stopping_rounds=50,
    eval_metric='AUC',
    verbose=100,
    random_seed=42
)
clf.fit(train_pool, eval_set=val_pool)

Use 'Plain' Boosting for Large Datasets

# For n > 100k rows, ordered boosting overhead isn't worth it
clf = CatBoostClassifier(
    boosting_type='Plain',   # Much faster; similar [[Accuracy]] at large n
    iterations=2000,
    learning_rate=0.05,
    depth=8,
    early_stopping_rounds=100,
    verbose=200
)

Canonical Hyperparameter Tuning

import optuna
from catboost import CatBoostClassifier, Pool

train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool   = Pool(X_val,   y_val,   cat_features=cat_features)

def objective(trial):
    params = {
        'iterations':         500,
        'depth':              trial.suggest_int('depth', 4, 10),
        'learning_rate':      trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'l2_leaf_reg':        trial.suggest_float('l2_leaf_reg', 1.0, 20.0, log=True),
        'random_strength':    trial.suggest_float('random_strength', 0.1, 10.0, log=True),
        'bagging_temperature':trial.suggest_float('bagging_temperature', 0.0, 1.0),
        'border_count':       trial.suggest_categorical('border_count', [32, 64, 128, 254]),
        'boosting_type':      'Plain',
        'eval_metric':        'AUC',
        'verbose':            0,
        'random_seed':        42
    }
    clf = CatBoostClassifier(**params)
    clf.fit(train_pool, eval_set=val_pool, early_stopping_rounds=30)
    return clf.best_score_['validation']['AUC']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

Custom Loss Function

class CustomLoss:
    def calc_ders_range(self, approxes, targets, weights):
        """Return list of (gradient, hessian) per sample"""
        assert len(approxes) == len(targets)
        result = []
        for approx, target in zip(approxes, targets):
            p = 1.0 / (1.0 + np.exp(-approx))
            result.append((target - p, p * (1 - p)))  # (grad, hess)
        return result

clf = CatBoostClassifier(loss_function=CustomLoss())

Model Persistence and Loading

# Save model
clf.save_model('catboost_model.cbm')

# Load model
from catboost import CatBoostClassifier
clf_loaded = CatBoostClassifier()
clf_loaded.load_model('catboost_model.cbm')

# Export to ONNX
clf.save_model('model.onnx', format='onnx',
               export_parameters={'onnx_domain': 'ai.catboost'})

Feature Importance Pipeline

# LossFunctionChange — most reliable importance type
importance = clf.get_feature_importance(
    type='LossFunctionChange',
    data=val_pool
)

import pandas as pd
feat_imp = pd.Series(importance, index=feature_names).sort_values(ascending=False)
print(feat_imp.head(20))

# SHAP
shap_vals = clf.get_feature_importance(type='ShapValues', data=val_pool)
import shap
shap.summary_plot(shap_vals[:, :-1], X_val, feature_names=feature_names)

20. When to Use It

Use CatBoost when:

Use [[LightGBM]] instead when:

Use XGBoost instead when:


Summary

┌─────────────────────────────────────────────────────────────────────┐
│                   CATBOOST AT A GLANCE                              │
├─────────────────────────────────────────────────────────────────────┤
│  CORE IDEA    Ordered encoding + ordered boosting + oblivious trees │
│  CATEGORICAL  Ordered target encoding — no preprocessing, no leakage│
│  TREE TYPE    Oblivious (symmetric) — same split at each depth level│
│  PREDICTION   Bit-vector table lookup → fastest inference          │
│  ORDERING     Random permutation per tree → unbiased pseudo-residuals│
│  KEY PARAMS   depth, l2_leaf_reg, random_strength, iterations       │
│  STRENGTH     Categoricals, small data, defaults, fast prediction   │
│  WEAKNESS     Slow on large pure-numerical data (use Plain mode)    │
│  vs LGB/XGB   Best for cat features; LGB wins on large numeric data │
│  BEST FOR     Categorical-heavy data, small-medium, fast inference  │
└─────────────────────────────────────────────────────────────────────┘

CatBoost asked the question that [[XGBoost]] and [[LightGBM]] implicitly avoided: what about the data that every real-world dataset actually contains? Categorical features are not an edge case — they are the norm in e-commerce, fraud detection, medical records, and search ranking. The industry response to this was years of careful feature engineering: mean encoding, frequency encoding, one-hot explosions, and target encoding with cross-validation tricks. CatBoost replaced all of this with a principled statistical solution — ordered statistics that cannot leak, evaluated on permutations that cannot overfit, in trees that cannot meander. The result is an algorithm that requires almost no preprocessing for the most common type of real-world tabular data. That is not a minor contribution; it is a fundamental simplification of how we use gradient boosting in practice.

Powered by Forestry.md