CatBoost
1. What Is CatBoost?
CatBoost (Categorical Boosting) is a gradient boosting library developed by Yandex and released in 2017. It is the only major GBT library that treats categorical features as first-class citizens — no preprocessing required, no one-hot encoding, no target leakage.
CatBoost's two central mathematical contributions are:
-
Ordered Target Encoding — a leak-free way to encode categorical features using an online statistics scheme that prevents the classic target leakage problem
-
Ordered Boosting — a novel boosting procedure that eliminates "prediction shift," a subtle bias in standard gradient boosting
These are built on a third structural decision: using oblivious Decision Trees (symmetric trees where every node at the same depth uses the same split) as base learners — enabling extremely fast prediction and natural regularization.
2. The Central Problem CatBoost Solves
Nearly every real-world tabular dataset contains categorical features:
- User ID, country, city, device type, product category
- Medical diagnosis codes, drug names, hospital IDs
- Customer segment, industry, job title
Standard approaches:
| Method | Problem |
|---|---|
| One-hot encoding | Explodes feature space for high-cardinality categories (k → k binary features) |
| Label encoding | Imposes arbitrary ordinal structure (A=1, B=2, C=3 is meaningless) |
| Target encoding | Uses mean(y) per category — leaks target into features during training |
| Frequency encoding | Loses relationship between category and target |
| Embeddings | Requires additional neural network; expensive |
The target leakage problem is subtle and important. Suppose you encode a category by its training set mean target:
Category "New York" → enc = mean(y | category = "New York") = 0.73
Now when the model trains on a sample from New York, it sees enc=0.73 which was computed using that very sample's y value (and others). The feature has already "seen" the target — the model is using the answer to predict the answer. This leads to overfit encodings and inflated training Accuracy.
CatBoost solves this completely.
3. Ordered Target Encoding — The Core Innovation
3.1 Why Standard Target Encoding Leaks
For a training sample (xᵢ, yᵢ) with a categorical feature c:
Standard target encoding:
enc(c) = (Σⱼ 𝟙[cⱼ=c] · yⱼ) / (Σⱼ 𝟙[cⱼ=c])
This sum includes i itself — the encoding contains information about yᵢ. When the model later tries to predict yᵢ using enc(c), it's partially using yᵢ to predict yᵢ — target leakage.
The effect: Training loss appears lower than it is. The model appears to generalize better than it does. Test performance is disappointing relative to training.
3.2 CatBoost's Ordered Statistics
CatBoost's solution: only use samples that appeared "before" sample i when computing the encoding for sample i.
enc_i(c) = (count_before_i(c) · mean_target_before_i(c) + a · p̄) / (count_before_i(c) + a)
Where:
count_before_i(c)= number of samples with category c that appear before i in the permutationmean_target_before_i(c)= mean target of those samplesa= smoothing parameter (prevents zero-sample estimates)p̄= global mean target (prior)
With smoothing:
enc_i(c) = (n_c_before · ȳ_c_before + a · p̄) / (n_c_before + a)
For the first few samples of category c (n_c_before is small), the encoding shrinks toward the global mean p̄. As more samples of category c are seen, the encoding approaches the empirical mean for that category.
3.3 The Permutation Trick
CatBoost processes training samples in a random permutation σ. Sample i's encoding uses only the subset of samples {σ(1), σ(2), ..., σ(i-1)} that appear before it in the permutation.
Random permutation σ = [sample 47, sample 3, sample 91, sample 12, ...]
For sample 91 (3rd in permutation):
enc(category) is computed from samples 47 and 3 only
→ sample 91's target is NOT in its own encoding
→ No target leakage
Since the permutation is random, no sample has a systematic advantage or disadvantage from appearing early. Early samples have few "predecessors" and noisy encodings (shrunk toward the prior); late samples have more predecessors and more accurate encodings. The model must learn features that generalize across encoding quality — a natural regularizer.
CatBoost uses multiple permutations (one per tree, by default) to reduce variance:
CatBoostClassifier(leaf_estimation_iterations=1) # One permutation
# Each tree uses a different random permutation
4. Ordered Boosting
4.1 Prediction Shift in Standard GBM
Standard gradient boosting has a subtle bias called prediction shift. In round t, the pseudo-residuals are:
rᵢₜ = −∂L(yᵢ, Fₜ₋₁(xᵢ)) / ∂Fₜ₋₁(xᵢ)
The model Fₜ₋₁ was trained on the same samples that include xᵢ. This means the residuals rᵢₜ are computed using a model that has already "seen" xᵢ during training — Fₜ₋₁ is biased toward fitting xᵢ well, making rᵢₜ underestimate the true generalization residual.
This is prediction shift: the pseudo-residuals have conditional bias because F_{t-1} overfits the training data.
Consequence: The boosting algorithm gets over-optimistic pseudo-residuals → trees focus too much on already-well-fitted examples → subtle overfitting.
4.2 CatBoost's Ordered Mode
CatBoost's ordered boosting maintains a sequence of models, one for each possible "dataset prefix":
For each training sample i (in permutation order):
Build Mᵢ = model trained on {samples before i in permutation}
Compute residual rᵢ = −∂L(yᵢ, Mᵢ(xᵢ)) / ∂Mᵢ(xᵢ)
→ rᵢ is computed by a model that has NOT seen xᵢ → unbiased residual
The new tree is fit to the unbiased residuals {rᵢ}. This prevents prediction shift.
In practice: CatBoost uses a memory-efficient implementation that doesn't literally maintain m separate models. Instead, it maintains a tree structure that can efficiently retrieve the model appropriate for each sample's ordered context.
The result: CatBoost's pseudo-residuals are less biased than standard GBM's → better generalization, especially on small-to-medium datasets where overfitting to training residuals is most problematic.
5. Oblivious Decision Trees
5.1 What Are Oblivious Trees?
A standard decision tree has different splits at each node — the split at the root might use feature A, the left child uses feature B, the right child uses feature C.
An oblivious tree uses the same split at every node at the same depth level:
Standard tree: Oblivious tree:
f₁ < 0.5 f₁ < 0.5
/ \ / \
f₂ < 1.0 f₃ < 0.3 f₂ < 1.0 f₂ < 1.0 ← same split at depth 2
/ \ / \ / \ / \
L₀ L₁ L₂ L₃ L₀ L₁ L₂ L₃
An oblivious tree of depth d has exactly 2^d leaves. The path from root to any leaf can be described as a binary vector indicating which side of each threshold the sample falls on at each depth level.
5.2 Why They Work Well
Regularization: Oblivious trees are significantly simpler than standard trees of the same depth. By forcing the same split at each level, the model cannot overfit the specific path to any training sample. This is implicit regularization.
Bias-variance: Higher bias than standard trees at the same depth; lower variance. More trees are needed to compensate, but each tree overfits less.
Feature selection across splits: The oblivious constraint means that the chosen split feature at each depth is the one that, applied to all nodes at that depth, produces the greatest gain. This selects features that are globally useful at that abstraction level, not just locally optimal.
5.3 Fast Prediction with Oblivious Trees
Each leaf in an oblivious tree is indexed by a binary vector of depth-d decisions. Prediction is a table lookup:
depth = 4 → each tree is a lookup table of 2^4 = 16 leaf values
For sample x:
key = [x[f₁] < t₁, x[f₂] < t₂, x[f₃] < t₃, x[f₄] < t₄] (bit vector)
value = leaf_table[key] (O(1) lookup!)
This makes CatBoost's prediction extremely fast — significantly faster than standard asymmetric trees, especially in latency-sensitive production environments.
For GPU inference, the bitwise operations and table lookups are massively parallelizable. CatBoost's GPU prediction speed is among the fastest of any ML algorithm.
6. Categorical Feature Combinations
CatBoost doesn't just encode individual categorical features — it automatically creates feature combinations:
Individual features:
city = "Paris"
device = "mobile"
Combination:
(city, device) = "Paris + mobile" → target-encoded as a joint category
Combinations can capture complex interactions that individual features miss:
- "Paris + mobile" users may have very different behavior than either "Paris users" or "mobile users" alone
- The combination exposes this interaction to the model
CatBoost creates combinations greedily during tree construction: at each node, it considers adding a categorical feature combined with the split path from the root. The number of combinations is controlled by max_ctr_complexity:
CatBoostClassifier(max_ctr_complexity=2) # Allow pairwise combinations
CatBoostClassifier(max_ctr_complexity=4) # Allow 4-way combinations (slow but powerful)
7. Numerical Feature Handling
CatBoost handles numerical features with histogram-based split finding — similar to LightGBM and XGBoost's hist mode:
- Features are binned into discrete buckets
- Splits are evaluated over bin boundaries
- Oblivious tree growth evaluates the same split across all nodes at each depth
CatBoost uses random permutations of the training data (same permutations used for ordered statistics) to add variety to the binning process across trees — a mild form of stochastic boosting.
For numerical features, CatBoost is competitive with but not always faster than LightGBM — the oblivious constraint limits the gain achievable per tree, requiring more trees to compensate.
8. Embeddings and Text Features
CatBoost is the only major GBT library with built-in text feature processing and embedding support:
Text Features
clf = CatBoostClassifier(
text_features=['review_text', 'product_description']
)
clf.fit(X_train, y_train)
Internally, CatBoost tokenizes the text and creates count-based features (bag of words or n-grams), then applies its ordered encoding to the resulting token counts.
Embedding Features
clf = CatBoostClassifier(
embedding_features=['user_embedding', 'item_embedding']
)
Accepts pre-computed embedding vectors as features, with special distance-based split finding that's aware of the embedding space structure.
These features make CatBoost particularly useful for e-commerce, search ranking, and recommendation systems where text and embedding features coexist with tabular features.
9. Missing Value Handling
CatBoost handles missing values through the oblivious tree structure:
- Numerical features: Missing values can be directed to the "min" or "max" side of the split, or use a learned direction
- Categorical features: Missing values are treated as a separate category token and encoded via the same ordered statistics
CatBoostClassifier(nan_mode='Min') # Send NaN to minimum value side
CatBoostClassifier(nan_mode='Max') # Send NaN to maximum value side
CatBoostClassifier(nan_mode='Forbidden') # Raise error on NaN
No imputation required for either numeric or categorical NaN values.
10. GPU Training Architecture
CatBoost has one of the most optimized GPU training architectures among GBT libraries:
Oblivious trees + GPU = natural fit:
- An oblivious tree at depth d evaluates d splits simultaneously, one per depth level
- All depth-level splits can be computed in parallel on GPU
- Each sample needs only d comparisons, making vectorization trivial
CatBoost GPU advantages:
- The oblivious structure allows all 2^d leaf values to be computed simultaneously
- Ordered statistics (CTR) are computed on GPU with a dedicated CUDA kernel
- Feature combination scoring runs in parallel across all combination candidates
from catboost import CatBoostClassifier
clf = CatBoostClassifier(
task_type='GPU',
devices='0', # GPU device ID, or '0:1' for two GPUs
iterations=1000,
depth=6,
learning_rate=0.05
)
CatBoost's GPU is particularly fast for categorical-heavy datasets — the ordered statistics computation benefits enormously from GPU parallelism.
11. Hyperparameters — Complete Reference
| Parameter | Default | Description | Priority |
|---|---|---|---|
iterations |
1000 | Number of trees (n_estimators) | High |
learning_rate |
auto | Shrinkage — auto-tuned based on iterations | High |
depth |
6 | Tree depth — oblivious trees | High |
l2_leaf_reg |
3.0 | L2 regularization on leaf values | High |
bagging_temperature |
1.0 | Bayesian bootstrap intensity (0=no sampling, higher=more) | Medium |
subsample |
N/A | Use with bootstrap_type='Bernoulli' |
Medium |
random_strength |
1.0 | Randomness for split scoring — adds noise to splits | Medium |
border_count |
254 | Histogram bins for numerical features | Low |
cat_features |
None | List of categorical feature names/indices | Required |
one_hot_max_size |
2 | One-hot encode categories with ≤ this many values | Medium |
max_ctr_complexity |
4 | Max categorical feature combinations | Medium |
leaf_estimation_method |
'Newton' | 'Newton' (2nd order) or 'Gradient' (1st order) | Low |
leaf_estimation_iterations |
1 | Steps for leaf value computation | Low |
boosting_type |
'Ordered' | 'Ordered' (default, better) or 'Plain' (faster) | High |
eval_metric |
auto | Metric for early stopping | Medium |
early_stopping_rounds |
None | Patience for early stopping | High |
task_type |
'CPU' | 'CPU' or 'GPU' | - |
verbose |
1 | Frequency of evaluation output | - |
random_seed |
0 | Random seed | - |
nan_mode |
'Min' | How to handle missing values | Low |
text_features |
None | Column names for text features | Optional |
embedding_features |
None | Column names for embedding features | Optional |
monotone_constraints |
None | Dict of feature → | Optional |
feature_weights |
None | Per-feature sampling weights | Low |
penalties_coefficient |
1.0 | Scale for complexity penalties | Low |
Priority parameters:
1. depth (3–10 — deeper than LightGBM due to oblivious constraint)
2. learning_rate + iterations (early stopping recommended)
3. l2_leaf_reg (main regularizer — higher = smoother)
4. random_strength (noise on splits — prevents overfitting)
5. bagging_temperature (bootstrap intensity)
12. Feature Importance and SHAP
Built-in Importance Types
# PredictionValuesChange — how much predictions change when feature is varied
clf.get_feature_importance(type='PredictionValuesChange')
# LossFunctionChange — how much loss increases if feature is removed
clf.get_feature_importance(type='LossFunctionChange') # Most reliable
# ShapValues — SHAP for each sample
clf.get_feature_importance(type='ShapValues', data=test_pool)
# Interaction SHAP
clf.get_feature_importance(type='ShapInteractionValues', data=test_pool)
# Internal importance
clf.get_feature_importance(type='InternalFeatureImportance')
CatBoost + SHAP
from catboost import Pool
import shap
test_pool = Pool(X_test, cat_features=['city', 'device'])
shap_values = clf.get_feature_importance(
type='ShapValues',
data=test_pool
)
# shap_values shape: (n_samples, n_features + 1) — last column is base value
shap.summary_plot(shap_values[:, :-1], X_test)
shap.waterfall_plot(shap.Explanation(
values=shap_values[0, :-1],
base_values=shap_values[0, -1],
data=X_test.iloc[0]
))
CatBoost computes SHAP values natively via its own efficient implementation — no separate shap library call needed (though shap.TreeExplainer also works).
13. Model Analysis and Visualization Tools
CatBoost provides richer built-in analysis tools than XGBoost or LightGBM:
from catboost import CatBoostClassifier
from catboost.utils import get_roc_curve, get_confusion_matrix
import catboost
# Training with plots (interactive in Jupyter)
clf = CatBoostClassifier(plot=True)
clf.fit(train_pool, eval_set=val_pool)
# ROC curve
curve = get_roc_curve(clf, val_pool)
# Prediction distribution
clf.plot_predictions(val_pool)
# Feature statistics (distributions per class)
clf.plot_features_distribution(val_pool)
# SHAP-based feature contribution plot
clf.plot_feature_importance(val_pool, plot=True)
# Select threshold for binary [[Classification]]
catboost.select_threshold(clf, data=val_pool, FPR=0.05)
These tools make CatBoost particularly useful for exploratory analysis and model auditing in business contexts.
14. The Bias-Variance Profile
CatBoost's oblivious trees introduce a different bias-variance tradeoff than XGBoost/LightGBM:
Oblivious tree at depth d: Higher bias, lower variance than standard tree at depth d
→ needs more trees to achieve same bias reduction
→ but each tree overfits less → better small-data behavior
Ordered boosting further reduces variance by using unbiased pseudo-residuals:
Ordered boosting advantage: Larger at small n, diminishes as n → ∞
At n > 100k, 'Plain' mode is often equally good
This is why CatBoost is often recommended for small-to-medium datasets — where prediction shift and overfitting are most problematic. The ordered mechanisms provide more conservative, robust learning.
Small data (< 10k): CatBoost often best — ordered boosting helps most
Medium data (10k-1M): Comparable to XGBoost/LightGBM
Large data (> 1M): LightGBM may be faster, CatBoost competitive
15. Assumptions
| Assumption | Notes |
|---|---|
| IID samples | Ordered boosting requires random ordering — assumes i.i.d. |
| No feature scaling required | Oblivious trees are scale-invariant |
| No distributional assumption | Non-parametric |
| No extrapolation | Tree-based — flat predictions outside training range |
| Sufficient cardinality | Ordered encoding benefits categories with ≥ a few occurrences |
| Clean categories | Typos or inconsistent naming in categorical columns will create spurious categories |
16. Advantages
✅ Best Categorical Feature Handling
No preprocessing required. CatBoost handles high-cardinality categoricals, missing categoricals, and categorical combinations natively — and usually produces better results than manual encoding.
✅ No Target Leakage in Categorical Encoding
Ordered target encoding prevents the subtle leakage that plagues standard target encoding — particularly important for small datasets where overfitting is severe.
✅ Strong Default Performance
CatBoost's defaults are the most carefully chosen of the three major GBT libraries. auto learning rate, depth=6, l2_leaf_reg=3 — these often produce competitive results without tuning.
✅ Fast Prediction
Oblivious trees implement prediction as bit-vector table lookups — O(depth) per sample with extremely low constant factors. CatBoost is often the fastest at inference time.
✅ GPU Training
One of the fastest GPU GBT implementations, especially for categorical-heavy data.
✅ Rich Interpretability Tools
Built-in SHAP, interaction SHAP, LossFunctionChange importance, prediction change tools — more comprehensive than XGBoost/LightGBM.
✅ Text and Embedding Features
Unique among GBT libraries — handles text and pre-computed embeddings natively.
✅ Ordered Boosting for Small Data
Unbiased pseudo-residuals → better generalization on small datasets where prediction shift is a real problem.
✅ Minimal Preprocessing
Handles categoricals, NaN, text, and embeddings natively — reduces data preprocessing burden substantially.
17. Drawbacks & Limitations
❌ Slower on Pure Numerical Datasets
When data has no categorical features, CatBoost's oblivious tree structure provides no advantage and may be slower than LightGBM's leaf-wise histograms. The ordered boosting overhead also slows training.
❌ Memory Intensive for Ordered Boosting
Maintaining ordered statistics requires storing cumulative statistics per permutation — memory increases proportionally with the number of permutations and categorical features.
❌ boosting_type='Ordered' Slow on Large Data
Ordered boosting's O(n log n) complexity is fine for small-medium data but makes CatBoost slower than LightGBM for datasets > 1M rows. Use boosting_type='Plain' for large data.
❌ Depth Limit for Oblivious Trees
A depth-10 oblivious tree has only the same expressive power as a depth-10 standard tree along a single feature path. CatBoost needs higher depth or more trees to match the same capacity as LightGBM's leaf-wise asymmetric trees.
❌ Limited Ecosystem Compared to XGBoost
XGBoost has deeper integration with Dask, Spark, ONNX, and cloud ML frameworks. CatBoost's ecosystem is narrower.
❌ Fewer Custom Loss Options
CatBoost supports custom objectives, but the interface is less flexible than XGBoost's grad/hess system for complex custom losses.
18. CatBoost vs. LightGBM vs. XGBoost
| Property | CatBoost | [[LightGBM]] | XGBoost |
|---|---|---|---|
| Categorical features | ✅✅ Best (native) | ✅ Good (sorted part.) | ❌ Manual encoding |
| Target leakage | ✅ None (ordered) | ⚠️ Standard encoding | ⚠️ Standard encoding |
| Training speed (CPU) | ⚠️ Slower (ordered) | ✅✅ Fastest | ✅ Fast |
| Training speed (GPU) | ✅ Very fast | ✅ Fast | ✅ Fast |
| Prediction speed | ✅✅ Fastest (oblivious) | ✅ Fast | ✅ Fast |
| Small data | ✅✅ Best (ordered) | ⚠️ Careful tuning | ✅ Good |
| Large data | ✅ Good | ✅✅ Best | ✅ Good |
| Memory | Moderate-High | ✅✅ Lowest | Moderate |
| Default performance | ✅✅ Best defaults | ✅ Good | ✅ Good |
| Hyperparameter tuning | ✅ Forgiving | ⚠️ Sensitive | Moderate |
| Text/embedding features | ✅✅ Unique | ❌ | ❌ |
| SHAP support | ✅ Native + shap | ✅ shap | ✅ Best (shap) |
| Interpretability tools | ✅✅ Rich built-in | ✅ Basic | ✅ Good |
| Ecosystem integration | ⚠️ Narrower | ✅ Good | ✅✅ Widest |
| Best for | Categorical data, small-medium | Large data | General purpose |
19. Practical Tips & Gotchas
Always Specify cat_features
from catboost import CatBoostClassifier, Pool
# Identify categorical columns
cat_features = [col for col in X_train.columns
if X_train[col].dtype == 'object' or X_train[col].dtype.name == 'category']
# Using Pool (recommended for large datasets)
train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool = Pool(X_val, y_val, cat_features=cat_features)
clf = CatBoostClassifier(
iterations=1000,
depth=6,
learning_rate=0.05,
l2_leaf_reg=3.0,
early_stopping_rounds=50,
eval_metric='AUC',
verbose=100,
random_seed=42
)
clf.fit(train_pool, eval_set=val_pool)
Use 'Plain' Boosting for Large Datasets
# For n > 100k rows, ordered boosting overhead isn't worth it
clf = CatBoostClassifier(
boosting_type='Plain', # Much faster; similar [[Accuracy]] at large n
iterations=2000,
learning_rate=0.05,
depth=8,
early_stopping_rounds=100,
verbose=200
)
Canonical Hyperparameter Tuning
import optuna
from catboost import CatBoostClassifier, Pool
train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool = Pool(X_val, y_val, cat_features=cat_features)
def objective(trial):
params = {
'iterations': 500,
'depth': trial.suggest_int('depth', 4, 10),
'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1.0, 20.0, log=True),
'random_strength': trial.suggest_float('random_strength', 0.1, 10.0, log=True),
'bagging_temperature':trial.suggest_float('bagging_temperature', 0.0, 1.0),
'border_count': trial.suggest_categorical('border_count', [32, 64, 128, 254]),
'boosting_type': 'Plain',
'eval_metric': 'AUC',
'verbose': 0,
'random_seed': 42
}
clf = CatBoostClassifier(**params)
clf.fit(train_pool, eval_set=val_pool, early_stopping_rounds=30)
return clf.best_score_['validation']['AUC']
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
Custom Loss Function
class CustomLoss:
def calc_ders_range(self, approxes, targets, weights):
"""Return list of (gradient, hessian) per sample"""
assert len(approxes) == len(targets)
result = []
for approx, target in zip(approxes, targets):
p = 1.0 / (1.0 + np.exp(-approx))
result.append((target - p, p * (1 - p))) # (grad, hess)
return result
clf = CatBoostClassifier(loss_function=CustomLoss())
Model Persistence and Loading
# Save model
clf.save_model('catboost_model.cbm')
# Load model
from catboost import CatBoostClassifier
clf_loaded = CatBoostClassifier()
clf_loaded.load_model('catboost_model.cbm')
# Export to ONNX
clf.save_model('model.onnx', format='onnx',
export_parameters={'onnx_domain': 'ai.catboost'})
Feature Importance Pipeline
# LossFunctionChange — most reliable importance type
importance = clf.get_feature_importance(
type='LossFunctionChange',
data=val_pool
)
import pandas as pd
feat_imp = pd.Series(importance, index=feature_names).sort_values(ascending=False)
print(feat_imp.head(20))
# SHAP
shap_vals = clf.get_feature_importance(type='ShapValues', data=val_pool)
import shap
shap.summary_plot(shap_vals[:, :-1], X_val, feature_names=feature_names)
20. When to Use It
Use CatBoost when:
- Categorical features dominate your dataset — it's the clear winner here
- High-cardinality categories are present (city, user ID, product ID with thousands of unique values)
- You want minimal preprocessing — CatBoost handles raw categoricals, NaN, and even text
- Small to medium dataset (< 500k rows) where ordered boosting's bias reduction matters
- You want strong default performance with minimal hyperparameter tuning
- Fast inference is needed — oblivious tree lookup is the fastest GBT prediction
- GPU training with categorical features — CatBoost GPU is optimized for this
- You need built-in SHAP and interpretability tools without external dependencies
- Text or embedding features alongside tabular features
- Regulatory/audit requirements — LossFunctionChange importance is more principled
Use [[LightGBM]] instead when:
- Dataset is very large (> 1M rows) and training speed is the bottleneck
- Pure numerical features dominate — no categorical advantage
- Memory is constrained
Use XGBoost instead when:
- You need custom loss functions with full gradient/hessian flexibility
- Ecosystem integration (Spark, Dask, ONNX) is critical
- You need mature ranking support (LambdaMART)
Summary
┌─────────────────────────────────────────────────────────────────────┐
│ CATBOOST AT A GLANCE │
├─────────────────────────────────────────────────────────────────────┤
│ CORE IDEA Ordered encoding + ordered boosting + oblivious trees │
│ CATEGORICAL Ordered target encoding — no preprocessing, no leakage│
│ TREE TYPE Oblivious (symmetric) — same split at each depth level│
│ PREDICTION Bit-vector table lookup → fastest inference │
│ ORDERING Random permutation per tree → unbiased pseudo-residuals│
│ KEY PARAMS depth, l2_leaf_reg, random_strength, iterations │
│ STRENGTH Categoricals, small data, defaults, fast prediction │
│ WEAKNESS Slow on large pure-numerical data (use Plain mode) │
│ vs LGB/XGB Best for cat features; LGB wins on large numeric data │
│ BEST FOR Categorical-heavy data, small-medium, fast inference │
└─────────────────────────────────────────────────────────────────────┘
CatBoost asked the question that [[XGBoost]] and [[LightGBM]] implicitly avoided: what about the data that every real-world dataset actually contains? Categorical features are not an edge case — they are the norm in e-commerce, fraud detection, medical records, and search ranking. The industry response to this was years of careful feature engineering: mean encoding, frequency encoding, one-hot explosions, and target encoding with cross-validation tricks. CatBoost replaced all of this with a principled statistical solution — ordered statistics that cannot leak, evaluated on permutations that cannot overfit, in trees that cannot meander. The result is an algorithm that requires almost no preprocessing for the most common type of real-world tabular data. That is not a minor contribution; it is a fundamental simplification of how we use gradient boosting in practice.