Extra-Trees
Extra-Trees
Extremely Randomized Trees
π Related ensemble methods:
1. What Are Extra-Trees?
Extra-Trees (Extremely Randomized Trees), introduced by Geurts, Ernst, and Wehenkel (2006), is an ensemble of decision trees distinguished from Random Forest by two modifications:
- No bootstrap sampling β each tree is trained on the full training dataset
- Random split thresholds β instead of finding the optimal threshold for each candidate feature, Extra-Trees draws a random threshold uniformly from the feature's observed range
These changes push randomness further than Random Forest, yielding an ensemble with lower variance, slightly higher bias, and significantly faster training β while maintaining accuracy within ~1% of Random Forest on most datasets.
| Property | Extra-Trees | Random Forest |
|---|---|---|
| Bootstrap | β No β full dataset per tree | β Yes β bootstrap sample per tree |
| Feature subset | β Random subset per split | β Random subset per split |
| Split threshold | β Random (one per feature candidate) | β Optimal (scans all unique values) |
| Training speed | β β 2β5Γ faster | β Fast |
| Variance | β β Lower | β Low |
| Bias | Slightly higher | Low |
| OOB evaluation | β No (no bootstrap by default) | β Yes |
2. The Core Innovation β Randomized Thresholds
In a standard decision tree, split-finding for feature f scans every unique value as a candidate threshold:
For each unique value t of feature f:
Compute ImpurityGain(S, f, t)
Best: t* = argmax_t ImpurityGain(S, f, t)
β O(m log m) per feature per node (sort + scan)
In Extra-Trees, for each candidate feature f, exactly one random threshold is drawn:
t_f ~ Uniform(min(feature_f in S), max(feature_f in S))
score_f = ImpurityGain(S, f, t_f)
β O(m) per feature per node (no sort, one evaluation)
The best feature (not the best threshold) is then selected:
f* = argmax_{f β F_sub} score_f
Split: use (f*, t_{f*}) β the best feature with its random threshold
What is still optimized: which feature to split on.
What is randomized: the threshold for each candidate feature.
3. Mathematical Foundation
3.1 Split Selection in Extra-Trees
At each node with samples S and random feature subset F_sub of size K:
for f in F_sub:
a_f = min_{iβS} x_{if}, b_f = max_{iβS} x_{if}
t_f ~ Uniform(a_f, b_f)
score_f = ImpurityGain(S, f, t_f)
(f*, t*) = argmax_f score_f
For categorical features: a random binary partition of categories is drawn instead of a threshold.
3.2 Sources of Randomness Compared to Random Forest
| Randomness source | Random Forest | Extra-Trees |
|---|---|---|
| Data (bootstrap) | β Per-tree resampling | β Full dataset always |
| Feature subset | β βp per split | β βp per split |
| Threshold selection | β Optimal (deterministic) | β Random (stochastic) |
Extra-Trees replaces data randomness with threshold randomness. The total randomness is comparable; its structure is different.
3.3 Expected Split Gain Under Randomization
For feature f, the optimal threshold achieves:
Gain*(f) = max_t ImpurityGain(S, f, t)
A uniformly random threshold achieves:
E_t[Gain(f, t)] β€ Gain*(f) (in expectation, worse than optimal)
The gap between E[random gain] and optimal gain:
- Shrinks as m β β (Geurts et al. 2006 prove asymptotic equivalence)
- Is smaller when the optimal threshold is near the center of the feature range
- Is larger when the optimal threshold is at an extreme value (rare but possible)
In the limit of infinite data, Extra-Trees and Random Forest achieve the same bias β the threshold randomization is asymptotically negligible.
4. Bias-Variance Analysis β In Depth
4.1 Variance: Lower Than Random Forest
Variance of the ensemble average FΜ = (1/B)Ξ£ fΜ_b:
Var(FΜ) = Ο Β· ΟΒ² + (1βΟ) Β· ΟΒ²/B
Extra-Trees achieves lower Ο (inter-tree correlation) than Random Forest because:
- Random thresholds β two trees seeing the same data still produce different splits on the same feature (different random thresholds), diverging their predictions
- No bootstrap β all trees see the same samples, but their random splits create highly variable tree structures despite this
Net effect: Ο_{Extra-Trees} < Ο_{Random Forest}, so the variance floor ΟΟΒ² is lower.
4.2 Bias: Slightly Higher Than Random Forest
Extra-Trees has slightly higher bias because:
- Each split uses a random rather than optimal threshold β expected impurity reduction per split is lower
- Trees trained on the full dataset (vs. bootstrap) are less diverse in their training distributions β the ensemble's bias doesn't benefit from bootstrap's data perturbation
In practice the bias difference is small β typically < 1% accuracy.
4.3 When Extra-Trees Wins vs. When RF Wins
Extra-Trees wins:
- High-noise datasets (many irrelevant features, label noise)
β Random thresholds resist noise-induced optimal splits
- Large datasets (m >> 10k)
β Asymptotic equivalence means bias gap closes; speed advantage matters
- Speed-constrained training
β 2β5Γ faster for equivalent n_estimators
Random Forest wins:
- Small datasets (m < 5k)
β Bootstrap helps with limited samples; bias gap is larger
- Sharp decision boundaries
β Optimal threshold search captures precise boundaries
- OOB evaluation needed
β Not available in Extra-Trees by default
5. The Full Training Algorithm
function ExtraTrees(D, B, max_features, min_samples_leaf):
forest = []
parallel for b = 1 to B:
# NO bootstrap β every tree sees the full dataset D
tree_b = GrowExtraTree(D, max_features, min_samples_leaf)
forest.append(tree_b)
return forest
function GrowExtraTree(S, max_features, min_samples_leaf):
# Stopping conditions
if |S| <= min_samples_leaf or pure(S):
return Leaf(majority_class(S))
# Random feature subset
F_sub = random_sample(all_features, size=max_features, replace=False)
best_score = -β
best = None
for f in F_sub:
# ONE random threshold from the range of feature f in S
t_f = Uniform(min(S[:, f]), max(S[:, f]))
score = ImpurityGain(S, f, t_f)
if score > best_score:
best_score, best = score, (f, t_f)
if best_score <= 0:
return Leaf(majority_class(S))
S_L, S_R = split(S, best)
return InternalNode(best,
left = GrowExtraTree(S_L, max_features, min_samples_leaf),
right = GrowExtraTree(S_R, max_features, min_samples_leaf))
Complexity: O(B Β· m Β· K Β· depth) β no log(m) factor from sorting.
6. Why No Bootstrap?
Geurts et al. deliberately chose to omit bootstrap sampling, reasoning that:
- Random thresholds already provide sufficient tree decorrelation
- Full dataset gives more samples per leaf β less estimation variance per leaf
- Combining bootstrap + random thresholds adds no meaningful benefit (empirically verified in the paper)
If you need OOB evaluation, set bootstrap=True in sklearn β this creates a "bootstrapped Extra-Trees" that combines both sources of data randomness.
7. Speed Advantage β Where It Comes From
The speed difference over Random Forest has two sources:
Source 1 β No threshold search (primary):
RF per node per feature: Sort O(m log m) + scan O(m) = O(m log m)
ET per node per feature: Draw threshold O(1) + evaluate O(m) = O(m)
Speedup factor: log(m) (e.g., log(100000) β 12Γ)
Source 2 β No bootstrap sampling:
RF per tree: O(m) sampling overhead
ET per tree: Zero sampling overhead
Combined: typically 2β5Γ wall-clock speedup for the same n_estimators and max_features. For very large m (millions), the speedup approaches the theoretical log(m) factor.
8. Extra-Trees as Regularization
Random threshold selection functions as implicit regularization: splits that require precisely-tuned thresholds are penalized; splits that work across a wide range of thresholds are favored and consistently selected.
This is analogous to dropout in neural networks β deliberately injecting noise to prevent overfitting to training-data-specific optima.
Consequence: On datasets with label noise or many irrelevant features, Extra-Trees often outperforms Random Forest, because Random Forest's optimal threshold search fits noise-induced split points that don't generalize.
9. Extra-Trees for Regression
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor(
n_estimators=500,
max_features=1.0, # Geurts et al. recommend all features for regression
min_samples_leaf=5,
n_jobs=-1,
random_state=42
)
etr.fit(X_train, y_train)
print(f"RΒ²: {etr.score(X_test, y_test):.4f}")
Key difference: For regression, the recommended max_features=1.0 (all features), not βp. The original paper found this optimal β random thresholds alone provide sufficient decorrelation for regression without feature subsampling.
10. Feature Importance
All three importance tiers are available:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
import shap
etc = ExtraTreesClassifier(n_estimators=500, n_jobs=-1).fit(X_train, y_train)
# MDI (fast, slightly noisier than RF due to random thresholds)
mdi = pd.Series(etc.feature_importances_, index=feature_names)
# MDA (preferred for Extra-Trees β more reliable than MDI)
result = permutation_importance(etc, X_val, y_val, n_repeats=20, n_jobs=-1)
mda = pd.Series(result.importances_mean, index=feature_names)
# SHAP (exact via TreeExplainer)
explainer = shap.TreeExplainer(etc)
shap_values = explainer.shap_values(X_test)
Note: MDI is slightly less reliable for Extra-Trees than Random Forest β a feature may appear important because a lucky random threshold happened to produce good splits. Prefer MDA or SHAP for feature selection decisions.
11. Hyperparameters β Complete Reference
from sklearn.ensemble import ExtraTreesClassifier
ExtraTreesClassifier(
n_estimators=100, # Trees β more is better, set 500+
criterion='gini', # 'gini', 'entropy', 'log_loss'
max_depth=None, # None = fully grown (default, recommended)
min_samples_split=2,
min_samples_leaf=1, # Primary regularizer β tune 1β50
max_features='sqrt', # 'sqrt', 'log2', None, int, float
max_leaf_nodes=None,
min_impurity_decrease=0.0,
bootstrap=False, # KEY: False by default (no OOB)
oob_score=False, # Set True only if bootstrap=True
n_jobs=-1,
random_state=42,
warm_start=False,
class_weight=None # 'balanced', dict
)
Tuning priority:
1. n_estimators: 500+ for production
2. max_features: 'sqrt' (classification) or 1.0 (regression)
3. min_samples_leaf: Tune for overfitting (1β50)
4. bootstrap: True if OOB needed; False otherwise
12. Assumptions
| Assumption | Notes |
|---|---|
| IID samples | Standard; no bootstrap means no resampling diversity |
| No feature scaling required | Scale-invariant (tree splits, even random ones) |
| No distributional assumption | Non-parametric |
| No extrapolation | Like all tree-based models |
| Threshold range covers signal | Random threshold drawn from [min, max] β extreme optimal thresholds may be missed |
13. Advantages
β Fastest Tree Ensemble Training
No sorting, no bootstrap β 2β5Γ faster than Random Forest with equivalent parameters.
β Lower Variance Than Random Forest
Random thresholds further reduce inter-tree correlation beyond bootstrap + feature subsets.
β Effective Noise Regularization
Random thresholds resist fitting noise-induced optimal split points that don't generalize.
β Full sklearn Compatibility
Identical API to Random Forest β drop-in replacement for speed experiments.
β Parallelizable
Fully independent trees; linear speedup with CPU cores.
β Competitive Accuracy
Usually within 1% of Random Forest β an excellent trade for 2β5Γ training speed.
14. Drawbacks & Limitations
β No OOB Evaluation by Default
Without bootstrap, no free validation estimate β requires an explicit validation split.
β Slightly Higher Bias
Random thresholds are suboptimal β fine for most data, but problematic when precise boundaries matter.
β MDI Less Reliable
Feature importances from random-threshold splits are noisier than Random Forest's MDI.
β Weaker on Small Datasets
Bootstrap's stabilizing effect on small samples is absent β RF is more appropriate below ~5k samples.
β No Extrapolation
Same tree-based limitation as Random Forest.
15. Extra-Trees vs. Random Forest vs. Bagging
| Property | Extra-Trees | Random Forest | Bagging Classifier |
|---|---|---|---|
| Bootstrap | β No (default) | β Yes | β Yes |
| Feature subset | β βp per split | β βp per split | β All features |
| Threshold | β Random | β Optimal | β Optimal (base tree) |
| Training speed | β β Fastest | β Fast | β Fast |
| Variance | β Lowest | β Low | Medium |
| Bias | Slightly higher | Low | Same as base learner |
| OOB | β No | β Yes | β Yes |
| Any base learner | β Trees only | β Trees only | β Any |
| Best for | Speed-constrained | General purpose | Custom base learners |
16. Practical Tips & Gotchas
Drop-in Speed Test
import time
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
for name, clf in [
('RF', RandomForestClassifier(n_estimators=500, n_jobs=-1)),
('ET', ExtraTreesClassifier(n_estimators=500, n_jobs=-1))
]:
t0 = time.time()
clf.fit(X_train, y_train)
train_time = time.time() - t0
score = clf.score(X_test, y_test)
print(f"{name}: {score:.4f} accuracy, {train_time:.1f}s training")
Enable OOB When Needed
etc = ExtraTreesClassifier(
n_estimators=500,
bootstrap=True, # Adds bootstrap to Extra-Trees
oob_score=True,
n_jobs=-1
)
etc.fit(X_train, y_train)
print(f"OOB: {etc.oob_score_:.4f}")
Use as Fast Feature Selector
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(
ExtraTreesClassifier(n_estimators=200, n_jobs=-1),
threshold='mean'
)
X_selected = selector.fit_transform(X_train, y_train)
print(f"Selected {X_selected.shape[1]} of {X_train.shape[1]} features")
Tune min_samples_leaf for Regularization
from sklearn.model_selection import cross_val_score
for msl in [1, 2, 5, 10, 20, 50]:
etc = ExtraTreesClassifier(n_estimators=200, min_samples_leaf=msl, n_jobs=-1)
score = cross_val_score(etc, X_train, y_train, cv=5, scoring='roc_auc').mean()
print(f"min_samples_leaf={msl}: {score:.4f}")
17. When to Use It
Use Extra-Trees when:
- Training speed is the primary constraint
- Dataset is large (> 50k rows) where log(m) threshold search is expensive
- Data is noisy β random thresholds provide additional regularization
- Accuracy close to Random Forest is acceptable with much faster training
- Fast feature importance screening (follow up with MDA or SHAP)
Use Random Forest instead when:
- OOB evaluation is needed without a separate validation set
- Dataset is small (< 5k rows)
- Sharp, precisely-located decision boundaries matter
- MDI reliability is important for feature selection
Summary
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β EXTRA-TREES AT A GLANCE β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β RANDOMNESS Random feature subsets + random split thresholds β
β NO BOOTSTRAP Full training dataset per tree (no OOB by default) β
β SPEED O(BΒ·mΒ·KΒ·depth) vs O(BΒ·mΒ·KΒ·log(m)Β·depth) for RF β
β VARIANCE Lower than RF β random thresholds reduce Ο further β
β BIAS Slightly higher β suboptimal thresholds β
β OOB β Not available unless bootstrap=True β
β STRENGTH Speed, noise robustness, variance reduction β
β WEAKNESS No OOB, slightly higher bias, noisier MDI β
β vs RF Usually β€ 1% accuracy gap; 2β5Γ faster training β
β BEST FOR Large/noisy datasets, speed-constrained environments β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Extra-Trees proves that in ensemble learning, the ensemble mechanism matters more than the quality of any individual split. Random thresholds lose almost nothing in accuracy while saving enormous compute β an empirical demonstration that the information in individual splits is largely redundant when you're building hundreds of trees.