Extra-Trees

Extra-Trees

Extremely Randomized Trees

πŸ”— Related ensemble methods:

1. What Are Extra-Trees?

Extra-Trees (Extremely Randomized Trees), introduced by Geurts, Ernst, and Wehenkel (2006), is an ensemble of decision trees distinguished from Random Forest by two modifications:

  1. No bootstrap sampling β€” each tree is trained on the full training dataset
  2. Random split thresholds β€” instead of finding the optimal threshold for each candidate feature, Extra-Trees draws a random threshold uniformly from the feature's observed range

These changes push randomness further than Random Forest, yielding an ensemble with lower variance, slightly higher bias, and significantly faster training β€” while maintaining accuracy within ~1% of Random Forest on most datasets.

Property Extra-Trees Random Forest
Bootstrap ❌ No β€” full dataset per tree βœ… Yes β€” bootstrap sample per tree
Feature subset βœ… Random subset per split βœ… Random subset per split
Split threshold ❌ Random (one per feature candidate) βœ… Optimal (scans all unique values)
Training speed βœ…βœ… 2–5Γ— faster βœ… Fast
Variance βœ…βœ… Lower βœ… Low
Bias Slightly higher Low
OOB evaluation ❌ No (no bootstrap by default) βœ… Yes

2. The Core Innovation β€” Randomized Thresholds

In a standard decision tree, split-finding for feature f scans every unique value as a candidate threshold:

For each unique value t of feature f:
    Compute ImpurityGain(S, f, t)
Best: t* = argmax_t ImpurityGain(S, f, t)
β†’ O(m log m) per feature per node  (sort + scan)

In Extra-Trees, for each candidate feature f, exactly one random threshold is drawn:

t_f ~ Uniform(min(feature_f in S), max(feature_f in S))
score_f = ImpurityGain(S, f, t_f)
β†’ O(m) per feature per node  (no sort, one evaluation)

The best feature (not the best threshold) is then selected:

f* = argmax_{f ∈ F_sub} score_f
Split: use (f*, t_{f*}) β€” the best feature with its random threshold

What is still optimized: which feature to split on.
What is randomized: the threshold for each candidate feature.


3. Mathematical Foundation

3.1 Split Selection in Extra-Trees

At each node with samples S and random feature subset F_sub of size K:

for f in F_sub:
    a_f = min_{i∈S} x_{if},   b_f = max_{i∈S} x_{if}
    t_f ~ Uniform(a_f, b_f)
    score_f = ImpurityGain(S, f, t_f)

(f*, t*) = argmax_f score_f

For categorical features: a random binary partition of categories is drawn instead of a threshold.


3.2 Sources of Randomness Compared to Random Forest

Randomness source Random Forest Extra-Trees
Data (bootstrap) βœ… Per-tree resampling ❌ Full dataset always
Feature subset βœ… √p per split βœ… √p per split
Threshold selection ❌ Optimal (deterministic) βœ… Random (stochastic)

Extra-Trees replaces data randomness with threshold randomness. The total randomness is comparable; its structure is different.


3.3 Expected Split Gain Under Randomization

For feature f, the optimal threshold achieves:

Gain*(f) = max_t ImpurityGain(S, f, t)

A uniformly random threshold achieves:

E_t[Gain(f, t)] ≀ Gain*(f)    (in expectation, worse than optimal)

The gap between E[random gain] and optimal gain:

In the limit of infinite data, Extra-Trees and Random Forest achieve the same bias β€” the threshold randomization is asymptotically negligible.


4. Bias-Variance Analysis β€” In Depth

4.1 Variance: Lower Than Random Forest

Variance of the ensemble average FΜ‚ = (1/B)Ξ£ fΜ‚_b:

Var(FΜ‚) = ρ Β· σ² + (1βˆ’Ο) Β· σ²/B

Extra-Trees achieves lower ρ (inter-tree correlation) than Random Forest because:

  1. Random thresholds β€” two trees seeing the same data still produce different splits on the same feature (different random thresholds), diverging their predictions
  2. No bootstrap β€” all trees see the same samples, but their random splits create highly variable tree structures despite this

Net effect: ρ_{Extra-Trees} < ρ_{Random Forest}, so the variance floor ρσ² is lower.


4.2 Bias: Slightly Higher Than Random Forest

Extra-Trees has slightly higher bias because:

In practice the bias difference is small β€” typically < 1% accuracy.


4.3 When Extra-Trees Wins vs. When RF Wins

Extra-Trees wins:
    - High-noise datasets (many irrelevant features, label noise)
      β†’ Random thresholds resist noise-induced optimal splits
    - Large datasets (m >> 10k)
      β†’ Asymptotic equivalence means bias gap closes; speed advantage matters
    - Speed-constrained training
      β†’ 2–5Γ— faster for equivalent n_estimators

Random Forest wins:
    - Small datasets (m < 5k)
      β†’ Bootstrap helps with limited samples; bias gap is larger
    - Sharp decision boundaries
      β†’ Optimal threshold search captures precise boundaries
    - OOB evaluation needed
      β†’ Not available in Extra-Trees by default

5. The Full Training Algorithm

function ExtraTrees(D, B, max_features, min_samples_leaf):

    forest = []

    parallel for b = 1 to B:
        # NO bootstrap β€” every tree sees the full dataset D
        tree_b = GrowExtraTree(D, max_features, min_samples_leaf)
        forest.append(tree_b)

    return forest


function GrowExtraTree(S, max_features, min_samples_leaf):

    # Stopping conditions
    if |S| <= min_samples_leaf or pure(S):
        return Leaf(majority_class(S))

    # Random feature subset
    F_sub = random_sample(all_features, size=max_features, replace=False)

    best_score = -∞
    best = None

    for f in F_sub:
        # ONE random threshold from the range of feature f in S
        t_f = Uniform(min(S[:, f]), max(S[:, f]))
        score = ImpurityGain(S, f, t_f)
        if score > best_score:
            best_score, best = score, (f, t_f)

    if best_score <= 0:
        return Leaf(majority_class(S))

    S_L, S_R = split(S, best)
    return InternalNode(best,
        left  = GrowExtraTree(S_L, max_features, min_samples_leaf),
        right = GrowExtraTree(S_R, max_features, min_samples_leaf))

Complexity: O(B Β· m Β· K Β· depth) β€” no log(m) factor from sorting.


6. Why No Bootstrap?

Geurts et al. deliberately chose to omit bootstrap sampling, reasoning that:

If you need OOB evaluation, set bootstrap=True in sklearn β€” this creates a "bootstrapped Extra-Trees" that combines both sources of data randomness.


7. Speed Advantage β€” Where It Comes From

The speed difference over Random Forest has two sources:

Source 1 β€” No threshold search (primary):

RF per node per feature:   Sort O(m log m) + scan O(m) = O(m log m)
ET per node per feature:   Draw threshold O(1) + evaluate O(m) = O(m)
Speedup factor:            log(m)    (e.g., log(100000) β‰ˆ 12Γ—)

Source 2 β€” No bootstrap sampling:

RF per tree:   O(m) sampling overhead
ET per tree:   Zero sampling overhead

Combined: typically 2–5Γ— wall-clock speedup for the same n_estimators and max_features. For very large m (millions), the speedup approaches the theoretical log(m) factor.


8. Extra-Trees as Regularization

Random threshold selection functions as implicit regularization: splits that require precisely-tuned thresholds are penalized; splits that work across a wide range of thresholds are favored and consistently selected.

This is analogous to dropout in neural networks β€” deliberately injecting noise to prevent overfitting to training-data-specific optima.

Consequence: On datasets with label noise or many irrelevant features, Extra-Trees often outperforms Random Forest, because Random Forest's optimal threshold search fits noise-induced split points that don't generalize.


9. Extra-Trees for Regression

from sklearn.ensemble import ExtraTreesRegressor

etr = ExtraTreesRegressor(
    n_estimators=500,
    max_features=1.0,     # Geurts et al. recommend all features for regression
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=42
)
etr.fit(X_train, y_train)
print(f"RΒ²: {etr.score(X_test, y_test):.4f}")

Key difference: For regression, the recommended max_features=1.0 (all features), not √p. The original paper found this optimal β€” random thresholds alone provide sufficient decorrelation for regression without feature subsampling.


10. Feature Importance

All three importance tiers are available:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
import shap

etc = ExtraTreesClassifier(n_estimators=500, n_jobs=-1).fit(X_train, y_train)

# MDI (fast, slightly noisier than RF due to random thresholds)
mdi = pd.Series(etc.feature_importances_, index=feature_names)

# MDA (preferred for Extra-Trees β€” more reliable than MDI)
result = permutation_importance(etc, X_val, y_val, n_repeats=20, n_jobs=-1)
mda = pd.Series(result.importances_mean, index=feature_names)

# SHAP (exact via TreeExplainer)
explainer = shap.TreeExplainer(etc)
shap_values = explainer.shap_values(X_test)

Note: MDI is slightly less reliable for Extra-Trees than Random Forest β€” a feature may appear important because a lucky random threshold happened to produce good splits. Prefer MDA or SHAP for feature selection decisions.


11. Hyperparameters β€” Complete Reference

from sklearn.ensemble import ExtraTreesClassifier

ExtraTreesClassifier(
    n_estimators=100,           # Trees β€” more is better, set 500+
    criterion='gini',           # 'gini', 'entropy', 'log_loss'
    max_depth=None,             # None = fully grown (default, recommended)
    min_samples_split=2,
    min_samples_leaf=1,         # Primary regularizer β€” tune 1–50
    max_features='sqrt',        # 'sqrt', 'log2', None, int, float
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=False,            # KEY: False by default (no OOB)
    oob_score=False,            # Set True only if bootstrap=True
    n_jobs=-1,
    random_state=42,
    warm_start=False,
    class_weight=None           # 'balanced', dict
)

Tuning priority:

1. n_estimators:     500+ for production
2. max_features:     'sqrt' (classification) or 1.0 (regression)
3. min_samples_leaf: Tune for overfitting (1–50)
4. bootstrap:        True if OOB needed; False otherwise

12. Assumptions

Assumption Notes
IID samples Standard; no bootstrap means no resampling diversity
No feature scaling required Scale-invariant (tree splits, even random ones)
No distributional assumption Non-parametric
No extrapolation Like all tree-based models
Threshold range covers signal Random threshold drawn from [min, max] β€” extreme optimal thresholds may be missed

13. Advantages

βœ… Fastest Tree Ensemble Training

No sorting, no bootstrap β€” 2–5Γ— faster than Random Forest with equivalent parameters.

βœ… Lower Variance Than Random Forest

Random thresholds further reduce inter-tree correlation beyond bootstrap + feature subsets.

βœ… Effective Noise Regularization

Random thresholds resist fitting noise-induced optimal split points that don't generalize.

βœ… Full sklearn Compatibility

Identical API to Random Forest β€” drop-in replacement for speed experiments.

βœ… Parallelizable

Fully independent trees; linear speedup with CPU cores.

βœ… Competitive Accuracy

Usually within 1% of Random Forest β€” an excellent trade for 2–5Γ— training speed.


14. Drawbacks & Limitations

❌ No OOB Evaluation by Default

Without bootstrap, no free validation estimate β€” requires an explicit validation split.

❌ Slightly Higher Bias

Random thresholds are suboptimal β€” fine for most data, but problematic when precise boundaries matter.

❌ MDI Less Reliable

Feature importances from random-threshold splits are noisier than Random Forest's MDI.

❌ Weaker on Small Datasets

Bootstrap's stabilizing effect on small samples is absent β€” RF is more appropriate below ~5k samples.

❌ No Extrapolation

Same tree-based limitation as Random Forest.


15. Extra-Trees vs. Random Forest vs. Bagging

Property Extra-Trees Random Forest Bagging Classifier
Bootstrap ❌ No (default) βœ… Yes βœ… Yes
Feature subset βœ… √p per split βœ… √p per split ❌ All features
Threshold ❌ Random βœ… Optimal βœ… Optimal (base tree)
Training speed βœ…βœ… Fastest βœ… Fast βœ… Fast
Variance βœ… Lowest βœ… Low Medium
Bias Slightly higher Low Same as base learner
OOB ❌ No βœ… Yes βœ… Yes
Any base learner ❌ Trees only ❌ Trees only βœ… Any
Best for Speed-constrained General purpose Custom base learners

16. Practical Tips & Gotchas

Drop-in Speed Test

import time
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

for name, clf in [
    ('RF',  RandomForestClassifier(n_estimators=500, n_jobs=-1)),
    ('ET',  ExtraTreesClassifier(n_estimators=500, n_jobs=-1))
]:
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - t0
    score = clf.score(X_test, y_test)
    print(f"{name}: {score:.4f} accuracy, {train_time:.1f}s training")

Enable OOB When Needed

etc = ExtraTreesClassifier(
    n_estimators=500,
    bootstrap=True,    # Adds bootstrap to Extra-Trees
    oob_score=True,
    n_jobs=-1
)
etc.fit(X_train, y_train)
print(f"OOB: {etc.oob_score_:.4f}")

Use as Fast Feature Selector

from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=200, n_jobs=-1),
    threshold='mean'
)
X_selected = selector.fit_transform(X_train, y_train)
print(f"Selected {X_selected.shape[1]} of {X_train.shape[1]} features")

Tune min_samples_leaf for Regularization

from sklearn.model_selection import cross_val_score

for msl in [1, 2, 5, 10, 20, 50]:
    etc = ExtraTreesClassifier(n_estimators=200, min_samples_leaf=msl, n_jobs=-1)
    score = cross_val_score(etc, X_train, y_train, cv=5, scoring='roc_auc').mean()
    print(f"min_samples_leaf={msl}: {score:.4f}")

17. When to Use It

Use Extra-Trees when:

Use Random Forest instead when:


Summary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  EXTRA-TREES AT A GLANCE                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  RANDOMNESS    Random feature subsets + random split thresholds     β”‚
β”‚  NO BOOTSTRAP  Full training dataset per tree (no OOB by default)   β”‚
β”‚  SPEED         O(BΒ·mΒ·KΒ·depth) vs O(BΒ·mΒ·KΒ·log(m)Β·depth) for RF     β”‚
β”‚  VARIANCE      Lower than RF β€” random thresholds reduce ρ further   β”‚
β”‚  BIAS          Slightly higher β€” suboptimal thresholds              β”‚
β”‚  OOB           ❌ Not available unless bootstrap=True               β”‚
β”‚  STRENGTH      Speed, noise robustness, variance reduction          β”‚
β”‚  WEAKNESS      No OOB, slightly higher bias, noisier MDI            β”‚
β”‚  vs RF         Usually ≀ 1% accuracy gap; 2–5Γ— faster training      β”‚
β”‚  BEST FOR      Large/noisy datasets, speed-constrained environments β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Extra-Trees proves that in ensemble learning, the ensemble mechanism matters more than the quality of any individual split. Random thresholds lose almost nothing in accuracy while saving enormous compute β€” an empirical demonstration that the information in individual splits is largely redundant when you're building hundreds of trees.

Powered by Forestry.md