SAMME & SAMME.R

Stagewise Additive Modeling using Multi-class Exponential Loss

1. What Are SAMME and SAMME.R?

SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss) and SAMME.R (the "Real" variant) are the canonical extensions of AdaBoost to K ≥ 2 class problems. They were introduced by Zhu, Zou, Rosset, and Hastie in 2009.

The key insight of the paper was that naive extensions of binary AdaBoost to multi-class problems either ignore the structure of the K-class problem or require solving K binary subproblems (one-vs-rest, OvR). SAMME derives from first principles what the alpha formula must be for a direct multi-class boosting algorithm, and SAMME.R extends this to use probabilistic weak learner outputs, which typically speeds up convergence.

Property	SAMME	SAMME.R
Weak learner output	Hard class labels	Class probability estimates
Learner requirement	Only `predict`	Requires `predict_proba`
Convergence speed	Slower (one update per round)	Faster (uses full probability info)
Sensitivity to calibration	Low	High — needs calibrated probs
Alpha formula	log((1−εₜ)/εₜ) + log(K−1)	Not a single scalar: vector update
sklearn default	SAMME (SAMME.R deprecated in sklearn 1.4, removed in 1.6)	Was the default through sklearn 1.5

2. The Problem: Extending AdaBoost to K > 2 Classes

Binary AdaBoost encodes labels as y ∈ {−1, +1} and classifiers as h(x) ∈ {−1, +1}. The final model is:

H(x) = sign( Σₜ αₜ hₜ(x) )

For K classes, the naive approaches have problems:

One-vs-Rest (OvR): Train K binary classifiers, each distinguishing class k from all others. Inefficient, doesn't naturally produce a coherent multi-class margin.

Naive extension: Just add more alpha terms. But what should alpha be for K classes? The binary derivation explicitly uses y ∈ {−1,+1} — it doesn't generalize directly.

AdaBoost.M1: The earliest multi-class extension — applies binary AdaBoost with K-class labels directly. Works only when ε < 0.5, which becomes harder to guarantee as K grows. Fails completely if any round produces ε ≥ 0.5.

AdaBoost.MH / AdaBoost.MO: Various alternatives that reformulate as multi-label or output code problems. Complex and not as principled.

SAMME solves this by re-deriving the alpha formula from the correct multi-class exponential loss rather than borrowing from the binary case.

3. Mathematical Foundation — Multi-Class Exponential Loss

3.1 The Binary AdaBoost Loss (Recap)

Binary AdaBoost minimizes the exponential loss:

L(y, f(x)) = exp(−y · f(x))    y ∈ {−1, +1}

The population minimizer (the Bayes-optimal f) satisfies:

f*(x) = ½ · log( P(y=1|x) / P(y=−1|x) )    (half the log-odds)

This connection to log-odds gives binary AdaBoost its probabilistic interpretation.

3.2 The Multi-Class Exponential Loss

For K classes, we represent predictions and labels as K-dimensional vectors subject to a sum-to-zero constraint.

Label encoding: for a sample with true class k, define:

yᵢ = (yᵢ₁, yᵢ₂, ..., yᵢK)ᵀ    where yᵢₖ = { 1       if class = k
                                                { −1/(K−1) otherwise

This encoding has two key properties:

The true class gets label +1
All other classes get label −1/(K−1)
The sum Σₖ yᵢₖ = 1 + (K−1)·(−1/(K−1)) = 1 − 1 = 0 ✓
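The encoding above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming integer labels 0..K−1 (the text uses 1..K); the helper name `encode_labels` is hypothetical.

```python
import numpy as np

def encode_labels(y, K):
    """Symmetric coding: +1 for the true class, -1/(K-1) for every other class.

    Assumes integer labels 0..K-1 (an assumption of this sketch).
    """
    Y = np.full((len(y), K), -1.0 / (K - 1))
    Y[np.arange(len(y)), y] = 1.0
    return Y

Y = encode_labels(np.array([0, 2, 1]), K=3)
print(Y)               # each row has one +1 and two -0.5 entries
print(Y.sum(axis=1))   # every row sums to zero
```

For K = 2 the same function returns rows of the form (+1, −1), which is the usual binary ±1 coding written as a two-component vector.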

The multi-class exponential loss for a vector-valued classifier f(x) = (f₁(x), ..., fK(x))ᵀ (also sum-to-zero):

L(y, f(x)) = exp( −(1/K) · yᵀf(x) )
           = exp( −(1/K) · Σₖ yₖ fₖ(x) )

The 1/K scaling keeps the loss comparable across different K: for K = 2 the encoding gives yᵀf = 2·y·f₁, so the loss reduces exactly to the binary exponential loss exp(−y·f₁).

3.3 Margin in the Multi-Class Setting

The population minimizer of the multi-class exponential loss is:

fₖ*(x) = (K−1) · [ log P(y=k|x) − (1/K) Σⱼ log P(y=j|x) ]     for each class k

This is (K−1) times the centered log-probability: the log-probability for class k, centered by subtracting the mean log-probability across all classes. For K = 2 it reduces to the half log-odds of Section 3.1.

The final prediction rule:

H(x) = argmax_k fₖ*(x) = argmax_k log P(y=k|x) = argmax_k P(y=k|x)

The Bayes-optimal multi-class decision is to predict the class with the highest posterior probability — consistent with the binary case.
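As a sanity check, the population minimizer can be verified numerically. The sketch below assumes a hypothetical posterior p = (0.6, 0.3, 0.1) and uses the form f* = (K−1)·(log p − mean log p) from Zhu et al.; it compares the expected loss at f* against random sum-to-zero perturbations.

```python
import numpy as np

def expected_loss(f, p):
    """E[exp(-(1/K) y^T f) | x] for class probabilities p and sum-to-zero f.

    With the symmetric coding and sum(f) = 0, y^T f = K/(K-1) * f_k when the
    true class is k, so the per-class loss is exp(-f_k / (K-1)).
    """
    K = len(p)
    return float(np.sum(p * np.exp(-f / (K - 1))))

p = np.array([0.6, 0.3, 0.1])          # hypothetical posterior P(y=k|x)
K = len(p)
log_p = np.log(p)
f_star = (K - 1) * (log_p - log_p.mean())

rng = np.random.default_rng(0)
best = expected_loss(f_star, p)
perturbed = []
for _ in range(1000):
    d = rng.normal(scale=0.1, size=K)
    d -= d.mean()                       # stay on the sum-to-zero constraint
    perturbed.append(expected_loss(f_star + d, p))

print(f_star, f_star.sum())
print(best <= min(perturbed))           # no perturbation does better
print(np.argmax(f_star) == np.argmax(p))
```

Because the expected loss is convex in f, a stationary point on the constraint is the global minimum, which is what the perturbation test illustrates.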

4. SAMME — Hard Label Multi-Class Boosting

4.1 Derivation of the Alpha Formula

At boosting round t, we have the current additive model F_{t-1}(x) and fit a new weak classifier hₜ(x) ∈ {1, 2, ..., K} with weight αₜ.

The weighted error rate is:

εₜ = Σᵢ wᵢ · 𝟙[hₜ(xᵢ) ≠ yᵢ]    (sum of weights on misclassified samples)

The weighted correct rate: 1 − εₜ.

To find the optimal αₜ, we minimize the exponential loss of the updated model. Using the FSAM framework (Forward Stagewise Additive Modeling), αₜ solves:

αₜ = argmin_α  Σᵢ exp( −(1/K) · yᵢᵀ [F_{t-1}(xᵢ) + α · T(hₜ(xᵢ))] )

Where T(k) is the indicator vector encoding: T(k)ₖ = 1, T(k)ⱼ = −1/(K−1) for j ≠ k.

Working through the algebra (noting that yᵢᵀT(hₜ(xᵢ)) = K/(K−1) if hₜ(xᵢ) = yᵢ, else −K/(K−1)²):

After substitution and differentiation with respect to α:

∂/∂α Σᵢ wᵢ · exp(−(1/K) · yᵢᵀ α T(hₜ(xᵢ))) = 0

This yields:

αₜ = (K−1)²/K · ln((1 − εₜ)/εₜ) + (K−1)²/K · ln(K−1)
   = (K−1)²/K · [ ln((1−εₜ)/εₜ) + ln(K−1) ]

The (K−1)²/K factor is a positive constant shared by every round, so it drops out of the argmax at prediction time. The standard form drops it:

αₜ = ln((1−εₜ)/εₜ) + ln(K−1)
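The derivation can be checked numerically. The sketch below, with hypothetical values K = 5 and ε = 0.6, computes the two inner products yᵀT(h) directly, minimizes the stagewise loss on a fine grid, and compares the result with the unscaled closed form (K−1)²/K · [ln((1−ε)/ε) + ln(K−1)], of which the standard alpha is the bracketed part.

```python
import numpy as np

def encode(k, K):
    """Symmetric coding vector T(k): +1 at position k, -1/(K-1) elsewhere."""
    v = np.full(K, -1.0 / (K - 1))
    v[k] = 1.0
    return v

K, eps = 5, 0.6                          # hypothetical example values

# Inner products y^T T(h) for a correct and an incorrect prediction.
same = encode(0, K) @ encode(0, K)       # equals K/(K-1)
diff = encode(0, K) @ encode(1, K)       # equals -K/(K-1)^2

def stage_loss(alpha):
    # Total weight (1 - eps) sits on correct samples, eps on incorrect ones.
    return (1 - eps) * np.exp(-alpha * same / K) + eps * np.exp(-alpha * diff / K)

alphas = np.linspace(-5, 30, 350001)     # grid step 1e-4
alpha_grid = alphas[np.argmin(stage_loss(alphas))]
alpha_closed = (K - 1) ** 2 / K * (np.log((1 - eps) / eps) + np.log(K - 1))
print(same, diff)
print(alpha_grid, alpha_closed)          # agree up to the grid resolution
```

Note that ε = 0.6 exceeds 0.5 yet still yields a positive weight, because 0.6 is below the K = 5 random-guessing error of 0.8.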

4.2 The Critical ln(K−1) Term

The ln(K−1) term is what distinguishes SAMME from a naive multi-class extension. Compare:

Binary AdaBoost:  αₜ = ½ · ln((1−εₜ)/εₜ)                    (positive iff εₜ < 0.5)
SAMME:            αₜ = ln((1−εₜ)/εₜ) + ln(K−1)              (positive iff εₜ < 1−1/K)

For SAMME, αₜ > 0 if and only if:

ln((1−εₜ)/εₜ) + ln(K−1) > 0
⟺  (1−εₜ)(K−1) > εₜ
⟺  K−1 > K·εₜ
⟺  εₜ < 1 − 1/K

Interpretation: A K-class random classifier achieves error rate (K−1)/K. SAMME requires each weak learner to beat random guessing — εₜ < 1 − 1/K — which is the correct threshold for multi-class problems (not 0.5, which only applies to binary).

K	Random error	Required threshold	ln(K−1)
2	0.50	< 0.50	0.000
3	0.67	< 0.67	0.693
5	0.80	< 0.80	1.386
10	0.90	< 0.90	2.197
100	0.99	< 0.99	4.605

Without the ln(K−1) correction, the alpha formula would require εₜ < 0.5, which is far too strict for multi-class problems: with 10 classes, a weak learner with error rate 0.7 is still well below the 0.9 error of random guessing and therefore useful.
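A short numerical comparison makes the threshold concrete. This sketch uses the two alpha formulas from the comparison above with the example values K = 10 and ε = 0.7; the function names are illustrative.

```python
import math

def binary_alpha(eps):
    """Binary AdaBoost learner weight."""
    return 0.5 * math.log((1 - eps) / eps)

def samme_alpha(eps, K):
    """SAMME learner weight with the ln(K-1) correction."""
    return math.log((1 - eps) / eps) + math.log(K - 1)

eps, K = 0.7, 10
a_bin = binary_alpha(eps)
a_samme = samme_alpha(eps, K)
a_edge = samme_alpha(1 - 1 / K, K)

print(a_bin)     # negative: the binary rule would reject this learner
print(a_samme)   # positive: error 0.7 beats the 0.9 of random guessing
print(a_edge)    # approximately zero at the random-guessing threshold
```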

4.3 Weight Update Rule

After computing αₜ, update sample weights:

wᵢ ← wᵢ · exp(αₜ · 𝟙[hₜ(xᵢ) ≠ yᵢ])

Note: the update only involves whether hₜ misclassifies sample i — not which class it predicted. This is a simplification that SAMME inherits from the hard-label setting; SAMME.R will use the full probability vector.

Normalize: wᵢ ← wᵢ / Σᵢ wᵢ.
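The update and normalization can be traced on a toy example. The sketch below assumes five equally weighted samples, K = 3, and a hypothetical misclassification mask. It also shows a consequence of the formula: after the update, the misclassified samples carry total weight 1 − 1/K, so the learner just added looks no better than random guessing under the new weights.

```python
import numpy as np

K = 3
w = np.full(5, 0.2)                                   # uniform starting weights
miss = np.array([False, False, True, False, True])    # hypothetical 1[h(x_i) != y_i]

eps = w[miss].sum()                                   # weighted error, here 0.4
alpha = np.log((1 - eps) / eps) + np.log(K - 1)

w = w * np.exp(alpha * miss)                          # only misclassified weights grow
w /= w.sum()                                          # renormalize to sum to 1

print(w)
print(w[miss].sum(), 1 - 1 / K)                       # both equal 2/3
```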

4.4 Full SAMME Algorithm

Input: Training data {(x₁,y₁),...,(xₘ,yₘ)}, yᵢ ∈ {1,...,K}
       Number of rounds T, weak learner WL

Initialize: wᵢ = 1/m  for all i

For t = 1 to T:

    1. Train weak learner on weighted data:
       hₜ = WL( {(xᵢ, yᵢ, wᵢ)} )

    2. Compute weighted error:
       εₜ = Σᵢ wᵢ · 𝟙[hₜ(xᵢ) ≠ yᵢ]

    3. If εₜ ≥ 1 − 1/K: stop or resample

    4. Compute learner weight:
       αₜ = ln((1−εₜ)/εₜ) + ln(K−1)

    5. Update sample weights:
       wᵢ ← wᵢ · exp(αₜ · 𝟙[hₜ(xᵢ) ≠ yᵢ])
       Normalize: wᵢ ← wᵢ / Σⱼ wⱼ

Output: H(x) = argmax_k  Σₜ αₜ · 𝟙[hₜ(x) = k]

The final decision: for each class k, sum the alpha weights of all rounds where hₜ predicted k. Predict the class with the highest total weight.
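The algorithm above can be sketched from scratch in a few dozen lines. This is a minimal illustration rather than a reference implementation: it assumes scikit-learn decision stumps as the weak learner and the iris dataset as example data, and the names `samme_fit` and `samme_predict` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, n_rounds=50):
    classes = np.unique(y)
    K, m = len(classes), len(y)
    w = np.full(m, 1.0 / m)                       # initialize uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Step 1: train a stump on the weighted data.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        # Step 2: weighted error (weights already sum to 1).
        eps = w[miss].sum()
        # Step 3: stop if the learner is no better than random guessing.
        if eps >= 1 - 1 / K:
            break
        eps = max(eps, 1e-10)                     # guard against log(1/0)
        # Step 4: learner weight with the ln(K-1) correction.
        alpha = np.log((1 - eps) / eps) + np.log(K - 1)
        # Step 5: upweight misclassified samples, then normalize.
        w = w * np.exp(alpha * miss)
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return classes, learners, np.array(alphas)

def samme_predict(model, X):
    classes, learners, alphas = model
    votes = np.zeros((len(X), len(classes)))
    for h, a in zip(learners, alphas):
        pred = h.predict(X)
        # Add alpha to the column of the class each learner predicted.
        votes += a * (pred[:, None] == classes[None, :])
    return classes[np.argmax(votes, axis=1)]

X, y = load_iris(return_X_y=True)
model = samme_fit(X, y, n_rounds=50)
train_acc = np.mean(samme_predict(model, X) == y)
print(f"rounds used: {len(model[1])}, training accuracy: {train_acc:.3f}")
```

A single stump can predict at most two of the three iris classes, so every individual learner is far from accurate; the ensemble only works because the threshold is 1 − 1/K rather than 0.5.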

5. SAMME.R — Soft Probability Boosting

5.1 Why Use Probabilities Instead of Labels?

SAMME uses only the binary signal "correct/incorrect" from each weak learner — it ignores how confident the learner is. A learner that correctly predicts class 3 with probability 0.51 and one that predicts it with probability 0.99 both contribute equally under SAMME.

SAMME.R exploits the full probability vector p̂(x) = (p̂₁(x), ..., p̂K(x)) from the weak learner. The probability vector carries much more information than the hard label — particularly about which wrong classes are being confused.

Key requirement: The weak learner must implement predict_proba and produce reasonably calibrated probability estimates.

5.2 The SAMME.R Update Derivation

Instead of fitting a scalar weight αₜ times a hard-label indicator, SAMME.R fits a vector-valued update directly in the K-dimensional output space.

At each round t, the update to the additive model is a vector function h̃ₜ(x):

h̃ₜ(x)ₖ = (K−1) · [ log p̂ₖ(x) − (1/K) Σⱼ log p̂ⱼ(x) ]

This is (K−1) times the centered log-probability of the weak learner's output, which is the form the population minimizer of the multi-class exponential loss takes (Section 3.3) with the true posterior replaced by p̂.

Why this form? The FSAM framework asks: given the current model F_{t-1}(x), what vector-valued function h̃ minimizes the weighted exponential loss?

The solution is:

h̃*(x) = argmin_{h̃} Σᵢ wᵢ · exp(−(1/K) · yᵢᵀ h̃(x))
       subject to: Σₖ h̃ₖ(x) = 0

When the weak learner produces probability estimates p̂ₖ(x), the optimal h̃ evaluated at the current weights is exactly the centered log-probability formula above.

Model update:

F_t(x) = F_{t-1}(x) + ν · h̃ₜ(x)       (ν = learning rate, default 1)

There is no scalar αₜ to compute: the update magnitude is embedded in the log-probability magnitudes. A confident prediction (p̂ₖ → 1) gives a large update; an uncertain prediction (p̂ₖ ≈ 1/K for all k) gives an update near zero.
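The effect of confidence on the update size is easy to see numerically. The sketch below computes the vector update for three hypothetical probability rows, using the (K−1) scaling of Zhu et al.; the function name `sammer_update` and the clipping constant are assumptions of this example.

```python
import numpy as np

def sammer_update(proba, eps=1e-12):
    """Row-wise (K-1) * (log p_k - mean_j log p_j); every row sums to zero."""
    K = proba.shape[1]
    log_p = np.log(np.clip(proba, eps, None))     # clip to avoid log(0)
    return (K - 1) * (log_p - log_p.mean(axis=1, keepdims=True))

proba = np.array([
    [0.98, 0.01, 0.01],    # confident prediction
    [0.40, 0.35, 0.25],    # hesitant prediction
    [1 / 3, 1 / 3, 1 / 3], # uniform: carries no information
])
h = sammer_update(proba)
print(np.round(h, 3))
print(h.sum(axis=1))       # rows sum to zero
```

The first row produces an update roughly ten times larger than the second, and the uniform row produces no update at all.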

5.3 Full SAMME.R Algorithm

Input: Training data {(xᵢ, yᵢ)}, K classes
       Number of rounds T, weak learner WL with predict_proba

Initialize: wᵢ = 1/m  for all i

For t = 1 to T:

    1. Train weak learner on weighted data:
       hₜ = WL( {(xᵢ, yᵢ, wᵢ)} )

    2. Get class probability estimates:
       p̂ₖ(xᵢ) = P̂(y=k | xᵢ)  for all i, k    (from hₜ.predict_proba)
       Clip to avoid log(0): p̂ₖ ← max(p̂ₖ, ε)

    3. Compute vector update for each sample:
       h̃ₜ(xᵢ)ₖ = (K−1) · [log p̂ₖ(xᵢ) − (1/K) Σⱼ log p̂ⱼ(xᵢ)]

    4. Update model:
       Fₜ(xᵢ)ₖ = F_{t-1}(xᵢ)ₖ + h̃ₜ(xᵢ)ₖ   for all k

    5. Update sample weights:
       wᵢ ← wᵢ · exp(−(K−1)/K · Σₖ yᵢₖ · log p̂ₖ(xᵢ))
       Normalize: wᵢ ← wᵢ / Σⱼ wⱼ

Output: H(x) = argmax_k  F_T(x)ₖ
        P(y=k|x) ∝ exp(F_T(x)ₖ / (K−1))   (softmax of F_T/(K−1); implementations may average F_T over rounds first)

The weight update in step 5 uses a cross-entropy-like inner product between the symmetric label coding yᵢ and the predicted log-probabilities: samples that are confidently and correctly classified get downweighted most; samples that are confidently wrong get upweighted most.
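A from-scratch sketch of the SAMME.R loop follows. It is an illustration under stated assumptions, not scikit-learn's implementation: the weak learner is a scikit-learn decision stump, the data is iris, the clipping constant 1e-5 is arbitrary, the update uses the (K−1) scaling of Zhu et al., and the names `sammer_fit` and `sammer_decision` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def clipped_log_proba(h, X, clip=1e-5):
    """Log of predict_proba, clipped away from zero to avoid log(0)."""
    return np.log(np.clip(h.predict_proba(X), clip, None))

def sammer_fit(X, y, n_rounds=50):
    classes = np.unique(y)
    K, m = len(classes), len(y)
    # Symmetric label coding: +1 for the true class, -1/(K-1) otherwise.
    Y = np.where(y[:, None] == classes[None, :], 1.0, -1.0 / (K - 1))
    w = np.full(m, 1.0 / m)
    learners = []
    for _ in range(n_rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        log_p = clipped_log_proba(h, X)
        # Step 5: weight update from the label coding and log-probabilities.
        w = w * np.exp(-(K - 1) / K * np.sum(Y * log_p, axis=1))
        w /= w.sum()
        learners.append(h)
    return classes, learners

def sammer_decision(model, X):
    classes, learners = model
    K = len(classes)
    F = np.zeros((len(X), K))
    for h in learners:
        log_p = clipped_log_proba(h, X)
        # Steps 3 and 4: add the centered, scaled log-probability vector.
        F += (K - 1) * (log_p - log_p.mean(axis=1, keepdims=True))
    return F

X, y = load_iris(return_X_y=True)
model = sammer_fit(X, y)
F = sammer_decision(model, X)
pred = model[0][np.argmax(F, axis=1)]
train_acc = np.mean(pred == y)
print(f"training accuracy: {train_acc:.3f}")
```

The columns of `predict_proba` follow the fitted tree's `classes_`, which matches `np.unique(y)` here because every round trains on the full label set.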

6. SAMME vs. SAMME.R — Deep Comparison

Property	SAMME	SAMME.R
Weak learner output	Hard labels hₜ(x) ∈ {1, ..., K}	Probability vector p̂(x) in the probability simplex over K classes
Update type	Scalar α × indicator vector	Vector-valued log-probability update
Alpha formula	ln((1−ε)/ε) + ln(K−1)	None — magnitude embedded in log-probs
Weight update signal	Binary correct/incorrect	Full probability distribution
Information used	Which class was predicted (hard)	How confident + which class (soft)
Convergence speed	Slower — one bit of info per round	Faster — K−1 dimensions of info per round
Calibration sensitivity	None — only uses argmax	High — needs calibrated p̂
Weak learner requirement	Only `predict`	Must have `predict_proba`
Theoretical basis	FSAM with multi-class exp loss	FSAM — optimal vector step
With miscalibrated probs	Unaffected	Can underperform SAMME
Sklearn support	✅ `algorithm='SAMME'`	⚠️ `algorithm='SAMME.R'` (default through 1.5; deprecated in 1.4, removed in 1.6)

Key empirical finding (Zhu et al., 2009): SAMME.R typically reaches lower test error in fewer boosting rounds than SAMME. The often-quoted figure of 2–5× fewer rounds for the same accuracy is a rule of thumb, not a guarantee; it depends on the dataset and on how well calibrated the base learner is.

When SAMME is preferred:

Weak learner does not implement predict_proba
Probability estimates are known to be poorly calibrated
Extra rounds are cheap (stumps train so fast that SAMME's slower convergence costs little wall-clock time)

7. Connection to Forward Stagewise Additive Modeling (FSAM)

Both SAMME and SAMME.R are instances of Forward Stagewise Additive Modeling:

At each stage t, find (αₜ, hₜ) to minimize:
    Σᵢ L(yᵢ, F_{t-1}(xᵢ) + αₜ · hₜ(xᵢ))

Don't go back and adjust previous stages.

This greedy one-stage-at-a-time approach is what makes both algorithms tractable. Optimizing all stages simultaneously would be intractable.

Connection to gradient boosting:

Both SAMME and SAMME.R can be viewed as gradient boosting with the multi-class exponential loss:

SAMME:   Fits trees to hard-label approximation of negative gradient
SAMME.R: Fits trees to exact negative gradient direction (probability-weighted)

SAMME.R is closer to ideal gradient boosting: it uses the real-valued probability output of the weak learner, whereas SAMME approximates the step with a hard label.

This is how SAMME.R relates to gradient boosted trees. Note that sklearn's GradientBoostingClassifier accepts loss='exponential' only for binary classification (where it recovers AdaBoost); for multi-class problems it supports only the log-loss.

8. Convergence Properties

SAMME Training Error Bound

By analogy with binary AdaBoost, the training error of SAMME is usually described as decaying exponentially:

Training error ≤ exp( −2 Σₜ γₜ² )

Where γₜ = (1−εₜ) − 1/K is the "edge" of the weak learner above random chance.

As long as each weak learner has γₜ > 0 (beats random guessing), training error decreases exponentially — the same qualitative result as binary AdaBoost.

SAMME.R Convergence

SAMME.R has a tighter convergence bound because it uses the full probability vector. Each round of SAMME.R reduces the multi-class exponential loss by at least:

ΔL ≥ (K−1)²/K · (KL divergence between true class distribution and p̂)

When weak learners have good probability estimates (high KL divergence from uniform), SAMME.R reduces loss much faster per round than SAMME.

9. Decision Boundary Geometry

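Both SAMME and SAMME.R produce axis-aligned, piecewise-constant decision regions when the base learners are decision trees. Stumps split on one feature at a time; deeper trees capture feature interactions and finer partitions, but the boundaries remain unions of axis-parallel segments, identical in structure to binary AdaBoost.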

The K-class boundary between class j and class k is where:

Σₜ αₜ · 𝟙[hₜ(x) = j]  =  Σₜ αₜ · 𝟙[hₜ(x) = k]     (SAMME)
F_T(x)_j = F_T(x)_k                                    (SAMME.R)

For SAMME.R, the output F_T(x) is a K-dimensional vector, and the boundaries are where two components are equal — forming a partition of the feature space into K Voronoi-like regions in the function space.

With many rounds and complex base learners, both methods can approximate arbitrarily complex multi-class boundaries.

10. The Bias-Variance Profile

Configuration	Bias	Variance	Notes
Few rounds (T small)	High	Low	Ensemble too simple
Many rounds (T large, clean)	Low	Low	Margin increases, good generalization
Many rounds (T large, noisy)	Low	High	Noisy samples upweighted → overfit
SAMME.R + miscalibrated probs	High	Medium	Wrong probability estimates mislead

The margin theory for SAMME extends binary AdaBoost's result: the multi-class generalization error is bounded in terms of the distribution of multi-class margins. Adding rounds tends to keep increasing the margins even after training error reaches zero, which helps explain why continued training often does not overfit on clean data.

11. Assumptions

Assumption	SAMME	SAMME.R
Weak learner beats random	εₜ < 1 − 1/K	p̂ better than uniform
IID samples	✅ Required	✅ Required
Clean labels	✅ Sensitive to noise	✅ Even more sensitive
Calibrated probabilities	Not required	✅ Required for SAMME.R
Sufficient weak learner capacity	✅ Error must stay below 1−1/K	✅ Must produce informative p̂
Feature scaling	Not needed with tree-based base learners	Same

12. Evaluation Metrics

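from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Example data (an assumption for illustration; any K-class dataset works)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# SAMME
clf_samme = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm='SAMME',
    random_state=42
)

# SAMME.R (requires predict_proba in base estimator).
# 'SAMME.R' was deprecated in scikit-learn 1.4 and removed in 1.6,
# so this estimator only runs on versions 1.2 to 1.5.
clf_sammer = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm='SAMME.R',
    random_state=42
)

clf_samme.fit(X_train, y_train)
clf_sammer.fit(X_train, y_train)

# Metrics
y_pred  = clf_sammer.predict(X_test)
y_proba = clf_sammer.predict_proba(X_test)

print(classification_report(y_test, y_pred))
print(f"Log Loss: {log_loss(y_test, y_proba):.4f}")

# Staged evaluation: accuracy after each boosting round, used to pick T
# (in practice, select T on a validation set rather than the test set)
staged_acc = [accuracy_score(y_test, p)
              for p in clf_sammer.staged_predict(X_test)]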

13. Advantages

✅ Principled Multi-Class Extension

SAMME is not an ad hoc extension of binary AdaBoost — it is derived from the correct multi-class exponential loss. The ln(K−1) correction is mathematically necessary and theoretically justified.

✅ SAMME.R Faster Convergence

SAMME.R typically needs 2–5× fewer rounds to reach the same accuracy as SAMME, because it uses the full probability vector rather than a single bit per round.

✅ Native K-Class Support

No OvR or OvO strategies needed. The algorithm directly optimizes K-class loss — avoiding the calibration problems and computational overhead of decomposition approaches.

✅ Interpretable Weak Learners

Decision stumps remain interpretable. Each round adds one stump, and the final model is a weighted sum over T simple rules.

✅ Proven Convergence Guarantees

SAMME has the same exponential training error decay as binary AdaBoost. Well-understood theoretically.

✅ Staged Prediction Available

staged_predict / staged_predict_proba allow learning curve analysis and optimal round selection — same as binary AdaBoost.

✅ Probability Output

SAMME.R's output F_T(x) lives on a scaled, centered log-probability scale, so a softmax converts it into class probabilities. How well calibrated those probabilities are depends on the calibration of the base learners.

14. Drawbacks & Limitations

❌ Sensitive to Label Noise

Inherits binary AdaBoost's catastrophic sensitivity to noisy labels. In K-class settings, this is even more problematic because there are K−1 ways to be wrong. A mislabeled sample will be upweighted exponentially.

❌ SAMME.R Requires Calibrated Probabilities

If the base learner's probability estimates are poorly calibrated (common with decision stumps), SAMME.R's weight updates are based on wrong information. A stump that produces probability 0.99/0.01 for all predictions despite being only 55% accurate will produce massive, misleading updates.

Check calibration:

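from sklearn.calibration import calibration_curve
# proba: (n_samples, K) array from predict_proba; k: column index of the class to check
# (assumes integer labels 0..K-1, so that column k corresponds to label k)
frac_pos, mean_pred = calibration_curve((y_test == k).astype(int), proba[:, k], n_bins=10)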

❌ Sequential Training

Cannot parallelize across rounds — each depends on the weights from the previous.

❌ Outperformed by Gradient Boosting

For most multi-class tabular problems, gradient boosted trees (XGBoost, LightGBM with K-class softmax) outperform SAMME/SAMME.R. Gradient boosting extends the stagewise idea from the exponential loss to arbitrary differentiable losses, and XGBoost and LightGBM additionally use second-order (Newton-style) updates.

❌ Limited Regularization

No native L1/L2 regularization, no subsampling, no feature sampling. The learning_rate parameter provides shrinkage, but the regularization toolkit is thin compared to GBT.

15. SAMME/SAMME.R vs. Other Multi-Class Methods

Property	SAMME	SAMME.R	GBT Softmax	OvR LR	OvO SVM
Native multi-class	✅	✅	✅	❌ (K models)	❌ (K(K-1)/2)
Base learner type	Any	Any + proba	Trees	Linear	SVM
Noise robustness	❌ Poor	❌ Poor	⚠️ Moderate	✅ Good	✅ Good
Calibrated probs	⚠️ Moderate	✅ Good (if calib)	✅ Good	✅ Good	❌ Poor
Training speed	✅ Fast	✅ Fast	⚠️ Moderate	✅ Fast	❌ Slow
Accuracy (tabular)	⚠️ Good	✅ Good	✅✅ Best	⚠️ Moderate	⚠️ Moderate
Many classes (K>10)	✅ One K-class tree per round	✅ Same	⚠️ Slower (K trees/round)	⚠️ K models	❌ K² models
Overfitting (noisy)	❌ High	❌ High	⚠️ Moderate	✅ Low	✅ Low

16. Practical Tips & Gotchas

Basic Setup

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# SAMME — use when base learner has no predict_proba
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=300,
    learning_rate=0.5,
    algorithm='SAMME',
    random_state=42
)

# SAMME.R: use when base learner has calibrated predict_proba
# (scikit-learn 1.2 to 1.5 only; 'SAMME.R' was removed in 1.6)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=300,
    learning_rate=0.5,
    algorithm='SAMME.R',
    random_state=42
)
clf.fit(X_train, y_train)

Find Optimal Number of Rounds

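import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# algorithm='SAMME.R' needs scikit-learn < 1.6; use 'SAMME' on newer versions
clf = AdaBoostClassifier(n_estimators=500, algorithm='SAMME.R', random_state=42)
clf.fit(X_train, y_train)

# Validation accuracy after each boosting round
staged_val = [accuracy_score(y_val, p)
              for p in clf.staged_predict(X_val)]

optimal_T = int(np.argmax(staged_val)) + 1   # rounds are 1-indexed
print(f"Optimal rounds: {optimal_T}")

# Retrain with optimal T (same random_state so the first T rounds match)
clf_final = AdaBoostClassifier(n_estimators=optimal_T, algorithm='SAMME.R', random_state=42)
clf_final.fit(X_train, y_train)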

Check if SAMME.R Is Appropriate

from sklearn.calibration import calibration_curve
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Check calibration of a single stump
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X_train, y_train)
proba = stump.predict_proba(X_val)

# One-vs-rest calibration curve per class; columns of proba follow stump.classes_
for idx, k in enumerate(stump.classes_):
    frac_pos, mean_pred = calibration_curve(
        (y_val == k).astype(int), proba[:, idx], n_bins=5
    )
    plt.plot(mean_pred, frac_pos, marker='o', label=f'Class {k}')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect')
plt.legend()
plt.title('Stump Calibration: Is SAMME.R Appropriate?')
plt.show()

If calibration is poor, use SAMME or pre-calibrate the base estimator:

from sklearn.calibration import CalibratedClassifierCV

calibrated_stump = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=1), method='isotonic', cv=3
)
clf = AdaBoostClassifier(
    estimator=calibrated_stump,
    n_estimators=200,
    algorithm='SAMME.R'
)

Tune learning_rate and n_estimators Together

from sklearn.model_selection import GridSearchCV

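param_grid = {
    'n_estimators':  [100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1.0]
}
# Lower LR + more rounds often generalizes better, at higher training cost
# Start with LR=0.1, n=200; then try LR=0.01, n=1000

search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_)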

Multi-Class Probabilities from SAMME.R

# SAMME.R softmax output
proba = clf.predict_proba(X_test)    # Shape: (n_samples, K)

# For log-odds (raw F scores before softmax)
decision = clf.decision_function(X_test)  # Shape: (n_samples, K)

17. When to Use Them

Use SAMME when:

Base learner does not implement predict_proba (e.g., some SVM variants, non-probabilistic classifiers)
Probability estimates from the base learner are poorly calibrated
You want a theoretically correct multi-class boosting algorithm with hard labels
K is large (> 20 classes) and calibrated probabilities are hard to obtain

Use SAMME.R when:

Base learner has well-calibrated probabilities (verify with a calibration curve; stumps and shallow trees are often only roughly calibrated)
Faster convergence is needed (fewer rounds for same accuracy)
K is moderate (2–20 classes)
You want better probability output from the final model
You need to reproduce results from older scikit-learn versions, where SAMME.R was the default (through 1.5)

Prefer gradient boosting (XGBoost/LightGBM softmax) when:

Maximum accuracy on tabular multi-class data is required
Noise robustness is important (GBT with log-loss is far less sensitive)
K is large and accuracy matters more than training cost (GBT with softmax fits K trees per round, while SAMME fits one K-class tree per round)
Regularization is needed — GBT has far more tools

Summary

┌──────────────────────────────────────────────────────────────────────┐
│              SAMME / SAMME.R AT A GLANCE                            │
├──────────────────────────────────────────────────────────────────────┤
│  LOSS          Multi-class exponential: exp(−(1/K)·yᵀf(x))         │
│  SAMME ALPHA   ln((1−ε)/ε) + ln(K−1)    [hard label update]        │
│  SAMME.R UPDATE (K−1) · [log p̂ₖ − mean_j log p̂ⱼ]  [prob update]  │
│  KEY TERM      ln(K−1): correct threshold for K-class random chance │
│  CONVERGENCE   Exponential decay in training error (both)           │
│  SAMME.R EDGE  2–5× faster convergence using probability info       │
│  WEAKNESS      Noise-sensitive; SAMME.R needs calibrated probs      │
│  vs GBT        Principled but weaker; GBT dominates tabular tasks   │
│  BEST FOR      Multi-class with probabilistic weak learners          │
└──────────────────────────────────────────────────────────────────────┘

SAMME and SAMME.R represent the completion of AdaBoost's theoretical program. Binary AdaBoost answered "how do we boost binary classifiers?" — and the answer turned out to be intimately tied to the binary exponential loss. SAMME answered "what does that mean for K classes?" — and the answer required deriving a new loss, a new margin concept, and the surprising ln(K−1) correction that distinguishes genuine multi-class boosting from a naive binary re-application. SAMME.R went further and asked "what if the weak learner tells us more than a single label?" — and found that the full probability vector gives the optimal gradient direction in the K-dimensional function space. Together, they close the circle from AdaBoost to gradient boosting for multi-class problems.