SAMME & SAMME.R

Stagewise Additive Modeling using Multi-class Exponential Loss

1. What Are SAMME and SAMME.R?

SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss) and SAMME.R (the "Real" variant) are the canonical extensions of AdaBoost to K ≥ 2 class problems. They were introduced by Zhu, Zou, Rosset, and Hastie in 2009.

The key insight of this paper was that naive extensions of binary AdaBoost to multi-class problems are either incorrect (they ignore the structure of the K-class problem) or require solving K binary subproblems (OvR). SAMME derives from first principles what the correct alpha formula must be for a direct multi-class boosting algorithm, and SAMME.R extends this to use probabilistic weak learner outputs for dramatically faster convergence.

| Property | SAMME | SAMME.R |
|---|---|---|
| Weak learner output | Hard class labels | Class probability estimates |
| Learner requirement | Only predict | Requires predict_proba |
| Convergence speed | Slower (one update per round) | Faster (uses full probability info) |
| Sensitivity to calibration | Low | High (needs calibrated probabilities) |
| Alpha formula | ln((1−εₜ)/εₜ) + ln(K−1) | Not a single scalar; vector update |
| sklearn status | Supported | Former default; deprecated in sklearn 1.4 and removed in 1.6 |

2. The Problem: Extending AdaBoost to K > 2 Classes

Binary AdaBoost encodes labels as y ∈ {−1, +1} and classifiers as h(x) ∈ {−1, +1}. The final model is:

H(x) = sign( Σₜ αₜ hₜ(x) )
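As a small illustration of the weighted vote, the sketch below combines three weak learners on three samples. The round weights and the weak-learner outputs are made-up numbers chosen for the example, not values from any fitted model.

```python
import numpy as np

# Hypothetical round weights alpha_t (illustrative values only).
alphas = np.array([0.9, 0.4, 0.3])

# Rows are weak learners h_1..h_3, columns are three samples; entries in {-1, +1}.
weak_preds = np.array([
    [+1, -1, +1],
    [-1, -1, +1],
    [-1, +1, +1],
])

scores = alphas @ weak_preds        # sum_t alpha_t * h_t(x), one score per sample
H = np.sign(scores).astype(int)     # final binary prediction
print(scores, H)
```

The first sample shows why the weights matter: two of the three learners vote −1, but the single learner voting +1 carries more weight than the other two combined.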

For K classes, the naive approaches have problems:

One-vs-Rest (OvR): Train K binary classifiers, each distinguishing class k from all others. Inefficient, doesn't naturally produce a coherent multi-class margin.

Naive extension: Just add more alpha terms. But what should alpha be for K classes? The binary derivation explicitly uses y ∈ {−1,+1} — it doesn't generalize directly.

AdaBoost.M1: The earliest multi-class extension — applies binary AdaBoost with K-class labels directly. Works only when ε < 0.5, which becomes harder to guarantee as K grows. Fails completely if any round produces ε ≥ 0.5.

AdaBoost.MH / AdaBoost.MO: Various alternatives that reformulate as multi-label or output code problems. Complex and not as principled.

SAMME solves this by re-deriving the alpha formula from the correct multi-class exponential loss rather than borrowing from the binary case.


3. Mathematical Foundation — Multi-Class Exponential Loss

3.1 The Binary AdaBoost Loss (Recap)

Binary AdaBoost minimizes the exponential loss:

L(y, f(x)) = exp(−y · f(x))    y ∈ {−1, +1}

The population minimizer (the Bayes-optimal f) satisfies:

f*(x) = ½ · log( P(y=1|x) / P(y=−1|x) )    (half the log-odds)

This connection to log-odds gives binary AdaBoost its probabilistic interpretation.


3.2 The Multi-Class Exponential Loss

For K classes, we represent predictions and labels as K-dimensional vectors subject to a sum-to-zero constraint.

Label encoding: for a sample with true class k, define:

yᵢ = (yᵢ₁, yᵢ₂, ..., yᵢK)ᵀ    where yᵢₖ = { 1       if class = k
                                                { −1/(K−1) otherwise

This encoding sums to zero, Σₖ yᵢₖ = 1 − (K−1)/(K−1) = 0, and it treats all K−1 incorrect classes symmetrically.

The multi-class exponential loss for a vector-valued classifier f(x) = (f₁(x), ..., fK(x))ᵀ (also sum-to-zero):

L(y, f(x)) = exp( −(1/K) · yᵀf(x) )
           = exp( −(1/K) · Σₖ yₖ fₖ(x) )

The 1/K scaling ensures the loss is comparable across different K — without it, K-class problems would have K times the gradient magnitude of binary problems.
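The encoding and the loss can be sketched in a few lines of NumPy. The helper names `encode_label` and `multiclass_exp_loss` are illustrative, and class indices are 0-based here (an assumption of the sketch; the text numbers classes from 1).

```python
import numpy as np

def encode_label(k, K):
    """Sum-to-zero coding: 1 at index k, -1/(K-1) everywhere else."""
    y = np.full(K, -1.0 / (K - 1))
    y[k] = 1.0
    return y

def multiclass_exp_loss(y, f):
    """exp(-(1/K) * y^T f) for a coded label y and a score vector f."""
    K = len(y)
    return float(np.exp(-np.dot(y, f) / K))

K = 3
y = encode_label(0, K)                   # true class is index 0
f_good = np.array([2.0, -1.0, -1.0])     # favors class 0, sums to zero
f_bad = np.array([-1.0, 2.0, -1.0])      # favors class 1, sums to zero

loss_good = multiclass_exp_loss(y, f_good)
loss_bad = multiclass_exp_loss(y, f_bad)
print(y, y.sum())
print(loss_good, loss_bad)               # below 1 when f agrees with y, above 1 otherwise

# For K = 2 the loss reduces to the binary exponential loss exp(-f_1).
binary = multiclass_exp_loss(encode_label(0, 2), np.array([0.7, -0.7]))
print(binary, np.exp(-0.7))
```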


3.3 Margin in the Multi-Class Setting

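The population minimizer of the multi-class exponential loss, subject to the sum-to-zero constraint, is:

fₖ*(x) = (K−1) · [ log P(y=k|x) − (1/K) Σⱼ log P(y=j|x) ]     for each class k

This is the centered log-probability (the log-probability for class k minus the mean log-probability across all classes), scaled by K−1. For K = 2 it reduces to f₁*(x) = ½ · log( P(y=1|x) / P(y=2|x) ), the half log-odds of Section 3.1.

The final prediction rule:

H(x) = argmax_k fₖ*(x) = argmax_k log P(y=k|x) = argmax_k P(y=k|x)

The factor K−1 and the subtracted mean are the same for every class, so they do not affect the argmax. The rule predicts the class with the highest posterior probability, which is the Bayes-optimal multi-class decision and is consistent with the binary case.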
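The minimizer can be checked numerically. The sketch below takes a hypothetical posterior p at a single point x, searches over positive multiples of the centered log-probability vector for the one with the lowest expected loss, and then confirms that random sum-to-zero perturbations do not lower the loss further. The posterior values, the grid, and the function name are choices made for this example.

```python
import numpy as np

def expected_loss(f, p):
    """E_y[ exp(-(1/K) y^T f) ] when class k occurs with probability p[k]."""
    K = len(p)
    Y = np.full((K, K), -1.0 / (K - 1))
    np.fill_diagonal(Y, 1.0)               # row k is the coded label of class k
    return float(p @ np.exp(-(Y @ f) / K))

p = np.array([0.6, 0.3, 0.1])              # hypothetical posterior at some x
K = len(p)
centered = np.log(p) - np.log(p).mean()    # centered log-probabilities

# Search over positive multiples c of the centered log-probability vector.
grid = np.linspace(0.1, 5.0, 491)          # step 0.01
best_c = grid[int(np.argmin([expected_loss(c * centered, p) for c in grid]))]
f_star = best_c * centered

# Random sum-to-zero perturbations should never lower the loss.
rng = np.random.default_rng(0)
never_better = True
for _ in range(1000):
    delta = rng.normal(scale=0.1, size=K)
    delta -= delta.mean()
    if expected_loss(f_star + delta, p) < expected_loss(f_star, p) - 1e-12:
        never_better = False

print(best_c, K - 1)                       # best multiple found vs. K-1
print(never_better)
print(int(np.argmax(f_star)) == int(np.argmax(p)))
```

Because any positive multiple of the centered log-probabilities has the same argmax, the classification rule does not depend on the constant; only the loss value does.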


4. SAMME — Hard Label Multi-Class Boosting

4.1 Derivation of the Alpha Formula

At boosting round t, we have the current additive model F_{t-1}(x) and fit a new weak classifier hₜ(x) ∈ {1, 2, ..., K} with weight αₜ.

The weighted error rate is:

εₜ = Σᵢ wᵢ · 𝟙[hₜ(xᵢ) ≠ yᵢ]    (sum of weights on misclassified samples)

The weighted correct rate: 1 − εₜ.

To find the optimal αₜ, we minimize the exponential loss of the updated model. Using the FSAM framework (Forward Stagewise Additive Modeling), αₜ solves:

αₜ = argmin_α  Σᵢ exp( −(1/K) · yᵢᵀ [F_{t-1}(xᵢ) + α · T(hₜ(xᵢ))] )

Where T(k) is the indicator vector encoding: T(k)ₖ = 1, T(k)ⱼ = −1/(K−1) for j ≠ k.

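Working through the algebra: yᵢᵀT(hₜ(xᵢ)) = K/(K−1) if hₜ(xᵢ) = yᵢ, and −K/(K−1)² otherwise. With weights wᵢ ∝ exp(−(1/K) · yᵢᵀF_{t-1}(xᵢ)) normalized to sum to 1, the objective becomes:

(1 − εₜ) · exp(−α/(K−1)) + εₜ · exp(α/(K−1)²)

Setting the derivative with respect to α to zero:

∂/∂α Σᵢ wᵢ · exp(−(1/K) · α · yᵢᵀT(hₜ(xᵢ))) = 0

This yields:

αₜ = (K−1)²/K · [ ln((1−εₜ)/εₜ) + ln(K−1) ]

The (K−1)²/K factor is a positive constant shared by all rounds. It does not change the argmax at prediction time, and after normalization it does not change the sample weights either: the weight ratio between a misclassified and a correctly classified sample equals exp(αₜ) with αₜ in the standard form. The standard form therefore drops it:

αₜ = ln((1−εₜ)/εₜ) + ln(K−1)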
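A quick numerical check of this step, under the assumption that the weights are normalized so that correctly classified samples carry total weight 1−ε and misclassified samples carry ε. The one-dimensional objective is then (1−ε)·exp(−α/(K−1)) + ε·exp(α/(K−1)²), using the two values of yᵢᵀT(hₜ(xᵢ)) noted above. The sketch finds its stationary point by bisection and compares the resulting reweighting with the standard-form αₜ. The values of ε and K are arbitrary, and the function names are illustrative.

```python
import math

def objective_grad(alpha, eps, K):
    """Derivative of (1-eps)*exp(-alpha/(K-1)) + eps*exp(alpha/(K-1)**2)."""
    return (-(1 - eps) / (K - 1) * math.exp(-alpha / (K - 1))
            + eps / (K - 1) ** 2 * math.exp(alpha / (K - 1) ** 2))

def solve_alpha(eps, K, lo=-50.0, hi=50.0, iters=200):
    """Bisection on the gradient, which is increasing because the objective is convex."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if objective_grad(mid, eps, K) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

eps, K = 0.6, 5                          # arbitrary example values
beta = solve_alpha(eps, K)               # exact minimizer of the objective
alpha_std = math.log((1 - eps) / eps) + math.log(K - 1)

# Multiplicative weight change for a misclassified sample relative to a correct one.
ratio = math.exp(beta / (K - 1) ** 2) / math.exp(-beta / (K - 1))

print(beta, alpha_std, beta / alpha_std)  # last value: the constant factor between the two
print(ratio, math.exp(alpha_std))         # the two reweighting factors agree
```

The agreement of the last two numbers is why implementations can use the standard form directly in the weight update.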

4.2 The Critical ln(K−1) Term

The ln(K−1) term is what distinguishes SAMME from a naive multi-class extension. Compare:

Binary AdaBoost:  αₜ = ½ · ln((1−εₜ)/εₜ)                    (positive iff εₜ < 0.5)
SAMME:            αₜ = ln((1−εₜ)/εₜ) + ln(K−1)              (positive iff εₜ < 1−1/K)

For SAMME, αₜ > 0 if and only if:

ln((1−εₜ)/εₜ) + ln(K−1) > 0
⟺  (1−εₜ)(K−1) > εₜ
⟺  K−1 > K·εₜ
⟺  εₜ < 1 − 1/K

Interpretation: A K-class random classifier achieves error rate (K−1)/K. SAMME requires each weak learner to beat random guessing — εₜ < 1 − 1/K — which is the correct threshold for multi-class problems (not 0.5, which only applies to binary).

| K | Random error | Required threshold | ln(K−1) |
|---|---|---|---|
| 2 | 0.50 | εₜ < 0.50 | 0.000 |
| 3 | 0.67 | εₜ < 0.67 | 0.693 |
| 5 | 0.80 | εₜ < 0.80 | 1.386 |
| 10 | 0.90 | εₜ < 0.90 | 2.197 |
| 100 | 0.99 | εₜ < 0.99 | 4.595 |

Without the ln(K−1) correction, the alpha formula would require εₜ < 0.5, which is far too strict for multi-class problems. With 10 classes, a weak learner with error rate 0.7 is still clearly better than random guessing (error rate 0.9) and should receive a positive weight.
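The thresholds can be reproduced directly from the standard-form formula. In this sketch, `samme_alpha` and `binary_alpha` are illustrative helper names, not library functions.

```python
import math

def samme_alpha(eps, K):
    """Standard-form SAMME learner weight."""
    return math.log((1 - eps) / eps) + math.log(K - 1)

def binary_alpha(eps):
    """Binary AdaBoost learner weight, for comparison."""
    return 0.5 * math.log((1 - eps) / eps)

for K in (2, 3, 5, 10, 100):
    chance_error = 1 - 1 / K
    print(f"K={K:3d}  chance error={chance_error:.3f}  "
          f"ln(K-1)={math.log(K - 1):.3f}  "
          f"alpha at chance={abs(samme_alpha(chance_error, K)):.3f}")

# A 10-class learner with 70% error still beats chance (90% error).
alpha_multi = samme_alpha(0.7, 10)     # positive: the learner is kept
alpha_binary = binary_alpha(0.7)       # negative: the binary rule would reject it
print(alpha_multi, alpha_binary)
```

At exactly the chance error rate the SAMME weight is zero (up to rounding), so a learner no better than guessing contributes nothing to the vote.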


4.3 Weight Update Rule

After computing αₜ, update sample weights:

wᵢ ← wᵢ · exp(αₜ · 𝟙[hₜ(xᵢ) ≠ yᵢ])

Note: the update only involves whether hₜ misclassifies sample i — not which class it predicted. This is a simplification that SAMME inherits from the hard-label setting; SAMME.R will use the full probability vector.

Normalize: wᵢ ← wᵢ / Σᵢ wᵢ.
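A worked example of one update on six samples and K = 3 classes. The labels and predictions are made up for illustration, and class indices are 0-based in the code.

```python
import numpy as np

K = 3
y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])        # two mistakes (positions 2 and 4)
w = np.full(len(y_true), 1 / len(y_true))    # uniform starting weights

miss = y_pred != y_true
eps = w[miss].sum()                                  # weighted error = 2/6
alpha = np.log((1 - eps) / eps) + np.log(K - 1)      # ln 2 + ln 2 = ln 4

w_new = w * np.exp(alpha * miss)             # only misclassified samples are scaled
w_new /= w_new.sum()                         # renormalize

print(eps, alpha)
print(w_new)
print(w_new[miss].sum())                     # weight now sitting on the mistakes
```

After the update, the misclassified samples carry total weight 1 − 1/K. In other words, the learner just fitted would score exactly at chance level on the reweighted data, which pushes the next learner toward different mistakes.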


4.4 Full SAMME Algorithm

Input: Training data {(x₁,y₁),...,(xₘ,yₘ)}, yᵢ ∈ {1,...,K}
       Number of rounds T, weak learner WL

Initialize: wᵢ = 1/m  for all i

For t = 1 to T:

    1. Train weak learner on weighted data:
       hₜ = WL( {(xᵢ, yᵢ, wᵢ)} )

    2. Compute weighted error:
       εₜ = Σᵢ wᵢ · 𝟙[hₜ(xᵢ) ≠ yᵢ]

    3. If εₜ ≥ 1 − 1/K: stop or resample

    4. Compute learner weight:
       αₜ = ln((1−εₜ)/εₜ) + ln(K−1)

    5. Update sample weights:
       wᵢ ← wᵢ · exp(αₜ · 𝟙[hₜ(xᵢ) ≠ yᵢ])
       Normalize: wᵢ ← wᵢ / Σⱼ wⱼ

Output: H(x) = argmax_k  Σₜ αₜ · 𝟙[hₜ(x) = k]

The final decision: for each class k, sum the alpha weights of all rounds where hₜ predicted k. Predict the class with the highest total weight.
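The steps above can be sketched from scratch with scikit-learn decision stumps as the weak learner. This is a minimal illustration rather than sklearn's implementation: `fit_samme` and `predict_samme` are hypothetical helper names, the Iris dataset and 50 rounds are arbitrary choices, and a learner at or above the chance threshold simply stops the loop.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_samme(X, y, n_rounds=50, max_depth=1):
    classes = np.unique(y)
    K = len(classes)
    w = np.full(len(y), 1 / len(y))                  # initialize w_i = 1/m
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        stump.fit(X, y, sample_weight=w)             # step 1: weighted fit
        miss = stump.predict(X) != y
        eps = max(w[miss].sum(), 1e-10)              # step 2 (floored to avoid log(1/0))
        if eps >= 1 - 1 / K:                         # step 3: no better than chance
            break
        alpha = np.log((1 - eps) / eps) + np.log(K - 1)   # step 4
        learners.append(stump)
        alphas.append(alpha)
        w = w * np.exp(alpha * miss)                 # step 5: upweight mistakes
        w /= w.sum()
    return classes, learners, np.array(alphas)

def predict_samme(model, X):
    classes, learners, alphas = model
    votes = np.zeros((len(X), len(classes)))
    for h, a in zip(learners, alphas):
        pred = h.predict(X)
        votes += a * (pred[:, None] == classes[None, :])  # alpha_t * 1[h_t(x) = k]
    return classes[votes.argmax(axis=1)]

X, y = load_iris(return_X_y=True)
model = fit_samme(X, y)
train_acc = float(np.mean(predict_samme(model, X) == y))
print(len(model[1]), train_acc)
```

Every stored αₜ is positive by construction, since rounds that fail the εₜ < 1 − 1/K check never enter the ensemble.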


5. SAMME.R — Soft Probability Boosting

5.1 Why Use Probabilities Instead of Labels?

SAMME uses only the binary signal "correct/incorrect" from each weak learner — it ignores how confident the learner is. A learner that correctly predicts class 3 with probability 0.51 and one that predicts it with probability 0.99 both contribute equally under SAMME.

SAMME.R exploits the full probability vector p̂(x) = (p̂₁(x), ..., p̂K(x)) from the weak learner. The probability vector carries much more information than the hard label — particularly about which wrong classes are being confused.

Key requirement: The weak learner must implement predict_proba and produce reasonably calibrated probability estimates.


5.2 The SAMME.R Update Derivation

Instead of fitting a scalar weight αₜ times a hard-label indicator, SAMME.R fits a vector-valued update directly in the K-dimensional output space.

At each round t, the update to the additive model is a vector function h̃ₜ(x):

h̃ₜ(x)ₖ = (K−1) · [ log p̂ₖ(x) − (1/K) Σⱼ log p̂ⱼ(x) ]

This is the centered log-probability of the weak learner's output scaled by K−1, the same form as the Bayes-optimal solution in Section 3.3 with p̂ in place of the true posteriors.

Why this form? The FSAM framework asks: given the current model F_{t-1}(x), what vector-valued function h̃ minimizes the weighted exponential loss?

The solution is:

h̃* = argmin_{h̃} Σᵢ wᵢ · exp(−(1/K) · yᵢᵀ h̃(xᵢ))
       subject to: Σₖ h̃ₖ(x) = 0 for every x

When the weak learner produces probability estimates p̂ₖ(x), the optimal h̃ evaluated at the current weights is exactly the centered log-probability formula above.

Model update:

F_t(x) = F_{t-1}(x) + ν · h̃ₜ(x)       (ν = learning rate, default 1)

There is no per-round scalar αₜ to compute: the update magnitude is embedded in the log-probability magnitudes. A confident prediction (p̂ₖ → 1) gives a large update; an uncertain prediction (p̂ₖ ≈ 1/K for all k) gives an update near zero.
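The effect of confidence on the update size can be seen by evaluating the centered log-probability for a few probability vectors. In this sketch, `sammer_update` is an illustrative helper; the leading constant defaults to K−1 (the scaling in Zhu et al.'s statement of the algorithm) and is exposed as a parameter because any positive constant leaves the properties shown here unchanged. The probability floor is also an assumption of the sketch.

```python
import numpy as np

def sammer_update(proba, scale=None, floor=1e-12):
    """scale * (log p_k - mean_j log p_j), computed along the last axis."""
    proba = np.clip(np.asarray(proba, dtype=float), floor, None)  # avoid log(0)
    K = proba.shape[-1]
    if scale is None:
        scale = K - 1
    log_p = np.log(proba)
    return scale * (log_p - log_p.mean(axis=-1, keepdims=True))

uniform = sammer_update([1 / 3, 1 / 3, 1 / 3])      # no information
mild = sammer_update([0.40, 0.30, 0.30])            # slight preference for class 0
confident = sammer_update([0.98, 0.01, 0.01])       # strong preference for class 0

for name, h in [("uniform", uniform), ("mild", mild), ("confident", confident)]:
    print(name, np.round(h, 3), "sum =", round(float(h.sum()), 12))
```

Each update vector sums to zero, the uniform prediction contributes nothing, and the confident prediction moves the score for class 0 far more than the mild one.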


5.3 Full SAMME.R Algorithm

Input: Training data {(xᵢ, yᵢ)}, K classes
       Number of rounds T, weak learner WL with predict_proba

Initialize: wᵢ = 1/m  for all i

For t = 1 to T:

    1. Train weak learner on weighted data:
       hₜ = WL( {(xᵢ, yᵢ, wᵢ)} )

    2. Get class probability estimates:
       p̂ₖ(xᵢ) = P̂(y=k | xᵢ)  for all i, k    (from hₜ.predict_proba)
       Clip to avoid log(0): p̂ₖ ← max(p̂ₖ, δ)  for a small δ > 0

    3. Compute vector update for each sample:
       h̃ₜ(xᵢ)ₖ = (K−1) · [log p̂ₖ(xᵢ) − (1/K) Σⱼ log p̂ⱼ(xᵢ)]

    4. Update model:
       Fₜ(xᵢ)ₖ = F_{t-1}(xᵢ)ₖ + h̃ₜ(xᵢ)ₖ   for all k

    5. Update sample weights:
       wᵢ ← wᵢ · exp(−(K−1)/K · Σₖ yᵢₖ · log p̂ₖ(xᵢ))
       Normalize: wᵢ ← wᵢ / Σⱼ wⱼ

Output: H(x) = argmax_k  F_T(x)ₖ
        P(y=k|x) = exp(F_T(x)ₖ/(K−1)) / Σⱼ exp(F_T(x)ⱼ/(K−1))   (softmax of F_T/(K−1))

The weight update in step 5 uses the cross-entropy between the true label indicator yᵢ and the predicted log-probabilities — samples that are confidently and correctly classified get downweighted most; samples that are confidently wrong get upweighted most.
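A from-scratch sketch of the SAMME.R loop, again with scikit-learn stumps. It is an illustration under stated assumptions, not sklearn's implementation: `fit_sammer` and `decision_function` are hypothetical names, the update uses the K−1 scaling from Zhu et al.'s statement of the algorithm, the probability floor of 1e-5 is an arbitrary choice, and the learning rate is fixed at 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_sammer(X, y, n_rounds=50, max_depth=1, floor=1e-5):
    classes = np.unique(y)
    K = len(classes)
    # Coded labels: 1 for the true class, -1/(K-1) elsewhere.
    Y = np.where(y[:, None] == classes[None, :], 1.0, -1.0 / (K - 1))
    w = np.full(len(y), 1 / len(y))
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        stump.fit(X, y, sample_weight=w)                              # step 1
        log_p = np.log(np.clip(stump.predict_proba(X), floor, None))  # step 2
        learners.append(stump)
        # Step 5: weight update from the coded label and the log-probabilities.
        w = w * np.exp(-(K - 1) / K * np.sum(Y * log_p, axis=1))
        w /= w.sum()
    return classes, learners

def decision_function(model, X, floor=1e-5):
    classes, learners = model
    K = len(classes)
    F = np.zeros((len(X), K))
    for h in learners:
        log_p = np.log(np.clip(h.predict_proba(X), floor, None))
        # Steps 3 and 4: add the scaled, centered log-probabilities.
        F += (K - 1) * (log_p - log_p.mean(axis=1, keepdims=True))
    return F

X, y = load_iris(return_X_y=True)
model = fit_sammer(X, y)
F = decision_function(model, X)
pred = model[0][F.argmax(axis=1)]
train_acc = float(np.mean(pred == y))
print(train_acc, float(np.abs(F.sum(axis=1)).max()))
```

The model scores are recomputed at prediction time from the stored learners, so training only needs to track the sample weights. Every row of F sums to zero, as the sum-to-zero constraint requires.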


6. SAMME vs. SAMME.R — Deep Comparison

| Property | SAMME | SAMME.R |
|---|---|---|
| Weak learner output | Hard labels hₜ(x) ∈ {1, ..., K} | Probability vector p̂(x) ∈ ΔK |
| Update type | Scalar α × indicator vector | Vector-valued log-probability update |
| Alpha formula | ln((1−ε)/ε) + ln(K−1) | None; magnitude embedded in log-probs |
| Weight update signal | Binary correct/incorrect | Full probability distribution |
| Information used | Which class was predicted (hard) | How confident + which class (soft) |
| Convergence speed | Slower: one hard label per round | Faster: K−1 dimensions of info per round |
| Calibration sensitivity | None; only uses argmax | High; needs calibrated p̂ |
| Weak learner requirement | Only predict | Must have predict_proba |
| Theoretical basis | FSAM with multi-class exp loss | FSAM, optimal vector step |
| With miscalibrated probs | Unaffected | Can underperform SAMME |
| Sklearn support | algorithm='SAMME' | algorithm='SAMME.R' (former default; deprecated in 1.4, removed in 1.6) |

Key empirical finding (Zhu et al., 2009): SAMME.R consistently reaches lower test error in fewer boosting rounds than SAMME — often 2–5x fewer rounds for the same accuracy. The additional probability information per round dramatically accelerates convergence.

When SAMME is preferred: the base learner has no predict_proba, or its probability estimates are poorly calibrated.


7. Connection to Forward Stagewise Additive Modeling (FSAM)

Both SAMME and SAMME.R are instances of Forward Stagewise Additive Modeling:

At each stage t, find (αₜ, hₜ) to minimize:
    Σᵢ L(yᵢ, F_{t-1}(xᵢ) + αₜ · hₜ(xᵢ))

Don't go back and adjust previous stages.

This greedy one-stage-at-a-time approach is what makes both algorithms tractable. Optimizing all stages simultaneously would be intractable.

Connection to gradient boosting:

Both SAMME and SAMME.R can be viewed as functional-gradient-style procedures on the multi-class exponential loss:

SAMME:   Each step is a hard-label vote (one coded class vector per sample), scaled by αₜ
SAMME.R: Each step is a real-valued vector built from the weak learner's probabilities

In this view SAMME.R is closer to gradient boosting: it takes a real-valued step in the K-dimensional function space, whereas SAMME restricts each step to a hard label.

Note that sklearn's GradientBoostingClassifier does not offer a direct multi-class counterpart: its loss='exponential' option supports binary classification only (where it recovers AdaBoost), and multi-class gradient boosting uses the multinomial log-loss instead.


8. Convergence Properties

SAMME Training Error Bound

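Training error is usually described by the same kind of exponential bound as binary AdaBoost:

Training error ≤ exp( −2 Σₜ γₜ² )

Where γₜ = (1−εₜ) − 1/K is the "edge" of the weak learner above random chance. For K = 2 this is γₜ = ½ − εₜ, and the bound is the classical AdaBoost result.

As long as each weak learner has γₜ > 0 (beats random guessing), the bound decays exponentially in the number of rounds. For K > 2, read this as the qualitative picture carried over from the binary case rather than a tight guarantee.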

SAMME.R Convergence

SAMME.R has a tighter convergence bound because it uses the full probability vector. Each round of SAMME.R reduces the multi-class exponential loss by at least:

ΔL ≥ (K−1)²/K · (KL divergence between true class distribution and p̂)

When weak learners have good probability estimates (high KL divergence from uniform), SAMME.R reduces loss much faster per round than SAMME.


9. Decision Boundary Geometry

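With decision stumps or deeper trees as base learners, both SAMME and SAMME.R produce score functions that are piecewise constant in the features, so the decision boundaries are made of axis-aligned segments. Deeper trees and more rounds give a finer partition. The structure is the same as for binary AdaBoost.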

The K-class boundary between class j and class k is where:

Σₜ αₜ · 𝟙[hₜ(x) = j]  =  Σₜ αₜ · 𝟙[hₜ(x) = k]     (SAMME)
F_T(x)_j = F_T(x)_k                                    (SAMME.R)

For SAMME.R, the output F_T(x) is a K-dimensional vector, and the boundaries are where two components are equal — forming a partition of the feature space into K Voronoi-like regions in the function space.

With many rounds and complex base learners, both methods can approximate arbitrarily complex multi-class boundaries.


10. The Bias-Variance Profile

| Configuration | Bias | Variance | Notes |
|---|---|---|---|
| Few rounds (T small) | High | Low | Ensemble too simple |
| Many rounds (T large, clean) | Low | Low | Margin increases, good generalization |
| Many rounds (T large, noisy) | Low | High | Noisy samples upweighted → overfit |
| SAMME.R + miscalibrated probs | High | Medium | Wrong probability estimates mislead |

The margin theory for SAMME extends binary AdaBoost's result: the multi-class generalization error is bounded by the distribution of multi-class margins. Adding rounds increases the minimum margin even after training error reaches zero — explaining why continued training doesn't overfit on clean data.


11. Assumptions

| Assumption | SAMME | SAMME.R |
|---|---|---|
| Weak learner beats random | εₜ < 1 − 1/K | p̂ better than uniform |
| IID samples | ✅ Required | ✅ Required |
| Clean labels | ✅ Sensitive to noise | ✅ Even more sensitive |
| Calibrated probabilities | Not required | ✅ Required for SAMME.R |
| Sufficient weak learner capacity | ✅ Must exceed 1−1/K | ✅ Must produce informative p̂ |
| No feature scaling | ✅ Tree-based base learner | ✅ Same |

12. Evaluation Metrics

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, log_loss

# X_train, X_test, y_train, y_test are assumed to be defined already
# (for example via sklearn.model_selection.train_test_split).

# SAMME
clf_samme = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm='SAMME',
    random_state=42
)
clf_samme.fit(X_train, y_train)

# SAMME.R (requires predict_proba in base estimator).
# algorithm='SAMME.R' was deprecated in scikit-learn 1.4 and removed in 1.6,
# so this part needs an older release.
clf_sammer = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    algorithm='SAMME.R',
    random_state=42
)
clf_sammer.fit(X_train, y_train)

# Metrics
y_pred  = clf_sammer.predict(X_test)
y_proba = clf_sammer.predict_proba(X_test)

print(classification_report(y_test, y_pred))
print(f"Log Loss: {log_loss(y_test, y_proba):.4f}")

# Staged evaluation: accuracy after each boosting round.
# To choose T, compute this on a validation split rather than the test set.
staged_acc = [accuracy_score(y_test, p)
              for p in clf_sammer.staged_predict(X_test)]

13. Advantages

✅ Principled Multi-Class Extension

SAMME is not an ad hoc extension of binary AdaBoost — it is derived from the correct multi-class exponential loss. The ln(K−1) correction is mathematically necessary and theoretically justified.

✅ SAMME.R Faster Convergence

SAMME.R typically needs 2–5× fewer rounds to reach the same accuracy as SAMME, because it uses the full probability vector rather than a single bit per round.

✅ Native K-Class Support

No OvR or OvO strategies needed. The algorithm directly optimizes K-class loss — avoiding the calibration problems and computational overhead of decomposition approaches.

✅ Interpretable Weak Learners

Decision stumps remain interpretable. Each round adds one stump, and the final model is a weighted sum over T simple rules.

✅ Proven Convergence Guarantees

SAMME has the same exponential training error decay as binary AdaBoost. Well-understood theoretically.

✅ Staged Prediction Available

staged_predict / staged_predict_proba allow learning curve analysis and optimal round selection — same as binary AdaBoost.

✅ Probability Output

SAMME.R's output F_T(x) is a scaled, centered log-probability vector, so a softmax over it yields class probabilities. How well calibrated they are depends on the calibration of the base learners.


14. Drawbacks & Limitations

❌ Sensitive to Label Noise

Inherits binary AdaBoost's catastrophic sensitivity to noisy labels. In K-class settings, this is even more problematic because there are K−1 ways to be wrong. A mislabeled sample will be upweighted exponentially.

❌ SAMME.R Requires Calibrated Probabilities

If the base learner's probability estimates are poorly calibrated (common with decision stumps), SAMME.R's weight updates are based on wrong information. A stump that produces probability 0.99/0.01 for all predictions despite being only 55% accurate will produce massive, misleading updates.

Check calibration:

from sklearn.calibration import calibration_curve
# proba: predict_proba output on the test set; k: index of the class being checked
frac_pos, mean_pred = calibration_curve(y_test == k, proba[:, k], n_bins=10)

❌ Sequential Training

Cannot parallelize across rounds — each depends on the weights from the previous.

❌ Outperformed by Gradient Boosting

For most multi-class tabular problems, gradient boosted trees (XGBoost, LightGBM with K-class softmax) outperform SAMME/SAMME.R. Gradient boosting generalizes from the exponential loss to arbitrary differentiable losses, and XGBoost and LightGBM additionally use second-order information in each update.

❌ Limited Regularization

No native L1/L2 regularization, no subsampling, no feature sampling. The learning_rate parameter provides shrinkage, but the regularization toolkit is thin compared to GBT.


15. SAMME/SAMME.R vs. Other Multi-Class Methods

| Property | SAMME | SAMME.R | GBT Softmax | OvR LR | OvO SVM |
|---|---|---|---|---|---|
| Native multi-class | ✅ | ✅ | ✅ | ❌ (K models) | ❌ (K(K−1)/2 models) |
| Base learner type | Any | Any + proba | Trees | Linear | SVM |
| Noise robustness | ❌ Poor | ❌ Poor | ⚠️ Moderate | ✅ Good | ✅ Good |
| Calibrated probs | ⚠️ Moderate | ✅ Good (if calib) | ✅ Good | ✅ Good | ❌ Poor |
| Training speed | ✅ Fast | ✅ Fast | ⚠️ Moderate | ✅ Fast | ❌ Slow |
| Accuracy (tabular) | ⚠️ Good | ✅ Good | ✅✅ Best | ⚠️ Moderate | ⚠️ Moderate |
| Many classes (K>10) | ⚠️ Slower | ⚠️ Same | ❌ Very slow (K trees/round) | ⚠️ K models | ❌ K(K−1)/2 models |
| Overfitting (noisy) | ❌ High | ❌ High | ⚠️ Moderate | ✅ Low | ✅ Low |

16. Practical Tips & Gotchas

Basic Setup

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# SAMME — use when base learner has no predict_proba
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=300,
    learning_rate=0.5,
    algorithm='SAMME',
    random_state=42
)

# SAMME.R: use when base learner has calibrated predict_proba
# (needs scikit-learn < 1.6; the option was deprecated in 1.4 and removed in 1.6)
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=300,
    learning_rate=0.5,
    algorithm='SAMME.R',
    random_state=42
)
clf.fit(X_train, y_train)

Find Optimal Number of Rounds

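import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# X_val, y_val: a held-out validation split (assumed to exist)
clf = AdaBoostClassifier(n_estimators=500, algorithm='SAMME.R', random_state=42)
clf.fit(X_train, y_train)

# Validation accuracy after each boosting round
staged_val = [accuracy_score(y_val, p)
              for p in clf.staged_predict(X_val)]

optimal_T = int(np.argmax(staged_val)) + 1   # rounds are counted from 1
print(f"Optimal rounds: {optimal_T}")

# Retrain with optimal T
clf_final = AdaBoostClassifier(n_estimators=optimal_T, algorithm='SAMME.R',
                               random_state=42)
clf_final.fit(X_train, y_train)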

Check if SAMME.R Is Appropriate

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.tree import DecisionTreeClassifier

# Check calibration of a single stump
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X_train, y_train)
proba = stump.predict_proba(X_val)

# One-vs-rest reliability curve per class; columns of proba follow stump.classes_
for idx, k in enumerate(stump.classes_):
    frac_pos, mean_pred = calibration_curve(
        (y_val == k).astype(int), proba[:, idx], n_bins=5
    )
    plt.plot(mean_pred, frac_pos, marker='o', label=f'Class {k}')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect')
plt.legend()
plt.title('Stump Calibration: Is SAMME.R Appropriate?')
plt.show()

If calibration is poor, use SAMME or pre-calibrate the base estimator:

from sklearn.calibration import CalibratedClassifierCV

calibrated_stump = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=1), method='isotonic', cv=3
)
clf = AdaBoostClassifier(
    estimator=calibrated_stump,
    n_estimators=200,
    algorithm='SAMME.R'
)

Tune learning_rate and n_estimators Together

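from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators':  [100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1.0]
}
search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
# A lower LR usually needs more rounds, so tune the two together
# Start with LR=0.1, n=200; then try LR=0.01, n=1000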

Multi-Class Probabilities from SAMME.R

# SAMME.R softmax output
proba = clf.predict_proba(X_test)    # Shape: (n_samples, K)

# Raw F scores before the softmax
# Shape: (n_samples, K) for K > 2, (n_samples,) for binary problems
decision = clf.decision_function(X_test)

17. When to Use Them

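Use SAMME when: the base learner only provides hard predictions (no predict_proba), or its probability estimates are poorly calibrated.

Use SAMME.R when: the base learner provides reasonably calibrated predict_proba output and faster convergence per round matters, and the installed scikit-learn release still offers it.

Prefer gradient boosting (XGBoost/LightGBM softmax) when: accuracy on tabular data is the priority, labels are noisy, or you need subsampling and explicit regularization.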


Summary

┌──────────────────────────────────────────────────────────────────────┐
│              SAMME / SAMME.R AT A GLANCE                            │
├──────────────────────────────────────────────────────────────────────┤
│  LOSS          Multi-class exponential: exp(−(1/K)·yᵀf(x))         │
│  SAMME ALPHA   ln((1−ε)/ε) + ln(K−1)    [hard label update]        │
│  SAMMER UPDATE (K−1) · [log p̂ₖ − mean_j log p̂ⱼ]   [prob update]  │
│  KEY TERM      ln(K−1): correct threshold for K-class random chance │
│  CONVERGENCE   Exponential decay in training error (both)           │
│  SAMMER EDGE   2–5× faster convergence using probability info       │
│  WEAKNESS      Noise-sensitive; SAMME.R needs calibrated probs      │
│  vs GBT        Principled but weaker; GBT dominates tabular tasks   │
│  BEST FOR      Multi-class with probabilistic weak learners          │
└──────────────────────────────────────────────────────────────────────┘

SAMME and SAMME.R represent the completion of AdaBoost's theoretical program. Binary AdaBoost answered "how do we boost binary classifiers?" — and the answer turned out to be intimately tied to the binary exponential loss. SAMME answered "what does that mean for K classes?" — and the answer required deriving a new loss, a new margin concept, and the surprising ln(K−1) correction that distinguishes genuine multi-class boosting from a naive binary re-application. SAMME.R went further and asked "what if the weak learner tells us more than a single label?" — and found that the full probability vector gives the optimal gradient direction in the K-dimensional function space. Together, they close the circle from AdaBoost to gradient boosting for multi-class problems.