Lpboost and Totalboost

Linear Programming Boosting and Totally Corrective Boosting

"What if boosting were an optimization problem you could solve exactly?"

1. What Are LPBoost and TotalBoost?

LPBoost (Linear Programming Boost) and TotalBoost (Totally Corrective Boosting) are boosting algorithms that reformulate the boosting problem as exact optimization problems — linear programming for LPBoost and entropy-regularized optimization for TotalBoost — rather than the greedy stagewise approximations used by AdaBoost.

Both share a common theoretical lens: margin maximization. Instead of asking "how do I reduce training loss one step at a time?", they ask "how do I find the combination of weak classifiers that maximizes the minimum margin over all training examples?"

Property	LPBoost	TotalBoost
Introduced by	Demiriz, Bennett, Shawe-Taylor (2002)	Warmuth, Liao, Rätsch (2006)
Objective	Maximize minimum margin	Maximize minimum margin (entropy reg)
Optimization method	Linear programming + column generation	Entropy regularization (online opt.)
Classifier weights	Arbitrary non-negative	Probability simplex (sum to 1)
Re-optimization	✅ All previous weights re-optimized	✅ All previous weights re-optimized
AdaBoost relationship	Implicit margin maximization	Explicit connection via entropy
Theoretical guarantee	Finds max-margin solution exactly	Finds max-margin solution (entropy approx.)
Practical use	Sparse ensembles	Competitive with AdaBoost

Key distinction from AdaBoost and GBT:

AdaBoost and gradient boosting are forward stagewise — they add one weak classifier per round and never revisit previous weights. LPBoost and TotalBoost are totally corrective — after adding each new weak classifier, they re-optimize all classifier weights simultaneously.

2. The Unifying Framework: Margin Maximization

2.1 The Ensemble Margin

Given an ensemble of T weak classifiers h₁, h₂, ..., hₜ with non-negative weights λ₁, λ₂, ..., λₜ (summing to 1), the ensemble margin for training sample (xᵢ, yᵢ) is:

margin(xᵢ, yᵢ) = yᵢ · Σₜ λₜ hₜ(xᵢ)

Where hₜ(xᵢ) ∈ {−1, +1}. For a correctly classified sample with high confidence, the margin is large and positive. For a misclassified sample, it is negative.

The minimum margin over all training samples:

ρ = min_i  yᵢ · Σₜ λₜ hₜ(xᵢ)

2.2 Why Maximizing the Minimum Margin Matters

The generalization error of an ensemble is bounded by:

P(error) ≤ (1/m) · |{i : margin(xᵢ, yᵢ) ≤ θ}| + O(√(VC_dim / m) / θ)

For a fixed hypothesis class, maximizing θ (the minimum margin) minimizes the bound. This is the ensemble analog of SVM's margin maximization — and it provides the same theoretical motivation.

AdaBoost's implicit relationship to margins: Schapire et al. (1998) proved that AdaBoost implicitly tends to maximize margins, even without explicitly formulating a margin maximization problem. However, it does not guarantee finding the maximum margin solution.

LPBoost and TotalBoost explicitly maximize the margin — they solve the problem AdaBoost only approaches asymptotically.

3. LPBoost — Linear Programming Boosting

3.1 The Primal LP Problem

Given a set of weak classifiers {h₁, ..., hₙ} (possibly very large — all possible stumps), LPBoost seeks weights λₜ ≥ 0 that maximize the minimum margin:

Hard margin (no misclassification allowed):

maximize   ρ
subject to:
    yᵢ · Σₜ λₜ hₜ(xᵢ) ≥ ρ      for all i = 1,...,m     (margin constraint)
    Σₜ λₜ = 1                                              (simplex constraint)
    λₜ ≥ 0                        for all t               (non-negativity)

Variables: ρ ∈ ℝ (margin), λ ∈ ℝᴺ (classifier weights).

This is a linear program with m + 1 constraints (margin + normalization) and N + 1 variables (one per classifier + ρ). For N = all possible decision stumps, N can be exponentially large — but this is handled by column generation.

3.2 The Dual LP Problem

The dual of the hard-margin LP reveals the connection to AdaBoost:

Primal (in matrix form):

max ρ  s.t.  H^T λ ≥ ρ 1, 1^T λ = 1, λ ≥ 0

Where H is the m × N matrix with H_{it} = yᵢ hₜ(xᵢ).

Dual:

min  Σᵢ dᵢ
subject to:
    Σᵢ dᵢ yᵢ hₜ(xᵢ) ≤ 0      for all t         (weak learner bound)
    Σᵢ dᵢ = 1                                     (normalization)
    dᵢ ≥ 0                      for all i         (non-negativity)

The dual variables dᵢ are sample weights — identical to AdaBoost's sample weights D(i).

The dual constraint Σᵢ dᵢ yᵢ hₜ(xᵢ) ≤ 0 says: at the optimal solution, no weak classifier hₜ achieves positive weighted margin — all selected classifiers are "used up" at the margin.

Strong duality (LP has no duality gap):

max ρ  =  min Σᵢ dᵢ  (when both are feasible)

The primal maximum margin equals the dual minimum objective — a beautiful symmetry.

3.3 Column Generation

The LP has N variables — one per possible weak classifier. With N = all possible stumps, this is exponentially large. Column generation solves this efficiently:

Start with a small initial set of classifiers (or just the first stump).

Loop:
    1. Solve the restricted LP (using only current classifiers)
    2. Get current dual variables (sample weights) d* from the LP dual
    3. Find the new column (classifier) that most violates the dual constraint:
       hₙₑₓₜ = argmax_{h} Σᵢ dᵢ* yᵢ h(xᵢ)
       (= find the weak learner with lowest weighted error rate under d*)
    4. If Σᵢ dᵢ* yᵢ hₙₑₓₜ(xᵢ) > ρ* + ε:  add hₙₑₓₜ to the LP and repeat
    5. Else: current solution is optimal — stop

Step 3 is equivalent to training a weak learner on weighted samples — exactly what AdaBoost does! The column generation oracle is AdaBoost's weak learner oracle.

Key insight: LPBoost and AdaBoost use the same weak learner oracle (find the classifier minimizing weighted error), but differ in what they do with it:

AdaBoost: Add the classifier with fixed weight αₜ and update sample weights heuristically
LPBoost: Add the classifier as a new LP variable, then re-solve the LP to find globally optimal weights

3.4 The LPBoost Algorithm

Input: Training data {(xᵢ, yᵢ)}, weak learner oracle WL, tolerance ε, ν (soft margin)

Initialize:
    Active set Λ = {}    (no classifiers yet)
    d = 1/m  for all i  (uniform sample weights)

Repeat:

    1. Find new weak classifier:
       hₙₑₓₜ = WL({(xᵢ, yᵢ, dᵢ)})    (minimize weighted error)

    2. Compute edge of hₙₑₓₜ:
       edge = Σᵢ dᵢ yᵢ hₙₑₓₜ(xᵢ) = 1 − 2εₙₑₓₜ

    3. Add to active set:
       Λ ← Λ ∪ {hₙₑₓₜ}

    4. Re-solve LP over all classifiers in Λ:
       (λ*, ρ*) = solve_LP(Λ)

    5. Update dual sample weights d* from LP solution

    6. Check stopping condition:
       edge ≤ ρ* + ε  →  converged, stop

Output: Ensemble H(x) = sign(Σₜ λₜ* hₜ(x))

After convergence, LPBoost has found:

The globally optimal classifier weights λₜ* for the maximum-margin ensemble
A sparse solution — many λₜ = 0 (the LP solution is at a vertex of the feasible polytope)

3.5 Soft Margin LPBoost

The hard-margin LP fails when data is not linearly separable by any ensemble. The soft margin extension introduces slack variables ξᵢ ≥ 0:

maximize   ρ − (1/νm) Σᵢ ξᵢ
subject to:
    yᵢ · Σₜ λₜ hₜ(xᵢ) ≥ ρ − ξᵢ      for all i
    Σₜ λₜ = 1
    λₜ ≥ 0,  ξᵢ ≥ 0

The regularization parameter ν ∈ (0, 1] controls the soft-margin trade-off:

ν → 0: Hard margin (no violations allowed)
ν → 1: Allow many violations (focus on average performance)
ν has an interpretation: ν is an upper bound on the fraction of margin violations and a lower bound on the fraction of support vectors

This is identical in spirit to SVM's soft margin (C parameter), with ν playing the role of 1/C.

ν ≈ 0.05:   Allow 5% margin violations — good for noisy data
ν ≈ 0.5:    Heavily regularized — larger margins, more misclassifications

4. TotalBoost — Totally Corrective Boosting

4.1 What Does "Totally Corrective" Mean?

"Totally corrective" refers to the re-optimization of all classifier weights after each round — as opposed to the "one step then never look back" approach of AdaBoost.

In AdaBoost:

Round t adds classifier hₜ with weight αₜ computed from the current round's error
Weights α₁, ..., αₜ₋₁ from previous rounds are never changed
This is an approximation — the optimal weight for h₁ should change as new classifiers are added

In TotalBoost (and LPBoost):

After adding each new hₜ, all weights λ₁, ..., λₜ are jointly re-optimized
The ensemble improves at every step not just by adding new information, but by redistributing weight among all classifiers

4.2 The TotalBoost Objective

TotalBoost maximizes the minimum margin with an entropy regularization term:

maximize   ρ − (1/β) · KL(λ || u)
subject to:
    yᵢ · Σₜ λₜ hₜ(xᵢ) ≥ ρ      for all i
    Σₜ λₜ = 1,   λₜ ≥ 0

Where:

KL(λ || u) = Σₜ λₜ log(λₜ / uₜ) is the KL divergence from λ to a uniform prior u = (1/T, ..., 1/T)
β > 0 is an inverse temperature parameter controlling the strength of regularization

The entropy term's effect:

No regularization (β → ∞):    LPBoost — pure margin maximization → sparse solution
Heavy regularization (β → 0): Maximum entropy → all classifiers get equal weight (uniform)

The KL term penalizes solutions that concentrate weight on few classifiers, encouraging diverse weight distributions. This is analogous to L2 regularization (SVM's regularization) but in the space of ensemble weights rather than kernel weights.

4.3 The TotalBoost Algorithm

Input: Training data {(xᵢ, yᵢ)}, weak learner WL
       Regularization β, tolerance ε

Initialize:
    λ = {}    (empty ensemble)
    d = 1/m   (uniform sample weights)

Repeat:

    1. Find new weak classifier:
       hₙₑₓₜ = WL({(xᵢ, yᵢ, dᵢ)})

    2. Add to ensemble:
       λ ← λ ∪ {λₙₑₓₜ with initial weight 0}

    3. Re-optimize ALL weights jointly:
       Solve: max_{λ,ρ} ρ − (1/β) KL(λ || u)
              s.t. yᵢ Σₜ λₜ hₜ(xᵢ) ≥ ρ,  Σλₜ=1, λₜ≥0

       (This is a convex optimization with closed-form dual)

    4. Update sample weights from dual:
       dᵢ ∝ exp(−β · yᵢ · Σₜ λₜ* hₜ(xᵢ))    (Gibbs distribution)

    5. Check convergence:
       If edge of hₙₑₓₜ ≤ ρ* + ε: stop

Output: H(x) = sign(Σₜ λₜ* hₜ(x))

4.4 Entropy Regularization

The key mathematical insight of TotalBoost: when entropy regularization is added to the LPBoost objective, the dual problem has a beautiful closed form.

The Lagrangian of the primal gives dual sample weights:

dᵢ* ∝ exp(−β · yᵢ · Σₜ λₜ hₜ(xᵢ))
     = exp(−β · margin(xᵢ, yᵢ))

This is a Gibbs/Boltzmann distribution over training samples, with temperature 1/β. Samples with large positive margin get exponentially down-weighted; samples near zero margin get up-weighted.

The connection to AdaBoost: When β → ∞ (no regularization) and using the forward stagewise approximation, this Gibbs distribution reduces to AdaBoost's exponential sample weights:

AdaBoost weights:     wᵢ ∝ exp(−yᵢ · Σₜ αₜ hₜ(xᵢ))
TotalBoost weights:   dᵢ ∝ exp(−β · yᵢ · Σₜ λₜ hₜ(xᵢ))

They are identical in form — the only difference is that TotalBoost uses globally optimal λₜ while AdaBoost uses greedy forward stagewise αₜ. AdaBoost is a greedy approximation to TotalBoost's exact optimization.

5. How AdaBoost Relates to Both

The relationship forms a clean hierarchy:

AdaBoost
├── Forward stagewise approximation to max-margin
├── Greedy: adds one classifier, fixes its weight, never re-optimizes
└── Implicitly maximizes margins (Schapire et al., 1998) — but not exactly

LPBoost
├── Exact LP formulation of the max-margin problem
├── Totally corrective: re-optimizes ALL weights after each round
└── Uses column generation — same weak learner oracle as AdaBoost

TotalBoost
├── Entropy-regularized max-margin (smoothed LPBoost)
├── Totally corrective: re-optimizes ALL weights
├── Dual sample weights are Gibbs distribution — same form as AdaBoost
└── AdaBoost = TotalBoost with greedy approximation + β → ∞

The progression: AdaBoost → TotalBoost → LPBoost represents increasing exactness of the optimization, at the cost of increasing computational complexity per round.

6. Column Generation — The Shared Mechanism

Both LPBoost and TotalBoost use column generation — a technique from combinatorial optimization for solving LP/convex programs with exponentially many variables:

Core idea:
    Instead of including all N possible classifiers in the LP from the start
    (which would be computationally intractable),
    solve the problem iteratively by adding one column (classifier) at a time.

At each iteration:
    1. Solve the current restricted problem (LP over active classifiers only)
    2. Check if any inactive column (classifier) can improve the objective
       → The "pricing problem" = train a weak learner on current dual weights
    3. If yes, add that classifier and re-solve
    4. If no improving column exists, the current solution is optimal

Optimality certificate:
    When no weak learner achieves edge > ρ* + ε,
    we have a certificate that the LP objective cannot be improved
    → Global optimality (for the current hypothesis class)

Column generation is exact — it provably finds the global optimum (subject to the hypothesis class and numerical precision). This is impossible with AdaBoost's greedy approach.

Practical implication: Column generation may require far fewer classifiers than AdaBoost to achieve the same margin. LPBoost solutions are often sparse — a small number of classifiers with non-zero weight captures the optimal ensemble. AdaBoost needs many more rounds to converge to the same margin.

7. Convergence Properties

LPBoost

Under mild conditions (the weak learner oracle always finds a classifier with edge ε above the current margin), LPBoost converges to the maximum-margin solution in:

T ≤ O(1/ε² · log m)  rounds

This is the same asymptotic bound as AdaBoost. However, LPBoost's totally corrective updates make the constant much smaller in practice — it typically uses 2–10× fewer rounds than AdaBoost for the same margin.

After convergence, the LP duality gap is zero — LPBoost has the optimal solution for the max-margin ensemble problem.

TotalBoost

TotalBoost converges to the entropy-regularized max-margin solution. The convergence rate depends on β:

For fixed β:  Converges to KL-regularized maximum margin in O(log(m) / ε²) rounds
β → ∞:        Converges to hard-margin LPBoost solution (but slower)

The entropy regularization makes TotalBoost's optimization strictly convex — guaranteeing unique solutions and smooth convergence, unlike the LP which may have non-unique optima.

8. Geometric Interpretation

LPBoost Geometry

The ensemble weights λ lie in the probability simplex Δ^N. The margin constraint defines a polyhedral feasible region. LPBoost finds the point in this polytope that maximizes the minimum margin:

Max-margin ensemble ↔ Chebyshev center of the constraint polytope

The LP solution is at a vertex of the feasible polytope — a sparse point where only a few classifiers have non-zero weight. This explains LPBoost's tendency to produce sparse ensembles.

TotalBoost Geometry

Adding the KL regularization transforms the feasible region from a polytope to a curved manifold. The optimal solution is in the interior of the simplex (not at a vertex) — all classifiers get some non-zero weight. The entropy term smoothly interpolates between the LP solution (sparse) and the uniform distribution (maximum entropy).

9. The Bias-Variance Profile

Both algorithms directly maximize margin, which is the primary driver of generalization in boosting:

Configuration	Bias	Variance	Notes
LPBoost, hard margin	Low	Medium	Exact max-margin, may overfit noise
LPBoost, soft margin (ν)	Low	Low	ν controls bias-variance trade-off
TotalBoost, large β	Low	Medium	Approaches LPBoost
TotalBoost, small β	Medium	Low	Heavy entropy reg. → smoother model

Generalization bound (margin theory):

For an ensemble achieving minimum margin ρ on training data:

P(test error) ≤ O(1 / (m · ρ²) + VC_dim · log(m) / m)

LPBoost and TotalBoost maximize ρ explicitly — they are the most principled way to achieve the smallest generalization bound among boosting algorithms.

10. Assumptions

Assumption	LPBoost	TotalBoost
Weak learner consistency	Must find improving column	Must produce useful proba.
Linear separability (hard margin)	Required for convergence	Not required (entropy reg.)
IID samples	✅ Standard	✅ Standard
Finite hypothesis class	Assumed (finite N) or oracle-based	Same
No feature scaling	✅ Tree-based	✅ Tree-based
Clean labels (hard margin)	✅ Very sensitive	⚠️ Better (soft margin)

11. Advantages

✅ Provable Optimality (LPBoost)

LPBoost produces a certified optimal solution for the maximum-margin ensemble — the best possible generalization bound for the given hypothesis class. No other boosting algorithm achieves this.

✅ Sparse Ensembles

LP solutions naturally produce sparse weight vectors — many classifiers get zero weight. Sparse ensembles are faster at prediction and easier to interpret.

✅ Totally Corrective Updates

Re-optimizing all weights after each round is theoretically superior to greedy forward stagewise. Fewer rounds needed, and the final model is globally optimized.

✅ Principled Regularization (TotalBoost)

The entropy regularization parameter β provides clean, theoretically grounded control of model complexity — analogous to SVM's C parameter but for ensembles.

✅ Connection to SVM Theory

Both algorithms are intimately connected to large-margin classification theory — the same theoretical tools that explain SVM generalization apply here.

✅ AdaBoost Connection (TotalBoost)

TotalBoost formally explains AdaBoost as a greedy approximation to an exact optimization — providing theoretical grounding for why AdaBoost works.

12. Drawbacks & Limitations

❌ Computationally Expensive Re-Optimization

The "totally corrective" re-optimization at each round requires solving an LP (LPBoost) or a convex program (TotalBoost) over all active classifiers. For T rounds and an LP with T variables, this is O(T²) LP solves — much more expensive than AdaBoost's O(T) operations.

❌ No Major Production Implementations

There is no widely maintained Python/sklearn implementation of LPBoost or TotalBoost. Research implementations exist but are not production-quality.

❌ LP Solver Dependency

LPBoost requires an LP solver (e.g., CVXPY, scipy.optimize.linprog, Gurobi). Adding a solver dependency to a production ML pipeline is often impractical.

❌ Hard Margin Fails on Noisy Data

The hard-margin LP is undefined when training data is not linearly separable by the ensemble — the LP is infeasible. Soft margin LPBoost addresses this but adds another hyperparameter (ν).

❌ Not Clearly Better in Practice Than Tuned AdaBoost

Despite provable optimality on the training margin, LPBoost and TotalBoost do not consistently outperform well-tuned AdaBoost or gradient boosting on benchmark datasets. The training margin optimality doesn't always translate to better test performance.

❌ Convergence Can Stall

Column generation may cycle or converge slowly if the LP oracle produces classifiers with very small improving edges. Careful ε thresholding is required.

13. LPBoost vs. TotalBoost vs. AdaBoost vs. GBT

Property	LPBoost	TotalBoost	AdaBoost	GBT
Optimization	Exact LP	Exact convex	Greedy stagewise	Greedy stagewise
Weight re-optimization	✅ All rounds	✅ All rounds	❌ Never	❌ Never
Margin maximization	✅ Exact	✅ Exact (reg.)	⚠️ Implicit	❌ Different objective
Loss function	0-1 margin	Entropy-margin	Exponential	Any differentiable
Sparsity	✅ Natural (LP vertex)	⚠️ Moderate	❌ Uses all T	❌ Uses all T
Noise robustness	⚠️ Hard margin sensitive	✅ Better (ent. reg.)	❌ Poor	✅ Good (log-loss)
Computational cost/round	❌ LP solve	❌ Convex program	✅ O(m·p)	✅ O(m·p)
Production implementations	❌ Research only	❌ Research only	✅ sklearn	✅✅ XGBoost/LGB
Theory (generalization)	✅ Best margin bound	✅ Best margin bound	⚠️ Good	⚠️ Loss-based
Practical accuracy	⚠️ Competitive	⚠️ Competitive	✅ Good	✅✅ Best tabular

14. Practical Guidance

Implementing LPBoost with scipy

import numpy as np
from scipy.optimize import linprog
from sklearn.tree import DecisionTreeClassifier

class LPBoost:
    def __init__(self, T=50, nu=0.1, epsilon=1e-4, max_depth=1):
        self.T = T
        self.nu = nu
        self.epsilon = epsilon
        self.max_depth = max_depth
        self.classifiers = []
        self.weights = []

    def fit(self, X, y):
        m = len(y)
        d = np.ones(m) / m   # Initial uniform weights
        self.classifiers = []
        self.weights = []

        for t in range(self.T):
            # Step 1: Train weak learner on weighted samples
            clf = DecisionTreeClassifier(max_depth=self.max_depth)
            clf.fit(X, y, sample_weight=d)
            preds = clf.predict(X)   # {-1, +1} or {0, 1}
            self.classifiers.append(clf)

            # Step 2: Compute margin matrix column for this classifier
            if t == 0:
                H = (y * preds).reshape(m, 1)
            else:
                H = np.column_stack([H, y * preds])

            # Step 3: Solve LP using linprog
            # Soft margin LP:
            # max rho - (1/(nu*m)) * sum(xi)
            # <=> min -rho + (1/(nu*m)) * sum(xi)
            # Variables: [lambda (T), rho (1), xi (m)]
            T_curr = H.shape[1]
            n_vars = T_curr + 1 + m

            c = np.zeros(n_vars)
            c[T_curr] = -1.0           # -rho (maximize rho)
            c[T_curr+1:] = 1.0/(self.nu * m)   # slack penalty

            # Constraint: H @ lambda >= rho - xi
            # => -H @ lambda + rho*1 - xi <= 0
            A_ub = np.zeros((m, n_vars))
            A_ub[:, :T_curr] = -H
            A_ub[:, T_curr] = 1.0
            A_ub[:, T_curr+1:] = -np.eye(m)
            b_ub = np.zeros(m)

            # Equality: sum(lambda) = 1
            A_eq = np.zeros((1, n_vars))
            A_eq[0, :T_curr] = 1.0
            b_eq = [1.0]

            # Bounds: lambda >= 0, rho free, xi >= 0
            bounds = [(0, None)] * T_curr + [(None, None)] + [(0, None)] * m

            result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                             A_eq=A_eq, b_eq=b_eq,
                             bounds=bounds, method='highs')

            if not result.success:
                break

            lambda_opt = result.x[:T_curr]
            rho = result.x[T_curr]

            # Step 4: Compute dual weights (sample weights for next round)
            # Dual of LP gives updated d
            # Approximation: use margin-based reweighting
            margins = H @ lambda_opt
            d = np.maximum(rho - margins, 0)
            d = d / d.sum() if d.sum() > 0 else np.ones(m)/m

            # Check convergence: edge of new classifier
            edge = np.dot(d, y * preds)
            if edge <= rho + self.epsilon:
                break

        self.weights = lambda_opt
        self.rho = rho
        return self

    def predict(self, X):
        preds = np.zeros(len(X))
        for clf, w in zip(self.classifiers, self.weights):
            preds += w * clf.predict(X)
        return np.sign(preds)

# Usage
clf = LPBoost(T=50, nu=0.1)
clf.fit(X_train, y_train)
print(f"Number of non-zero classifiers: {(clf.weights > 1e-6).sum()}")

Implementing TotalBoost (Entropy Regularization)

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class TotalBoost:
    def __init__(self, T=100, beta=10.0, epsilon=1e-5, max_depth=1):
        self.T = T
        self.beta = beta
        self.epsilon = epsilon
        self.max_depth = max_depth
        self.classifiers = []
        self.weights = []

    def fit(self, X, y):
        m = len(y)
        d = np.ones(m) / m
        self.classifiers = []

        H = np.zeros((m, 0))   # Margin matrix

        for t in range(self.T):
            # Train weak learner
            clf = DecisionTreeClassifier(max_depth=self.max_depth)
            clf.fit(X, y, sample_weight=d)
            preds = clf.predict(X)
            self.classifiers.append(clf)

            # Add column to H
            H = np.column_stack([H, y * preds]) if H.size else (y * preds).reshape(-1, 1)

            # Entropy-regularized optimization (approximate):
            # Update lambda proportional to exp(-beta * margin)
            # This is the TotalBoost update in closed-form approximation
            T_curr = H.shape[1]
            lambda_curr = np.ones(T_curr) / T_curr   # Start uniform

            # Iterative projection (Sinkhorn-like)
            for _ in range(50):
                margins = H @ lambda_curr
                d_new = np.exp(-self.beta * margins)
                d_new /= d_new.sum()

                # Recompute lambda via dual relationship
                # lambda_t proportional to sum_i d_i * H_{it}
                scores = H.T @ d_new
                lambda_new = np.maximum(scores, 0)
                if lambda_new.sum() > 0:
                    lambda_new /= lambda_new.sum()
                else:
                    lambda_new = np.ones(T_curr) / T_curr

                if np.max(np.abs(lambda_new - lambda_curr)) < 1e-6:
                    break
                lambda_curr = lambda_new

            d = d_new
            self.weights = lambda_curr

            # Check convergence
            edge = H[:, -1] @ d
            rho = np.min(H @ self.weights)
            if edge <= rho + self.epsilon:
                break

        return self

    def predict(self, X):
        preds = np.zeros(len(X))
        for clf, w in zip(self.classifiers, self.weights):
            preds += w * clf.predict(X)
        return np.sign(preds)

clf = TotalBoost(T=100, beta=10.0)
clf.fit(X_train, y_train)

Using CVXPY for Cleaner LPBoost

import cvxpy as cp
import numpy as np

def solve_lpboost(H, nu):
    """
    H: (m, T) matrix — H[i,t] = y_i * h_t(x_i)
    nu: soft margin parameter
    Returns: (lambda, rho)
    """
    m, T = H.shape
    lam = cp.Variable(T, nonneg=True)
    rho = cp.Variable()
    xi  = cp.Variable(m, nonneg=True)

    objective = cp.Maximize(rho - (1/(nu*m)) * cp.sum(xi))
    constraints = [
        H @ lam >= rho - xi,    # Margin constraints with slack
        cp.sum(lam) == 1,        # Simplex constraint
    ]
    prob = cp.Problem(objective, constraints)
    prob.solve(solver=cp.ECOS)

    return lam.value, rho.value

15. When to Use Them

Use LPBoost when:

You need a provably optimal ensemble for the max-margin problem
Sparse ensembles are required for fast inference or interpretability
You have access to an LP solver (scipy, CVXPY, Gurobi)
Dataset is small to medium (LP re-solve cost is O(T²))
You are doing research on boosting theory or margin maximization
You want to understand the theoretical optimality of boosting methods

Use TotalBoost when:

You want the AdaBoost connection made explicit and precise
Entropy regularization provides better generalization than hard-margin LP
You need a smoother optimization landscape than LP
You are studying the theoretical relationship between AdaBoost and max-margin

Use AdaBoost instead for:

Production binary classification without an LP solver dependency
Fast training — O(m·p) per round vs. LPBoost's LP solve

Use Gradient Boosting (XGBoost/LightGBM) instead for:

Maximum accuracy on tabular data
Large datasets where LP re-solve is infeasible
Regression or multi-class problems at scale

Summary

┌──────────────────────────────────────────────────────────────────────┐
│                LPBOOST & TOTALBOOST AT A GLANCE                     │
├──────────────────────────────────────────────────────────────────────┤
│  SHARED GOAL    Maximize minimum margin over training data          │
│  SHARED MECH    Column generation = train weak learner on d*        │
│  SHARED PROP    Totally corrective: re-optimize ALL weights/round   │
│                                                                      │
│  LPBOOST        Exact LP: max ρ  s.t.  H·λ ≥ ρ, Σλ=1, λ≥0        │
│  LP RESULT      Sparse weights; zero duality gap; certified optimal │
│  SOFT MARGIN    + ξ slack + ν parameter for noise tolerance         │
│                                                                      │
│  TOTALBOOST     Entropy-reg: max ρ − (1/β)·KL(λ||u)               │
│  GIBBS WEIGHTS  dᵢ ∝ exp(−β · margin_i)  [Boltzmann distribution]  │
│  ADABOOST LINK  AdaBoost = TotalBoost with greedy approx + β→∞     │
│                                                                      │
│  STRENGTH       Optimality, sparsity, theory, margin certificate    │
│  WEAKNESS       LP solver needed, slow, no production impl.         │
│  BEST FOR       Research, small data, sparse ensembles              │
└──────────────────────────────────────────────────────────────────────┘

LPBoost and TotalBoost represent the mathematical completion of the boosting program. Where AdaBoost approximated, they optimize. Where AdaBoost guessed at weights, they solve for them. Where AdaBoost drifted toward a maximum-margin solution, they arrive there directly. The price is computational — every round requires solving a program instead of just training a stump — but the reward is a provable certificate: the ensemble you have is the best possible ensemble for the given hypothesis class. That guarantee is rare in machine learning, and for problems where theoretical rigor matters as much as empirical performance, LPBoost and TotalBoost offer something that neither AdaBoost nor gradient boosting can: proof.