Lpboost and Totalboost
Linear Programming Boosting and Totally Corrective Boosting
"What if boosting were an optimization problem you could solve exactly?"
1. What Are LPBoost and TotalBoost?
LPBoost (Linear Programming Boost) and TotalBoost (Totally Corrective Boosting) are boosting algorithms that reformulate the boosting problem as exact optimization problems — linear programming for LPBoost and entropy-regularized optimization for TotalBoost — rather than the greedy stagewise approximations used by AdaBoost.
Both share a common theoretical lens: margin maximization. Instead of asking "how do I reduce training loss one step at a time?", they ask "how do I find the combination of weak classifiers that maximizes the minimum margin over all training examples?"
| Property | LPBoost | TotalBoost |
|---|---|---|
| Introduced by | Demiriz, Bennett, Shawe-Taylor (2002) | Warmuth, Liao, Rätsch (2006) |
| Objective | Maximize minimum margin | Maximize minimum margin (entropy reg) |
| Optimization method | Linear programming + column generation | Entropy regularization (online opt.) |
| Classifier weights | Arbitrary non-negative | Probability simplex (sum to 1) |
| Re-optimization | ✅ All previous weights re-optimized | ✅ All previous weights re-optimized |
| AdaBoost relationship | Implicit margin maximization | Explicit connection via entropy |
| Theoretical guarantee | Finds max-margin solution exactly | Finds max-margin solution (entropy approx.) |
| Practical use | Sparse ensembles | Competitive with AdaBoost |
Key distinction from AdaBoost and GBT:
AdaBoost and gradient boosting are forward stagewise — they add one weak classifier per round and never revisit previous weights. LPBoost and TotalBoost are totally corrective — after adding each new weak classifier, they re-optimize all classifier weights simultaneously.
2. The Unifying Framework: Margin Maximization
2.1 The Ensemble Margin
Given an ensemble of T weak classifiers h₁, h₂, ..., hₜ with non-negative weights λ₁, λ₂, ..., λₜ (summing to 1), the ensemble margin for training sample (xᵢ, yᵢ) is:
margin(xᵢ, yᵢ) = yᵢ · Σₜ λₜ hₜ(xᵢ)
Where hₜ(xᵢ) ∈ {−1, +1}. For a correctly classified sample with high confidence, the margin is large and positive. For a misclassified sample, it is negative.
The minimum margin over all training samples:
ρ = min_i yᵢ · Σₜ λₜ hₜ(xᵢ)
2.2 Why Maximizing the Minimum Margin Matters
The generalization error of an ensemble is bounded by:
P(error) ≤ (1/m) · |{i : margin(xᵢ, yᵢ) ≤ θ}| + O(√(VC_dim / m) / θ)
For a fixed hypothesis class, maximizing θ (the minimum margin) minimizes the bound. This is the ensemble analog of SVM's margin maximization — and it provides the same theoretical motivation.
AdaBoost's implicit relationship to margins: Schapire et al. (1998) proved that AdaBoost implicitly tends to maximize margins, even without explicitly formulating a margin maximization problem. However, it does not guarantee finding the maximum margin solution.
LPBoost and TotalBoost explicitly maximize the margin — they solve the problem AdaBoost only approaches asymptotically.
3. LPBoost — Linear Programming Boosting
3.1 The Primal LP Problem
Given a set of weak classifiers {h₁, ..., hₙ} (possibly very large — all possible stumps), LPBoost seeks weights λₜ ≥ 0 that maximize the minimum margin:
Hard margin (no misclassification allowed):
maximize ρ
subject to:
yᵢ · Σₜ λₜ hₜ(xᵢ) ≥ ρ for all i = 1,...,m (margin constraint)
Σₜ λₜ = 1 (simplex constraint)
λₜ ≥ 0 for all t (non-negativity)
Variables: ρ ∈ ℝ (margin), λ ∈ ℝᴺ (classifier weights).
This is a linear program with m + 1 constraints (margin + normalization) and N + 1 variables (one per classifier + ρ). For N = all possible decision stumps, N can be exponentially large — but this is handled by column generation.
3.2 The Dual LP Problem
The dual of the hard-margin LP reveals the connection to AdaBoost:
Primal (in matrix form):
max ρ s.t. H^T λ ≥ ρ 1, 1^T λ = 1, λ ≥ 0
Where H is the m × N matrix with H_{it} = yᵢ hₜ(xᵢ).
Dual:
min Σᵢ dᵢ
subject to:
Σᵢ dᵢ yᵢ hₜ(xᵢ) ≤ 0 for all t (weak learner bound)
Σᵢ dᵢ = 1 (normalization)
dᵢ ≥ 0 for all i (non-negativity)
The dual variables dᵢ are sample weights — identical to AdaBoost's sample weights D(i).
The dual constraint Σᵢ dᵢ yᵢ hₜ(xᵢ) ≤ 0 says: at the optimal solution, no weak classifier hₜ achieves positive weighted margin — all selected classifiers are "used up" at the margin.
Strong duality (LP has no duality gap):
max ρ = min Σᵢ dᵢ (when both are feasible)
The primal maximum margin equals the dual minimum objective — a beautiful symmetry.
3.3 Column Generation
The LP has N variables — one per possible weak classifier. With N = all possible stumps, this is exponentially large. Column generation solves this efficiently:
Start with a small initial set of classifiers (or just the first stump).
Loop:
1. Solve the restricted LP (using only current classifiers)
2. Get current dual variables (sample weights) d* from the LP dual
3. Find the new column (classifier) that most violates the dual constraint:
hₙₑₓₜ = argmax_{h} Σᵢ dᵢ* yᵢ h(xᵢ)
(= find the weak learner with lowest weighted error rate under d*)
4. If Σᵢ dᵢ* yᵢ hₙₑₓₜ(xᵢ) > ρ* + ε: add hₙₑₓₜ to the LP and repeat
5. Else: current solution is optimal — stop
Step 3 is equivalent to training a weak learner on weighted samples — exactly what AdaBoost does! The column generation oracle is AdaBoost's weak learner oracle.
Key insight: LPBoost and AdaBoost use the same weak learner oracle (find the classifier minimizing weighted error), but differ in what they do with it:
- AdaBoost: Add the classifier with fixed weight αₜ and update sample weights heuristically
- LPBoost: Add the classifier as a new LP variable, then re-solve the LP to find globally optimal weights
3.4 The LPBoost Algorithm
Input: Training data {(xᵢ, yᵢ)}, weak learner oracle WL, tolerance ε, ν (soft margin)
Initialize:
Active set Λ = {} (no classifiers yet)
d = 1/m for all i (uniform sample weights)
Repeat:
1. Find new weak classifier:
hₙₑₓₜ = WL({(xᵢ, yᵢ, dᵢ)}) (minimize weighted error)
2. Compute edge of hₙₑₓₜ:
edge = Σᵢ dᵢ yᵢ hₙₑₓₜ(xᵢ) = 1 − 2εₙₑₓₜ
3. Add to active set:
Λ ← Λ ∪ {hₙₑₓₜ}
4. Re-solve LP over all classifiers in Λ:
(λ*, ρ*) = solve_LP(Λ)
5. Update dual sample weights d* from LP solution
6. Check stopping condition:
edge ≤ ρ* + ε → converged, stop
Output: Ensemble H(x) = sign(Σₜ λₜ* hₜ(x))
After convergence, LPBoost has found:
- The globally optimal classifier weights λₜ* for the maximum-margin ensemble
- A sparse solution — many λₜ = 0 (the LP solution is at a vertex of the feasible polytope)
3.5 Soft Margin LPBoost
The hard-margin LP fails when data is not linearly separable by any ensemble. The soft margin extension introduces slack variables ξᵢ ≥ 0:
maximize ρ − (1/νm) Σᵢ ξᵢ
subject to:
yᵢ · Σₜ λₜ hₜ(xᵢ) ≥ ρ − ξᵢ for all i
Σₜ λₜ = 1
λₜ ≥ 0, ξᵢ ≥ 0
The regularization parameter ν ∈ (0, 1] controls the soft-margin trade-off:
- ν → 0: Hard margin (no violations allowed)
- ν → 1: Allow many violations (focus on average performance)
- ν has an interpretation: ν is an upper bound on the fraction of margin violations and a lower bound on the fraction of support vectors
This is identical in spirit to SVM's soft margin (C parameter), with ν playing the role of 1/C.
ν ≈ 0.05: Allow 5% margin violations — good for noisy data
ν ≈ 0.5: Heavily regularized — larger margins, more misclassifications
4. TotalBoost — Totally Corrective Boosting
4.1 What Does "Totally Corrective" Mean?
"Totally corrective" refers to the re-optimization of all classifier weights after each round — as opposed to the "one step then never look back" approach of AdaBoost.
In AdaBoost:
- Round t adds classifier hₜ with weight αₜ computed from the current round's error
- Weights α₁, ..., αₜ₋₁ from previous rounds are never changed
- This is an approximation — the optimal weight for h₁ should change as new classifiers are added
In TotalBoost (and LPBoost):
- After adding each new hₜ, all weights λ₁, ..., λₜ are jointly re-optimized
- The ensemble improves at every step not just by adding new information, but by redistributing weight among all classifiers
4.2 The TotalBoost Objective
TotalBoost maximizes the minimum margin with an entropy regularization term:
maximize ρ − (1/β) · KL(λ || u)
subject to:
yᵢ · Σₜ λₜ hₜ(xᵢ) ≥ ρ for all i
Σₜ λₜ = 1, λₜ ≥ 0
Where:
KL(λ || u) = Σₜ λₜ log(λₜ / uₜ)is the KL divergence from λ to a uniform prior u = (1/T, ..., 1/T)β > 0is an inverse temperature parameter controlling the strength of regularization
The entropy term's effect:
No regularization (β → ∞): LPBoost — pure margin maximization → sparse solution
Heavy regularization (β → 0): Maximum entropy → all classifiers get equal weight (uniform)
The KL term penalizes solutions that concentrate weight on few classifiers, encouraging diverse weight distributions. This is analogous to L2 regularization (SVM's regularization) but in the space of ensemble weights rather than kernel weights.
4.3 The TotalBoost Algorithm
Input: Training data {(xᵢ, yᵢ)}, weak learner WL
Regularization β, tolerance ε
Initialize:
λ = {} (empty ensemble)
d = 1/m (uniform sample weights)
Repeat:
1. Find new weak classifier:
hₙₑₓₜ = WL({(xᵢ, yᵢ, dᵢ)})
2. Add to ensemble:
λ ← λ ∪ {λₙₑₓₜ with initial weight 0}
3. Re-optimize ALL weights jointly:
Solve: max_{λ,ρ} ρ − (1/β) KL(λ || u)
s.t. yᵢ Σₜ λₜ hₜ(xᵢ) ≥ ρ, Σλₜ=1, λₜ≥0
(This is a convex optimization with closed-form dual)
4. Update sample weights from dual:
dᵢ ∝ exp(−β · yᵢ · Σₜ λₜ* hₜ(xᵢ)) (Gibbs distribution)
5. Check convergence:
If edge of hₙₑₓₜ ≤ ρ* + ε: stop
Output: H(x) = sign(Σₜ λₜ* hₜ(x))
4.4 Entropy Regularization
The key mathematical insight of TotalBoost: when entropy regularization is added to the LPBoost objective, the dual problem has a beautiful closed form.
The Lagrangian of the primal gives dual sample weights:
dᵢ* ∝ exp(−β · yᵢ · Σₜ λₜ hₜ(xᵢ))
= exp(−β · margin(xᵢ, yᵢ))
This is a Gibbs/Boltzmann distribution over training samples, with temperature 1/β. Samples with large positive margin get exponentially down-weighted; samples near zero margin get up-weighted.
The connection to AdaBoost: When β → ∞ (no regularization) and using the forward stagewise approximation, this Gibbs distribution reduces to AdaBoost's exponential sample weights:
AdaBoost weights: wᵢ ∝ exp(−yᵢ · Σₜ αₜ hₜ(xᵢ))
TotalBoost weights: dᵢ ∝ exp(−β · yᵢ · Σₜ λₜ hₜ(xᵢ))
They are identical in form — the only difference is that TotalBoost uses globally optimal λₜ while AdaBoost uses greedy forward stagewise αₜ. AdaBoost is a greedy approximation to TotalBoost's exact optimization.
5. How AdaBoost Relates to Both
The relationship forms a clean hierarchy:
AdaBoost
├── Forward stagewise approximation to max-margin
├── Greedy: adds one classifier, fixes its weight, never re-optimizes
└── Implicitly maximizes margins (Schapire et al., 1998) — but not exactly
LPBoost
├── Exact LP formulation of the max-margin problem
├── Totally corrective: re-optimizes ALL weights after each round
└── Uses column generation — same weak learner oracle as AdaBoost
TotalBoost
├── Entropy-regularized max-margin (smoothed LPBoost)
├── Totally corrective: re-optimizes ALL weights
├── Dual sample weights are Gibbs distribution — same form as AdaBoost
└── AdaBoost = TotalBoost with greedy approximation + β → ∞
The progression: AdaBoost → TotalBoost → LPBoost represents increasing exactness of the optimization, at the cost of increasing computational complexity per round.
6. Column Generation — The Shared Mechanism
Both LPBoost and TotalBoost use column generation — a technique from combinatorial optimization for solving LP/convex programs with exponentially many variables:
Core idea:
Instead of including all N possible classifiers in the LP from the start
(which would be computationally intractable),
solve the problem iteratively by adding one column (classifier) at a time.
At each iteration:
1. Solve the current restricted problem (LP over active classifiers only)
2. Check if any inactive column (classifier) can improve the objective
→ The "pricing problem" = train a weak learner on current dual weights
3. If yes, add that classifier and re-solve
4. If no improving column exists, the current solution is optimal
Optimality certificate:
When no weak learner achieves edge > ρ* + ε,
we have a certificate that the LP objective cannot be improved
→ Global optimality (for the current hypothesis class)
Column generation is exact — it provably finds the global optimum (subject to the hypothesis class and numerical precision). This is impossible with AdaBoost's greedy approach.
Practical implication: Column generation may require far fewer classifiers than AdaBoost to achieve the same margin. LPBoost solutions are often sparse — a small number of classifiers with non-zero weight captures the optimal ensemble. AdaBoost needs many more rounds to converge to the same margin.
7. Convergence Properties
LPBoost
Under mild conditions (the weak learner oracle always finds a classifier with edge ε above the current margin), LPBoost converges to the maximum-margin solution in:
T ≤ O(1/ε² · log m) rounds
This is the same asymptotic bound as AdaBoost. However, LPBoost's totally corrective updates make the constant much smaller in practice — it typically uses 2–10× fewer rounds than AdaBoost for the same margin.
After convergence, the LP duality gap is zero — LPBoost has the optimal solution for the max-margin ensemble problem.
TotalBoost
TotalBoost converges to the entropy-regularized max-margin solution. The convergence rate depends on β:
For fixed β: Converges to KL-regularized maximum margin in O(log(m) / ε²) rounds
β → ∞: Converges to hard-margin LPBoost solution (but slower)
The entropy regularization makes TotalBoost's optimization strictly convex — guaranteeing unique solutions and smooth convergence, unlike the LP which may have non-unique optima.
8. Geometric Interpretation
LPBoost Geometry
The ensemble weights λ lie in the probability simplex Δ^N. The margin constraint defines a polyhedral feasible region. LPBoost finds the point in this polytope that maximizes the minimum margin:
Max-margin ensemble ↔ Chebyshev center of the constraint polytope
The LP solution is at a vertex of the feasible polytope — a sparse point where only a few classifiers have non-zero weight. This explains LPBoost's tendency to produce sparse ensembles.
TotalBoost Geometry
Adding the KL regularization transforms the feasible region from a polytope to a curved manifold. The optimal solution is in the interior of the simplex (not at a vertex) — all classifiers get some non-zero weight. The entropy term smoothly interpolates between the LP solution (sparse) and the uniform distribution (maximum entropy).
9. The Bias-Variance Profile
Both algorithms directly maximize margin, which is the primary driver of generalization in boosting:
| Configuration | Bias | Variance | Notes |
|---|---|---|---|
| LPBoost, hard margin | Low | Medium | Exact max-margin, may overfit noise |
| LPBoost, soft margin (ν) | Low | Low | ν controls bias-variance trade-off |
| TotalBoost, large β | Low | Medium | Approaches LPBoost |
| TotalBoost, small β | Medium | Low | Heavy entropy reg. → smoother model |
Generalization bound (margin theory):
For an ensemble achieving minimum margin ρ on training data:
P(test error) ≤ O(1 / (m · ρ²) + VC_dim · log(m) / m)
LPBoost and TotalBoost maximize ρ explicitly — they are the most principled way to achieve the smallest generalization bound among boosting algorithms.
10. Assumptions
| Assumption | LPBoost | TotalBoost |
|---|---|---|
| Weak learner consistency | Must find improving column | Must produce useful proba. |
| Linear separability (hard margin) | Required for convergence | Not required (entropy reg.) |
| IID samples | ✅ Standard | ✅ Standard |
| Finite hypothesis class | Assumed (finite N) or oracle-based | Same |
| No feature scaling | ✅ Tree-based | ✅ Tree-based |
| Clean labels (hard margin) | ✅ Very sensitive | ⚠️ Better (soft margin) |
11. Advantages
✅ Provable Optimality (LPBoost)
LPBoost produces a certified optimal solution for the maximum-margin ensemble — the best possible generalization bound for the given hypothesis class. No other boosting algorithm achieves this.
✅ Sparse Ensembles
LP solutions naturally produce sparse weight vectors — many classifiers get zero weight. Sparse ensembles are faster at prediction and easier to interpret.
✅ Totally Corrective Updates
Re-optimizing all weights after each round is theoretically superior to greedy forward stagewise. Fewer rounds needed, and the final model is globally optimized.
✅ Principled Regularization (TotalBoost)
The entropy regularization parameter β provides clean, theoretically grounded control of model complexity — analogous to SVM's C parameter but for ensembles.
✅ Connection to SVM Theory
Both algorithms are intimately connected to large-margin classification theory — the same theoretical tools that explain SVM generalization apply here.
✅ AdaBoost Connection (TotalBoost)
TotalBoost formally explains AdaBoost as a greedy approximation to an exact optimization — providing theoretical grounding for why AdaBoost works.
12. Drawbacks & Limitations
❌ Computationally Expensive Re-Optimization
The "totally corrective" re-optimization at each round requires solving an LP (LPBoost) or a convex program (TotalBoost) over all active classifiers. For T rounds and an LP with T variables, this is O(T²) LP solves — much more expensive than AdaBoost's O(T) operations.
❌ No Major Production Implementations
There is no widely maintained Python/sklearn implementation of LPBoost or TotalBoost. Research implementations exist but are not production-quality.
❌ LP Solver Dependency
LPBoost requires an LP solver (e.g., CVXPY, scipy.optimize.linprog, Gurobi). Adding a solver dependency to a production ML pipeline is often impractical.
❌ Hard Margin Fails on Noisy Data
The hard-margin LP is undefined when training data is not linearly separable by the ensemble — the LP is infeasible. Soft margin LPBoost addresses this but adds another hyperparameter (ν).
❌ Not Clearly Better in Practice Than Tuned AdaBoost
Despite provable optimality on the training margin, LPBoost and TotalBoost do not consistently outperform well-tuned AdaBoost or gradient boosting on benchmark datasets. The training margin optimality doesn't always translate to better test performance.
❌ Convergence Can Stall
Column generation may cycle or converge slowly if the LP oracle produces classifiers with very small improving edges. Careful ε thresholding is required.
13. LPBoost vs. TotalBoost vs. AdaBoost vs. GBT
| Property | LPBoost | TotalBoost | AdaBoost | GBT |
|---|---|---|---|---|
| Optimization | Exact LP | Exact convex | Greedy stagewise | Greedy stagewise |
| Weight re-optimization | ✅ All rounds | ✅ All rounds | ❌ Never | ❌ Never |
| Margin maximization | ✅ Exact | ✅ Exact (reg.) | ⚠️ Implicit | ❌ Different objective |
| Loss function | 0-1 margin | Entropy-margin | Exponential | Any differentiable |
| Sparsity | ✅ Natural (LP vertex) | ⚠️ Moderate | ❌ Uses all T | ❌ Uses all T |
| Noise robustness | ⚠️ Hard margin sensitive | ✅ Better (ent. reg.) | ❌ Poor | ✅ Good (log-loss) |
| Computational cost/round | ❌ LP solve | ❌ Convex program | ✅ O(m·p) | ✅ O(m·p) |
| Production implementations | ❌ Research only | ❌ Research only | ✅ sklearn | ✅✅ XGBoost/LGB |
| Theory (generalization) | ✅ Best margin bound | ✅ Best margin bound | ⚠️ Good | ⚠️ Loss-based |
| Practical accuracy | ⚠️ Competitive | ⚠️ Competitive | ✅ Good | ✅✅ Best tabular |
14. Practical Guidance
Implementing LPBoost with scipy
import numpy as np
from scipy.optimize import linprog
from sklearn.tree import DecisionTreeClassifier
class LPBoost:
def __init__(self, T=50, nu=0.1, epsilon=1e-4, max_depth=1):
self.T = T
self.nu = nu
self.epsilon = epsilon
self.max_depth = max_depth
self.classifiers = []
self.weights = []
def fit(self, X, y):
m = len(y)
d = np.ones(m) / m # Initial uniform weights
self.classifiers = []
self.weights = []
for t in range(self.T):
# Step 1: Train weak learner on weighted samples
clf = DecisionTreeClassifier(max_depth=self.max_depth)
clf.fit(X, y, sample_weight=d)
preds = clf.predict(X) # {-1, +1} or {0, 1}
self.classifiers.append(clf)
# Step 2: Compute margin matrix column for this classifier
if t == 0:
H = (y * preds).reshape(m, 1)
else:
H = np.column_stack([H, y * preds])
# Step 3: Solve LP using linprog
# Soft margin LP:
# max rho - (1/(nu*m)) * sum(xi)
# <=> min -rho + (1/(nu*m)) * sum(xi)
# Variables: [lambda (T), rho (1), xi (m)]
T_curr = H.shape[1]
n_vars = T_curr + 1 + m
c = np.zeros(n_vars)
c[T_curr] = -1.0 # -rho (maximize rho)
c[T_curr+1:] = 1.0/(self.nu * m) # slack penalty
# Constraint: H @ lambda >= rho - xi
# => -H @ lambda + rho*1 - xi <= 0
A_ub = np.zeros((m, n_vars))
A_ub[:, :T_curr] = -H
A_ub[:, T_curr] = 1.0
A_ub[:, T_curr+1:] = -np.eye(m)
b_ub = np.zeros(m)
# Equality: sum(lambda) = 1
A_eq = np.zeros((1, n_vars))
A_eq[0, :T_curr] = 1.0
b_eq = [1.0]
# Bounds: lambda >= 0, rho free, xi >= 0
bounds = [(0, None)] * T_curr + [(None, None)] + [(0, None)] * m
result = linprog(c, A_ub=A_ub, b_ub=b_ub,
A_eq=A_eq, b_eq=b_eq,
bounds=bounds, method='highs')
if not result.success:
break
lambda_opt = result.x[:T_curr]
rho = result.x[T_curr]
# Step 4: Compute dual weights (sample weights for next round)
# Dual of LP gives updated d
# Approximation: use margin-based reweighting
margins = H @ lambda_opt
d = np.maximum(rho - margins, 0)
d = d / d.sum() if d.sum() > 0 else np.ones(m)/m
# Check convergence: edge of new classifier
edge = np.dot(d, y * preds)
if edge <= rho + self.epsilon:
break
self.weights = lambda_opt
self.rho = rho
return self
def predict(self, X):
preds = np.zeros(len(X))
for clf, w in zip(self.classifiers, self.weights):
preds += w * clf.predict(X)
return np.sign(preds)
# Usage
clf = LPBoost(T=50, nu=0.1)
clf.fit(X_train, y_train)
print(f"Number of non-zero classifiers: {(clf.weights > 1e-6).sum()}")
Implementing TotalBoost (Entropy Regularization)
import numpy as np
from sklearn.tree import DecisionTreeClassifier
class TotalBoost:
def __init__(self, T=100, beta=10.0, epsilon=1e-5, max_depth=1):
self.T = T
self.beta = beta
self.epsilon = epsilon
self.max_depth = max_depth
self.classifiers = []
self.weights = []
def fit(self, X, y):
m = len(y)
d = np.ones(m) / m
self.classifiers = []
H = np.zeros((m, 0)) # Margin matrix
for t in range(self.T):
# Train weak learner
clf = DecisionTreeClassifier(max_depth=self.max_depth)
clf.fit(X, y, sample_weight=d)
preds = clf.predict(X)
self.classifiers.append(clf)
# Add column to H
H = np.column_stack([H, y * preds]) if H.size else (y * preds).reshape(-1, 1)
# Entropy-regularized optimization (approximate):
# Update lambda proportional to exp(-beta * margin)
# This is the TotalBoost update in closed-form approximation
T_curr = H.shape[1]
lambda_curr = np.ones(T_curr) / T_curr # Start uniform
# Iterative projection (Sinkhorn-like)
for _ in range(50):
margins = H @ lambda_curr
d_new = np.exp(-self.beta * margins)
d_new /= d_new.sum()
# Recompute lambda via dual relationship
# lambda_t proportional to sum_i d_i * H_{it}
scores = H.T @ d_new
lambda_new = np.maximum(scores, 0)
if lambda_new.sum() > 0:
lambda_new /= lambda_new.sum()
else:
lambda_new = np.ones(T_curr) / T_curr
if np.max(np.abs(lambda_new - lambda_curr)) < 1e-6:
break
lambda_curr = lambda_new
d = d_new
self.weights = lambda_curr
# Check convergence
edge = H[:, -1] @ d
rho = np.min(H @ self.weights)
if edge <= rho + self.epsilon:
break
return self
def predict(self, X):
preds = np.zeros(len(X))
for clf, w in zip(self.classifiers, self.weights):
preds += w * clf.predict(X)
return np.sign(preds)
clf = TotalBoost(T=100, beta=10.0)
clf.fit(X_train, y_train)
Using CVXPY for Cleaner LPBoost
import cvxpy as cp
import numpy as np
def solve_lpboost(H, nu):
"""
H: (m, T) matrix — H[i,t] = y_i * h_t(x_i)
nu: soft margin parameter
Returns: (lambda, rho)
"""
m, T = H.shape
lam = cp.Variable(T, nonneg=True)
rho = cp.Variable()
xi = cp.Variable(m, nonneg=True)
objective = cp.Maximize(rho - (1/(nu*m)) * cp.sum(xi))
constraints = [
H @ lam >= rho - xi, # Margin constraints with slack
cp.sum(lam) == 1, # Simplex constraint
]
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.ECOS)
return lam.value, rho.value
15. When to Use Them
Use LPBoost when:
- You need a provably optimal ensemble for the max-margin problem
- Sparse ensembles are required for fast inference or interpretability
- You have access to an LP solver (scipy, CVXPY, Gurobi)
- Dataset is small to medium (LP re-solve cost is O(T²))
- You are doing research on boosting theory or margin maximization
- You want to understand the theoretical optimality of boosting methods
Use TotalBoost when:
- You want the AdaBoost connection made explicit and precise
- Entropy regularization provides better generalization than hard-margin LP
- You need a smoother optimization landscape than LP
- You are studying the theoretical relationship between AdaBoost and max-margin
Use AdaBoost instead for:
- Production binary classification without an LP solver dependency
- Fast training — O(m·p) per round vs. LPBoost's LP solve
Use Gradient Boosting (XGBoost/LightGBM) instead for:
- Maximum accuracy on tabular data
- Large datasets where LP re-solve is infeasible
- Regression or multi-class problems at scale
Summary
┌──────────────────────────────────────────────────────────────────────┐
│ LPBOOST & TOTALBOOST AT A GLANCE │
├──────────────────────────────────────────────────────────────────────┤
│ SHARED GOAL Maximize minimum margin over training data │
│ SHARED MECH Column generation = train weak learner on d* │
│ SHARED PROP Totally corrective: re-optimize ALL weights/round │
│ │
│ LPBOOST Exact LP: max ρ s.t. H·λ ≥ ρ, Σλ=1, λ≥0 │
│ LP RESULT Sparse weights; zero duality gap; certified optimal │
│ SOFT MARGIN + ξ slack + ν parameter for noise tolerance │
│ │
│ TOTALBOOST Entropy-reg: max ρ − (1/β)·KL(λ||u) │
│ GIBBS WEIGHTS dᵢ ∝ exp(−β · margin_i) [Boltzmann distribution] │
│ ADABOOST LINK AdaBoost = TotalBoost with greedy approx + β→∞ │
│ │
│ STRENGTH Optimality, sparsity, theory, margin certificate │
│ WEAKNESS LP solver needed, slow, no production impl. │
│ BEST FOR Research, small data, sparse ensembles │
└──────────────────────────────────────────────────────────────────────┘
LPBoost and TotalBoost represent the mathematical completion of the boosting program. Where AdaBoost approximated, they optimize. Where AdaBoost guessed at weights, they solve for them. Where AdaBoost drifted toward a maximum-margin solution, they arrive there directly. The price is computational — every round requires solving a program instead of just training a stump — but the reward is a provable certificate: the ensemble you have is the best possible ensemble for the given hypothesis class. That guarantee is rare in machine learning, and for problems where theoretical rigor matters as much as empirical performance, LPBoost and TotalBoost offer something that neither AdaBoost nor gradient boosting can: proof.