Logistic Regression — Deep Analysis


1. What Is Logistic Regression?

Logistic Regression is a supervised machine learning algorithm used for classification tasks, despite the word "regression" in its name. The name is historical: it uses a regression-like linear combination of inputs, but passes the result through a non-linear function to produce a probability between 0 and 1.

It answers questions like:

Key identity: Logistic Regression is a discriminative, probabilistic, linear classifier.

Property            Value
Type                Supervised Learning
Task                Classification (binary or multiclass)
Output              Probability → class label
Decision surface    Linear
Parametric          Yes

2. The Core Problem It Solves

Linear regression predicts a continuous value, such as house prices. But classification needs a bounded output: specifically, a value between 0 and 1 that can be interpreted as a probability.

If you naively apply linear regression to a binary label (0 or 1):

ŷ = w₀ + w₁x₁ + w₂x₂ + ...

The output can be any real number (e.g., −4.7 or 13.2), which is meaningless as a probability. You also risk:

Logistic Regression solves this by squashing the linear output into the (0, 1) range using the sigmoid function.


3. Mathematical Foundation

3.1 The Sigmoid Function

The sigmoid (logistic) function is:

σ(z) = 1 / (1 + e^(−z))

Where z is the linear combination of inputs:

z = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
  = wᵀx  (in vector notation)

Properties of σ(z):

Input z     σ(z) value     Interpretation
z → +∞      σ(z) → 1       Confident positive class
z = 0       σ(z) = 0.5     Perfect uncertainty
z → −∞      σ(z) → 0       Confident negative class

The curve is S-shaped, continuous, and differentiable — critical properties for gradient-based optimization.

Derivative of sigmoid (needed for the gradient of the loss):

σ'(z) = σ(z) · (1 − σ(z))

This elegant self-referential derivative makes the math clean and efficient.
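The sigmoid and its derivative can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the function names and the branch on the sign of z (used to avoid overflow in exp) are choices made here, and the finite-difference helper exists only to sanity-check the analytic derivative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, branching on the sign of z to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) stays below 1
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(z: float, h: float = 1e-6) -> float:
    """Central finite difference, used only as a sanity check."""
    return (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

for z in (-3.0, 0.0, 2.5):
    # The last two columns should agree to several decimal places.
    print(z, sigmoid(z), sigmoid_grad(z), numeric_grad(z))
```

The derivative peaks at z = 0 with value 0.25 and shrinks toward zero in both tails, which is why very confident predictions change slowly under gradient updates.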


3.2 The Decision Boundary

The model outputs a probability. To make a class prediction, we apply a threshold (typically 0.5):

ŷ = 1  if  σ(wᵀx) ≥ 0.5
ŷ = 0  if  σ(wᵀx) < 0.5

Since σ(z) = 0.5 when z = 0, the decision boundary is where:

wᵀx = 0

This is a hyperplane in the feature space — a line in 2D, a plane in 3D, and so on. This is why logistic regression is a linear classifier: it can only separate classes with a straight line/plane.
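To see that thresholding the probability at 0.5 is the same as checking the sign of wᵀx, the short sketch below classifies a few points both ways. The weights and points are hypothetical values chosen for illustration, and x[0] = 1 is assumed to carry the bias w₀.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def predict_by_probability(w, x, threshold=0.5):
    """Class 1 when the predicted probability reaches the threshold."""
    return 1 if sigmoid(linear_score(w, x)) >= threshold else 0

def predict_by_sign(w, x):
    """Class 1 when the point lies on the non-negative side of the hyperplane."""
    return 1 if linear_score(w, x) >= 0 else 0

# Hypothetical weights [w0, w1, w2]; each point starts with 1.0 for the bias.
w = [-1.0, 2.0, -0.5]
points = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.5, 0.0], [1.0, 0.2, 3.0]]

for x in points:
    print(x[1:], predict_by_probability(w, x), predict_by_sign(w, x))
```

The equivalence holds only for the 0.5 threshold. A different threshold t shifts the boundary to wᵀx = log(t / (1 − t)), which is still a hyperplane parallel to the original one.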


3.3 Probability Interpretation

The full probabilistic model reads:

P(y=1 | x; w) = σ(wᵀx)
P(y=0 | x; w) = 1 − σ(wᵀx)

Compactly, for label y ∈ {0, 1}:

P(y | x; w) = σ(wᵀx)^y · (1 − σ(wᵀx))^(1−y)

This is the Bernoulli likelihood — the statistical engine that powers the loss function.
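The compact Bernoulli form can be checked numerically: the exponent y simply selects one of the two probabilities, and the negative log of the likelihood is the per-sample cross-entropy term. The function names below are illustrative, and p stands for σ(wᵀx).

```python
import math

def bernoulli_likelihood(p: float, y: int) -> float:
    """Compact form p^y * (1 - p)^(1 - y) for a label y in {0, 1}."""
    return (p ** y) * ((1.0 - p) ** (1 - y))

def cross_entropy_term(p: float, y: int) -> float:
    """Negative log-likelihood of a single sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

p = 0.8  # stands in for sigma(w^T x)
for y in (1, 0):
    likelihood = bernoulli_likelihood(p, y)
    # -log(likelihood) and the cross-entropy term should match.
    print(y, likelihood, -math.log(likelihood), cross_entropy_term(p, y))
```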


4. How It Works — Step by Step

Here's the full forward pass, walking through a concrete example.

Given: A patient's age (x₁ = 45) and cholesterol (x₂ = 230). Predict heart disease (1) or not (0).

Step 1 — Compute the linear score:

z = w₀ + w₁·45 + w₂·230
  = −8.5 + 0.07·45 + 0.03·230
  = −8.5 + 3.15 + 6.9
  = 1.55

Step 2 — Apply sigmoid:

σ(1.55) = 1 / (1 + e^(−1.55))
         = 1 / (1 + 0.212)
         ≈ 0.825

Step 3 — Interpret as probability:

P(heart disease | patient data) ≈ 82.5%

Step 4 — Apply threshold:

0.825 ≥ 0.5  →  Predict: Heart Disease = YES

The threshold is a hyperparameter you can tune (e.g., set to 0.3 for high-recall scenarios in medical diagnosis).
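The four steps above can be reproduced in a few lines. The weights and inputs are the ones from the worked example; the `predict` helper and the extra 0.9 threshold are added here only to show how the threshold changes the decision.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a hard label."""
    return 1 if probability >= threshold else 0

# Weights and inputs copied from the worked example.
w0, w_age, w_chol = -8.5, 0.07, 0.03
age, cholesterol = 45, 230

z = w0 + w_age * age + w_chol * cholesterol   # Step 1: linear score
p = sigmoid(z)                                # Step 2: sigmoid

print(f"z = {z:.2f}")     # z = 1.55
print(f"p = {p:.3f}")     # p = 0.825
print(predict(p, 0.5))    # 1
print(predict(p, 0.3))    # 1
print(predict(p, 0.9))    # 0
```

Lowering the threshold can only turn negatives into positives, which raises recall at the cost of precision; raising it does the opposite.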


5. How It Is Trained

5.1 Loss Function: Binary Cross-Entropy

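Mean Squared Error (MSE) is a poor fit for logistic regression: combined with the sigmoid it produces a non-convex loss surface, and its gradient shrinks toward zero when the model is confidently wrong, which slows learning.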

Instead, we use Binary Cross-Entropy Loss (also called Log Loss):

L(w) = −(1/m) Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]

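Where m is the number of training samples, yᵢ ∈ {0, 1} is the true label of sample i, and ŷᵢ = σ(wᵀxᵢ) is the predicted probability for that sample.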

Intuition of each term:

True label yᵢ    Active term       Behavior
1                −log(ŷᵢ)          Penalizes predicting low prob for a positive
0                −log(1 − ŷᵢ)      Penalizes predicting high prob for a negative

Why log? Taking the logarithm turns the product of per-sample probabilities in the Bernoulli likelihood into a sum, which avoids numerical underflow and is easier to differentiate. The resulting negative log-likelihood is convex in w, so the loss surface has no local minima other than the global one.

Cross-entropy per sample:

If y=1 and ŷ=0.9  →  Loss = −log(0.9) ≈ 0.105  ✅ Low penalty
If y=1 and ŷ=0.1  →  Loss = −log(0.1) ≈ 2.303  ❌ High penalty
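The per-sample numbers above, and the averaged loss L(w), can be computed with a small helper. This is a sketch: the function name and the `eps` clipping (which keeps log away from 0 when a probability is exactly 0 or 1) are assumptions made here, not part of the formula itself.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(round(log_loss([1], [0.9]), 3))   # 0.105
print(round(log_loss([1], [0.1]), 3))   # 2.303
print(log_loss([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]))
```

The penalty is asymmetric in an important way: it grows without bound as the predicted probability of the true class approaches zero, so a single confident mistake can dominate the average.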

5.2 Gradient Descent

Since there's no closed-form solution, we use iterative optimization. The gradient of the cross-entropy loss with respect to weights is:

∂L/∂w = (1/m) · Xᵀ · (ŷ − y)

Where (ŷ − y) is the vector of prediction errors. Notice this has the same form as linear regression's gradient — one of the beautiful symmetries of these models.
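One way to build confidence in the gradient formula is to compare it against finite differences of the loss. The sketch below does this in plain Python on a tiny hypothetical dataset; the data, the weights and the helper names are illustrative, and each row of X is assumed to start with 1 for the bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def loss(w, X, y):
    """Mean binary cross-entropy."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def gradient(w, X, y):
    """Analytic gradient (1/m) * X^T (y_hat - y), written with explicit loops."""
    m = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    return grad

def numeric_gradient(w, X, y, h=1e-6):
    """Central finite differences of the loss, one weight at a time."""
    grad = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grad.append((loss(up, X, y) - loss(down, X, y)) / (2.0 * h))
    return grad

# Hypothetical toy data; the first column is the constant 1 for the bias.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3]]
y = [1, 0, 1, 0]
w = [0.1, -0.4]

print(gradient(w, X, y))
print(numeric_gradient(w, X, y))
```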

Three Variants:

Variant                        Update Frequency         Pros                      Cons
Batch Gradient Descent         Once per full dataset    Stable, exact gradient    Slow on large datasets
Stochastic Gradient Descent    Once per sample          Fast, online learning     Noisy, oscillating loss
Mini-Batch GD                  Once per batch           Best of both worlds       Requires tuning batch size
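The three variants differ only in how many samples feed each update. A minimal sketch, assuming a hypothetical `minibatches` helper that shuffles the sample indices once per epoch: a batch size equal to the dataset size gives batch gradient descent, a batch size of 1 gives stochastic gradient descent, and anything in between is mini-batch.

```python
import random

def minibatches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, one list per weight update."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

m = 10  # hypothetical dataset size
rng = random.Random(0)
for name, batch_size in (("batch", m), ("stochastic", 1), ("mini-batch", 4)):
    updates = list(minibatches(m, batch_size, rng))
    print(f"{name}: {len(updates)} update(s) per epoch")
```

Each yielded index list would select the rows of X and y used for one gradient computation; the last mini-batch may be smaller than the others when the batch size does not divide the dataset size.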

5.3 The Update Rule

At each iteration, weights are updated:

w := w − α · ∂L/∂w

Where α (alpha) is the learning rate, a critical hyperparameter: too large and the updates overshoot or diverge; too small and training converges slowly.

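Full training loop (NumPy sketch; X is assumed to carry a leading column of ones for the bias w₀, and the default alpha and num_epochs are placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, num_epochs=1000):
    m, n = X.shape
    w = np.zeros(n)                            # initialize weights to zero
    for epoch in range(num_epochs):
        z = X @ w                              # linear combination
        y_hat = sigmoid(z)                     # apply sigmoid
        error = y_hat - y                      # prediction error
        gradient = X.T @ error / m             # gradient of the cross-entropy loss
        w = w - alpha * gradient               # update weights
        p = np.clip(y_hat, 1e-12, 1 - 1e-12)   # guard against log(0)
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # track loss
    return w, loss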

Convergence is detected when |L(wₜ₊₁) − L(wₜ)| < ε for some small tolerance ε.
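The stopping rule can be added to a plain gradient descent loop as follows. This is a dependency-free sketch on a small hypothetical dataset that is deliberately not linearly separable; the function names and the default values of `alpha`, `tol` and `max_epochs` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def fit(X, y, alpha=0.5, tol=1e-9, max_epochs=10_000):
    """Batch gradient descent that stops when the loss change drops below tol."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    losses = [mean_log_loss(w, X, y)]
    for _ in range(max_epochs):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n):
                grad[j] += err * xi[j] / m
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
        losses.append(mean_log_loss(w, X, y))
        if abs(losses[-2] - losses[-1]) < tol:   # |L(w_t+1) - L(w_t)| < eps
            break
    return w, losses

# Hypothetical data: first column is the bias term, labels overlap around x = 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

w, losses = fit(X, y)
print("weights:", w)
print("epochs run:", len(losses) - 1)
print("first and last loss:", losses[0], losses[-1])
```

With all-zero initial weights every prediction is 0.5, so the first recorded loss is log 2 ≈ 0.693; the loss then decreases from there.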

Advanced optimizers like Adam, RMSProp, and L-BFGS can dramatically speed up convergence compared to vanilla gradient descent.


6. Multiclass Extension

Standard logistic regression handles binary problems. For K > 2 classes, two strategies exist:

One-vs-Rest (OvR)

Train K separate binary classifiers, one per class against all others. Assign the class with the highest predicted probability.

Classifier 1: "Is it Class A?" (vs. B, C, D)
Classifier 2: "Is it Class B?" (vs. A, C, D)
...

Limitation: Probabilities from each classifier don't sum to 1 and can overlap.
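The limitation is easy to see with numbers. In the sketch below, the three linear scores are hypothetical outputs of three independently trained binary classifiers for a single input; each is passed through its own sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores w_k^T x from three one-vs-rest classifiers.
scores = {"A": 1.2, "B": 0.4, "C": -2.0}

probs = {label: sigmoid(z) for label, z in scores.items()}
total = sum(probs.values())
predicted = max(probs, key=probs.get)   # class with the highest probability

print(probs)
print("sum of probabilities:", total)
print("predicted class:", predicted)
```

Here two classifiers each report more than 50% for their own class, which is the overlap the text refers to. The argmax still gives a usable prediction, but the raw values should not be read as a distribution over classes.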

Softmax (Multinomial Logistic Regression)

A natural generalization using the softmax function:

P(y=k | x) = e^(wₖᵀx) / Σⱼ e^(wⱼᵀx)

All class probabilities sum to exactly 1. This is the principled approach and is used in neural networks' output layers.
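A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here; it does not change the result because the common factor cancels in the ratio. The example scores are hypothetical.

```python
import math

def softmax(scores):
    """Softmax over a list of linear scores w_k^T x."""
    shift = max(scores)                          # stability shift
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([1.2, 0.4, -2.0])   # hypothetical scores for classes A, B, C
print(probs, sum(probs))

# With two classes and the second score fixed at 0, softmax reduces to the sigmoid.
print(softmax([0.7, 0.0])[0], sigmoid(0.7))
```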


7. Regularization

Logistic regression is prone to overfitting, especially with many features or correlated predictors. Regularization adds a penalty term to the loss:

L2 Regularization (Ridge)

L_reg(w) = L(w) + λ · Σ wⱼ²

L1 Regularization (Lasso)

L_reg(w) = L(w) + λ · Σ |wⱼ|

Elastic Net

L_reg(w) = L(w) + λ₁ · Σ|wⱼ| + λ₂ · Σwⱼ²

Combines both — the best of L1 and L2.
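The three penalty terms translate directly into code. One assumption in this sketch is that the bias w₀ (stored at index 0) is left out of the penalty, which is common practice but is not stated by the formulas above; the weight vector and strengths are hypothetical.

```python
def l1_penalty(w, lam):
    """lam * sum |w_j|, skipping the bias at index 0 (an assumption of this sketch)."""
    return lam * sum(abs(wj) for wj in w[1:])

def l2_penalty(w, lam):
    """lam * sum w_j^2, skipping the bias at index 0 (an assumption of this sketch)."""
    return lam * sum(wj * wj for wj in w[1:])

def elastic_net_penalty(w, lam1, lam2):
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = [0.5, -2.0, 0.0, 3.0]   # hypothetical weights, w[0] is the bias
print(l1_penalty(w, 0.1))
print(l2_penalty(w, 0.1))
print(elastic_net_penalty(w, 0.1, 0.1))
```

The L2 term punishes the large weight (3.0) much harder than the L1 term does, while L1 applies the same marginal cost to every non-zero weight, which is what tends to push small weights to exactly zero.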

λ is the regularization strength: larger values shrink the weights more aggressively, and λ = 0 recovers the unregularized loss.


8. Assumptions of the Model

Logistic regression makes several assumptions that, when violated, degrade performance:

Assumption                            Description
Linear decision boundary              Features and log-odds must be linearly related
Independence of observations          Samples should not be correlated (e.g., time series violates this)
Little or no multicollinearity        Highly correlated features inflate variance and destabilize coefficients
Large sample size                     MLE is asymptotically consistent; small samples give unreliable estimates
No extreme outliers                   Outliers can disproportionately influence the decision boundary
Binary (or ordinal) dependent variable    Assumes the output is categorical, not continuous

9. Evaluation Metrics

Accuracy alone is insufficient — especially with imbalanced classes.

Confusion Matrix

              Predicted Positive   Predicted Negative
Actual Positive     TP                   FN
Actual Negative     FP                   TN

Derived Metrics

Accuracy    = (TP + TN) / (TP + TN + FP + FN)
Precision   = TP / (TP + FP)               # Of all predicted positives, how many were correct?
Recall      = TP / (TP + FN)               # Of all actual positives, how many did we catch?
F1-Score    = 2 · (Precision · Recall) / (Precision + Recall)
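The derived metrics follow mechanically from the four confusion-matrix counts. The counts in this sketch are hypothetical, and the helper returns 0.0 when a denominator is zero, which is a convention chosen here rather than a universal rule.

```python
def safe_div(num, den):
    """Return 0.0 instead of raising when the denominator is zero."""
    return num / den if den else 0.0

def classification_metrics(tp, fn, fp, tn):
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 samples.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```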

ROC-AUC

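The Receiver Operating Characteristic curve plots True Positive Rate vs. False Positive Rate at all thresholds. The Area Under the Curve (AUC) summarizes the curve in one number: 1.0 means every positive is ranked above every negative, and 0.5 is no better than random ranking.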
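AUC has an equivalent ranking interpretation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it directly from that definition. The pairwise loop is quadratic and meant only for illustration; the labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9]))    # 1.0
print(roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))    # 0.5
```

Because only the ordering of scores matters, AUC is unaffected by the choice of decision threshold.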

Log Loss

Measures the quality of probability estimates (not just the hard class labels):

Log Loss = −(1/m) Σ [ yᵢ·log(p̂ᵢ) + (1−yᵢ)·log(1−p̂ᵢ) ]

Lower is better. A model with log loss = 0 is perfect.


10. Advantages

✅ Probabilistic Output

Returns probabilities, not just labels, and these are often reasonably well calibrated. This is critical in medicine, finance, and risk assessment, where the model's confidence matters.

✅ Highly Interpretable

Coefficients directly encode the relationship between features and the log-odds of the outcome:

log(P/(1−P)) = w₀ + w₁x₁ + ...

Each wⱼ tells you: "A one-unit increase in xⱼ multiplies the odds by e^(wⱼ), holding the other features fixed."
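The odds interpretation can be verified numerically. In this sketch the intercept and coefficient are hypothetical; the check is that raising x by one unit multiplies the odds by e^(w₁), whatever the starting value of x.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

w0, w1 = -1.0, 0.7   # hypothetical intercept and coefficient

for x in (0.0, 2.0, 5.0):
    ratio = odds(sigmoid(w0 + w1 * (x + 1.0))) / odds(sigmoid(w0 + w1 * x))
    # The ratio should equal exp(w1) at every x.
    print(x, ratio, math.exp(w1))
```

Note that the multiplicative effect is constant on the odds scale, not on the probability scale: the same coefficient moves the probability a lot near 0.5 and very little near 0 or 1.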

✅ No Distributional Assumption on Features

Unlike Linear Discriminant Analysis (LDA), logistic regression does not assume features are normally distributed.

✅ Computationally Efficient

Training is fast even on large datasets. Convexity guarantees convergence to the global optimum.

✅ Robust to Small Datasets

Performs surprisingly well with limited data, especially compared to complex models.

✅ Excellent Baseline

Always use logistic regression as a baseline before trying complex models. If a neural network only marginally beats it, the added complexity may not be worth it.

✅ Handles Regularization Naturally

L1/L2 penalties are trivially added and well-studied.

✅ Scales Well

Works well with stochastic gradient descent on very large datasets (online learning).


11. Drawbacks & Limitations

❌ Linearity Constraint

The most fundamental limitation. Logistic regression cannot learn non-linear decision boundaries without explicit feature engineering (polynomial features, interaction terms, etc.).

If your data looks like two concentric rings, logistic regression will fail — it can only draw a straight line.

❌ Feature Engineering Required

To capture complex patterns, you must manually create new features (e.g., x₁², x₁·x₂). This requires domain expertise and is labor-intensive.

❌ Poor with Many Irrelevant Features

Performance degrades without feature selection or strong regularization when many features are noise.

❌ Multicollinearity Sensitivity

Highly correlated features cause coefficients to become unstable and hard to interpret.

❌ Class Imbalance Sensitivity

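With severely imbalanced classes (e.g., 99% negative, 1% positive), the model may simply predict "negative" for everything and still achieve 99% accuracy. Common remedies include class weighting, resampling, a lower decision threshold, and evaluating with precision, recall, and F1 instead of accuracy.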
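The accuracy trap is easy to reproduce. The sketch below builds a hypothetical dataset of 1,000 labels with 1% positives and scores a "model" that always predicts the negative class.

```python
# Hypothetical labels: 10 positives and 990 negatives.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)          # always predict "negative"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print("accuracy:", accuracy)   # 0.99
print("recall:", recall)       # 0.0
```

Accuracy looks excellent while recall shows that not a single positive case was caught, which is why recall, precision, and F1 are the more informative numbers here.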

❌ Not Ideal for Complex Relationships

In domains with rich, high-dimensional, non-linear interactions (images, text, audio), deep learning will vastly outperform logistic regression.

❌ Assumes No Perfect Separation

If classes are perfectly linearly separable, Maximum Likelihood Estimation fails to converge (coefficients go to ±∞). This is called complete separation and requires regularization to fix.
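Complete separation can be observed on a four-point toy problem. The sketch below fits a single weight (no bias) by gradient descent on hypothetical data that is perfectly separable at x = 0, once without a penalty and once with an L2 term λ·w². The function name, learning rate and λ are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_1d(xs, ys, lam, epochs, alpha=0.5):
    """Gradient descent on mean cross-entropy + lam * w^2 for one weight w."""
    w = 0.0
    m = len(xs)
    for _ in range(epochs):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / m
        grad += 2.0 * lam * w          # gradient of the L2 penalty
        w -= alpha * grad
    return w

# Hypothetical, perfectly separable data: negatives left of 0, positives right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for epochs in (500, 5000):
    print(epochs, "unregularized:", fit_1d(xs, ys, 0.0, epochs),
          "L2 (lam=0.1):", fit_1d(xs, ys, 0.1, epochs))
```

Without the penalty the weight keeps growing the longer training runs, because a larger weight always pushes the predicted probabilities closer to the labels. With the penalty the weight settles at a finite value and additional epochs change nothing.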


12. Logistic Regression vs. Linear Regression

Property          Linear Regression                 Logistic Regression
Task              Regression                        Classification
Output            Unbounded real value              Probability in (0, 1)
Activation        None (identity)                   Sigmoid
Loss Function     MSE                               Binary Cross-Entropy
Solution          Closed-form or GD                 Gradient Descent (no closed form)
Interpretation    Direct effect on output           Effect on log-odds
Assumptions       Linearity, normality of errors    Linear log-odds, independence

13. Logistic Regression vs. Other Classifiers

Classifier Non-linear Probabilistic Interpretable Scalable Overfitting Risk
Logistic Regression Low
SVM ✅ (kernel) ❌ (margins) ⚠️ ⚠️ Medium
Decision Tree ⚠️ High
Random Forest Low
Naive Bayes Low
Neural Network Very High
k-NN ⚠️ ⚠️ Medium

Note: Logistic Regression is often the sweet spot of interpretability + performance for problems whose classes are roughly separable by a linear boundary.


14. Practical Tips & Gotchas

Feature Scaling Is Essential

Logistic regression with a regularization penalty is not scale-invariant. Standardize features before fitting:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Without scaling, the penalty shrinks coefficients unevenly depending on each feature's units, and gradient descent converges slowly.

Handling Imbalanced Data

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')

Or use class_weight={0: 1, 1: 10} to manually upweight the minority class.
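For intuition about what 'balanced' does, the scikit-learn documentation gives the weight of each class as n_samples / (n_classes * count of that class). The sketch below reimplements that formula in plain Python on hypothetical labels; `balanced_class_weights` is a name invented here, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 990 + [1] * 10      # hypothetical 99% / 1% split
weights = balanced_class_weights(labels)
print(weights)
```

Each minority sample ends up counting roughly 99 times as much as a majority sample in the loss, so both classes contribute equally in total.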

Adding Non-linearity

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)

This creates features like x₁², x₁·x₂, enabling curved decision boundaries.
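The concentric-rings case shows why squared features help. In the sketch below, points on two hypothetical rings (radius 1 and radius 3) become linearly separable once each point is mapped to [1, x₁², x₂²]. The weights are hand-picked for illustration rather than learned, and they place the boundary at x₁² + x₂² = 4.

```python
import math

def ring(radius, n_points=8):
    """Evenly spaced points on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n_points),
             radius * math.sin(2 * math.pi * k / n_points))
            for k in range(n_points)]

def expand(x1, x2):
    """Feature map [1, x1^2, x2^2]; the leading 1 carries the bias."""
    return [1.0, x1 * x1, x2 * x2]

w = [-4.0, 1.0, 1.0]   # hand-picked: linear in the expanded features

def predict(point):
    z = sum(wj * fj for wj, fj in zip(w, expand(*point)))
    return 1 if z >= 0 else 0

inner = ring(1.0)   # class 0
outer = ring(3.0)   # class 1

print([predict(p) for p in inner])
print([predict(p) for p in outer])
```

The model is still linear in its inputs; the curvature comes entirely from the feature map, so the boundary is a circle in the original coordinates and a flat plane in the expanded ones.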

Choosing the Regularization Hyperparameter

Use cross-validation over a grid:

from sklearn.model_selection import GridSearchCV
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), params, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)

Note: In sklearn, C = 1/λ; higher C means less regularization.

Watch for Complete Separation

Symptoms: extremely large coefficients and convergence warnings from the solver. Fix: strengthen the regularization. L2 is already sklearn's default penalty, so in practice this means lowering C.

Solver Selection (sklearn)

Solver       Best For
lbfgs        Small/medium datasets, L2
saga         Large datasets, L1 or Elastic Net
liblinear    Small datasets, L1
newton-cg    Large dense data, L2

15. When to Use It

Use Logistic Regression when:

Do NOT use Logistic Regression when:


Summary

┌─────────────────────────────────────────────────────────┐
│              LOGISTIC REGRESSION AT A GLANCE            │
├─────────────────────────────────────────────────────────┤
│  CORE IDEA     Linear model + sigmoid → probability     │
│  TRAINING      Minimize cross-entropy via gradient desc │
│  DECISION      Hyperplane: wᵀx = 0                      │
│  OUTPUT        P(y=1|x) ∈ (0, 1)                        │
│  STRENGTHS     Interpretable, fast, probabilistic       │
│  WEAKNESSES    Linear boundary, needs feature eng.      │
│  BEST FOR      Binary classification, baseline models   │
└─────────────────────────────────────────────────────────┘

Logistic Regression is not just a stepping stone to "real" ML — it is a production-grade algorithm used in credit scoring, medical diagnosis, ad-click prediction, and fraud detection at massive scale. Understanding it deeply means understanding the building blocks of neural networks, probabilistic modeling, and maximum likelihood estimation. Master this, and the rest of machine learning becomes clearer.


End of document. Total depth: fundamentals → math → training → tuning → evaluation → comparison → practical usage.
