Logistic Regression

Logistic Regression — Deep Analysis

1. What Is Logistic Regression?

Logistic Regression is a supervised machine learning algorithm used for classification tasks, despite the word "regression" in its name. The name is historical: it uses a regression-like linear combination of inputs, but passes the result through a non-linear function to produce a probability between 0 and 1.

It answers questions like:

Is this email spam or not spam?
Will this customer churn or stay?
Is this tumor malignant or benign?

Key identity: Logistic Regression is a discriminative, probabilistic, linear classifier.

Property	Value
Type	Supervised Learning
Task	Classification (binary or multiclass)
Output	Probability → class label
Decision surface	Linear
Parametric	Yes

2. The Core Problem It Solves

Linear regression predicts a continuous value, such as a house price. Classification instead needs a bounded output: a value between 0 and 1 that can be interpreted as a probability.

If you naively apply linear regression to a binary label (0 or 1):

ŷ = w₀ + w₁x₁ + w₂x₂ + ...

The output can be any real number (e.g., −4.7 or 13.2), which is meaningless as a probability. You also risk:

Predictions outside [0, 1]
A model that is highly sensitive to outliers
Violation of the core probabilistic interpretation

Logistic Regression solves this by squashing the linear output into the (0, 1) range using the sigmoid function.

3. Mathematical Foundation

3.1 The Sigmoid Function

The sigmoid (logistic) function is:

σ(z) = 1 / (1 + e^(−z))

Where z is the linear combination of inputs:

z = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
  = wᵀx  (in vector notation)

Properties of σ(z):

Input z	σ(z) value	Interpretation
z → +∞	σ(z) → 1	Confident positive class
z = 0	σ(z) = 0.5	Maximum uncertainty (both classes equally likely)
z → −∞	σ(z) → 0	Confident negative class

The curve is S-shaped, continuous, and differentiable — critical properties for gradient-based optimization.

Derivative of sigmoid (crucial for backprop):

σ'(z) = σ(z) · (1 − σ(z))

This elegant self-referential derivative makes the math clean and efficient.
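A minimal Python sketch of the two formulas above, using only the standard library. The function names (`sigmoid`, `sigmoid_grad`, `numeric_grad`) and the branch on the sign of z are illustrative choices rather than part of the original text. The finite-difference function is there only to check the closed-form derivative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(z: float, h: float = 1e-6) -> float:
    """Central finite difference, used only to sanity-check sigmoid_grad."""
    return (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

for z in (-3.0, 0.0, 2.5):
    print(z, sigmoid(z), sigmoid_grad(z), numeric_grad(z))
```

The derivative peaks at z = 0, where it equals 0.25, and shrinks toward zero as |z| grows. This is why very confident predictions produce small gradients.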

3.2 The Decision Boundary

The model outputs a probability. To make a class prediction, we apply a threshold (typically 0.5):

ŷ = 1  if  σ(wᵀx) ≥ 0.5
ŷ = 0  if  σ(wᵀx) < 0.5

Since σ(z) = 0.5 when z = 0, the decision boundary is where:

wᵀx = 0

This is a hyperplane in the feature space — a line in 2D, a plane in 3D, and so on. This is why logistic regression is a linear classifier: it can only separate classes with a straight line/plane.

3.3 Probability Interpretation

The full probabilistic model reads:

P(y=1 | x; w) = σ(wᵀx)
P(y=0 | x; w) = 1 − σ(wᵀx)

Compactly, for label y ∈ {0, 1}:

P(y | x; w) = σ(wᵀx)^y · (1 − σ(wᵀx))^(1−y)

This is the Bernoulli likelihood — the statistical engine that powers the loss function.

4. How It Works — Step by Step

Here's the full forward pass, walking through a concrete example.

Given: A patient's age (x₁ = 45) and cholesterol (x₂ = 230). Predict heart disease (1) or not (0).

Step 1 — Compute the linear score:

z = w₀ + w₁·45 + w₂·230
  = −8.5 + 0.07·45 + 0.03·230
  = −8.5 + 3.15 + 6.9
  = 1.55

Step 2 — Apply sigmoid:

σ(1.55) = 1 / (1 + e^(−1.55))
         = 1 / (1 + 0.212)
         ≈ 0.825

Step 3 — Interpret as probability:

P(heart disease | patient data) ≈ 82.5%

Step 4 — Apply threshold:

0.825 ≥ 0.5  →  Predict: Heart Disease = YES

The threshold is a hyperparameter you can tune (e.g., set to 0.3 for high-recall scenarios in medical diagnosis).
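The four steps above can be reproduced in a few lines of Python. This is a sketch that copies the weights and inputs from the worked example. The helper name `predict` and the stricter 0.9 threshold are hypothetical additions for illustration.

```python
import math

# Weights and patient values copied from the worked example.
w0, w1, w2 = -8.5, 0.07, 0.03
age, cholesterol = 45, 230

# Step 1: linear score.
z = w0 + w1 * age + w2 * cholesterol

# Steps 2 and 3: sigmoid turns the score into a probability.
p = 1.0 / (1.0 + math.exp(-z))

def predict(prob: float, threshold: float = 0.5) -> int:
    """Step 4: turn a probability into a hard class label."""
    return 1 if prob >= threshold else 0

print(round(z, 2), round(p, 3))           # 1.55 0.825
print(predict(p), predict(p, 0.3))        # positive at both 0.5 and 0.3
print(predict(p, threshold=0.9))          # a stricter threshold flips the label to 0
```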

5. How It Is Trained

5.1 Loss Function: Binary Cross-Entropy

Mean Squared Error (MSE) is a poor fit for logistic regression because:

The sigmoid makes the error surface non-convex under MSE
Gradient descent can get stuck in local minima
MSE doesn't align with the probabilistic interpretation

Instead, we use Binary Cross-Entropy Loss (also called Log Loss):

L(w) = −(1/m) Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]

Where:

m = number of training samples
yᵢ = true label for sample i
ŷᵢ = predicted probability for sample i

Intuition of each term:

True label yᵢ	Active term	Behavior
1	−log(ŷᵢ)	Penalizes predicting low prob for a positive
0	−log(1 − ŷᵢ)	Penalizes predicting high prob for a negative

Why log? Taking the logarithm of the Bernoulli likelihood turns a product of probabilities into a sum, which is far more stable numerically. The resulting negative log-likelihood is convex in w, so the loss surface has no spurious local minima for gradient descent to get stuck in.

Cross-entropy per sample:

If y=1 and ŷ=0.9  →  Loss = −log(0.9) ≈ 0.105  ✅ Low penalty
If y=1 and ŷ=0.1  →  Loss = −log(0.1) ≈ 2.303  ❌ High penalty
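A small standard-library sketch of the loss formula, which reproduces the two per-sample values above. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions added here so that log never receives exactly 0 or 1.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Average log loss over paired labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

print(round(binary_cross_entropy([1], [0.9]), 3))  # 0.105, low penalty
print(round(binary_cross_entropy([1], [0.1]), 3))  # 2.303, high penalty
print(binary_cross_entropy([1, 0], [0.9, 0.2]))    # mean over two samples
```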

5.2 Gradient Descent

Since there's no closed-form solution, we use iterative optimization. The gradient of the cross-entropy loss with respect to weights is:

∂L/∂w = (1/m) · Xᵀ · (ŷ − y)

Where (ŷ − y) is the vector of prediction errors. Notice this has the same form as linear regression's gradient — one of the beautiful symmetries of these models.
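For readers who want the intermediate steps, the gradient formula follows from the chain rule together with the sigmoid derivative from Section 3.1. The derivation below is a sketch in LaTeX notation. The per-sample loss symbol \ell_i is introduced here for convenience and does not appear in the original text.

```latex
\begin{aligned}
\ell_i &= -\bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr],
  \qquad \hat{y}_i = \sigma(z_i), \quad z_i = w^\top x_i \\
\frac{\partial \ell_i}{\partial \hat{y}_i}
  &= -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i}
   = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)} \\
\frac{\partial \hat{y}_i}{\partial z_i} &= \hat{y}_i (1 - \hat{y}_i),
  \qquad \frac{\partial z_i}{\partial w} = x_i \\
\frac{\partial \ell_i}{\partial w} &= (\hat{y}_i - y_i)\, x_i \\
\frac{\partial L}{\partial w} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i
   = \frac{1}{m} X^\top (\hat{y} - y)
\end{aligned}
```

The factor ŷᵢ(1 − ŷᵢ) from the sigmoid derivative cancels the denominator of the loss derivative, which is why the final expression is so compact.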

Three Variants:

Variant	Update Frequency	Pros	Cons
Batch Gradient Descent	Once per full dataset	Stable, exact gradient	Slow on large datasets
Stochastic Gradient Descent	Once per sample	Fast, online learning	Noisy, oscillating loss
Mini-Batch GD	Once per batch	Best of both worlds	Requires tuning batch size
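The three variants differ only in how many samples feed each weight update. The sketch below, using only the standard library, makes that concrete by generating the index batches for one epoch. The generator name `minibatch_indices` and its `seed` parameter are hypothetical, and the gradient step itself is left out.

```python
import random

def minibatch_indices(num_samples, batch_size, seed=0):
    """Yield lists of sample indices that together cover one shuffled epoch."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

# batch_size == num_samples -> batch GD (one update per epoch)
# batch_size == 1           -> stochastic GD (one update per sample)
# anything in between       -> mini-batch GD
for size in (10, 1, 4):
    batches = list(minibatch_indices(10, size))
    print(size, len(batches))  # 1, 10, and 3 updates per epoch respectively
```

In a training loop, each yielded list would select the rows of X and y used for one gradient computation and one weight update.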

5.3 The Update Rule

At each iteration, weights are updated:

w := w − α · ∂L/∂w

Where α (alpha) is the learning rate — a critical hyperparameter:

Too high: Overshoots the minimum, diverges
Too low: Converges too slowly
Just right: Smooth convergence to the global minimum

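Full training loop (a NumPy sketch of batch gradient descent; the default alpha and num_epochs are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, num_epochs=1000):
    m, n = X.shape                           # X has a leading column of ones for w₀
    w = np.zeros(n)                          # Initialize weights (or small random values)
    for epoch in range(num_epochs):
        z = X @ w                            # Linear combination
        y_hat = sigmoid(z)                   # Apply sigmoid
        error = y_hat - y                    # Compute error
        gradient = (1 / m) * X.T @ error     # Compute gradient
        w = w - alpha * gradient             # Update weights
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # Track loss (pre-update weights)
    return w, loss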

Convergence is detected when |L(wₜ₊₁) − L(wₜ)| < ε for some small tolerance ε.

Advanced optimizers like Adam, RMSProp, and L-BFGS can dramatically speed up convergence compared to vanilla gradient descent.

6. Multiclass Extension

Standard logistic regression handles binary problems. For K > 2 classes, two strategies exist:

One-vs-Rest (OvR)

Train K separate binary classifiers, one per class against all others. Assign the class with the highest predicted probability.

Classifier 1: "Is it Class A?" (vs. B, C, D)
Classifier 2: "Is it Class B?" (vs. A, C, D)
...

Limitation: Probabilities from each classifier don't sum to 1 and can overlap.

Softmax (Multinomial Logistic Regression)

A natural generalization using the softmax function:

P(y=k | x) = e^(wₖᵀx) / Σⱼ e^(wⱼᵀx)

All class probabilities sum to exactly 1. This is the principled approach and is used in neural networks' output layers.
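A short standard-library sketch of the softmax formula. The function name `softmax` and the max-subtraction step are choices made here. Subtracting the maximum score leaves the result unchanged mathematically and only guards against overflow in exp.

```python
import math

def softmax(scores):
    """Map a list of class scores (w_k^T x) to probabilities that sum to 1."""
    top = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and the second score fixed at 0, softmax reduces to the sigmoid.
z = 1.55
print(softmax([z, 0.0])[0], 1.0 / (1.0 + math.exp(-z)))
```

The second print illustrates why binary logistic regression is the K = 2 special case of the multinomial model.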

7. Regularization

Logistic regression is prone to overfitting, especially with many features or correlated predictors. Regularization adds a penalty term to the loss:

L2 Regularization (Ridge)

L_reg(w) = L(w) + λ · Σ wⱼ²

Shrinks all weights toward zero
Never sets them exactly to zero
Handles multicollinearity well

L1 Regularization (Lasso)

L_reg(w) = L(w) + λ · Σ |wⱼ|

Can drive weights exactly to zero → automatic feature selection
Creates sparse models
Useful when you suspect many features are irrelevant

Elastic Net

L_reg(w) = L(w) + λ₁ · Σ|wⱼ| + λ₂ · Σwⱼ²

Combines both penalties: the L1 term keeps the model sparse, while the L2 term stabilizes the weights of correlated features.

λ is the regularization strength:

λ = 0: No regularization (risk of overfitting)
λ → ∞: All weights → 0 (risk of underfitting)
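The penalty terms can be sketched as plain Python functions. All names here are illustrative. One assumption is worth flagging: the intercept w₀ (stored at index 0) is left out of the penalty. That is a common convention, but the formulas above do not specify it.

```python
def l2_penalty(weights, lam):
    """Ridge term: lam * sum of squared weights (intercept at index 0 skipped by assumption)."""
    return lam * sum(w * w for w in weights[1:])

def l1_penalty(weights, lam):
    """Lasso term: lam * sum of absolute weights (intercept skipped by assumption)."""
    return lam * sum(abs(w) for w in weights[1:])

def elastic_net_penalty(weights, lam1, lam2):
    """Elastic Net: an L1 term plus an L2 term, each with its own strength."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

def l2_gradient(weights, lam):
    """Gradient of the ridge term, to be added to the cross-entropy gradient."""
    return [0.0] + [2.0 * lam * w for w in weights[1:]]

w = [5.0, 3.0, -4.0]  # [w0, w1, w2], made-up values
print(l2_penalty(w, 0.1), l1_penalty(w, 0.1), elastic_net_penalty(w, 0.1, 0.1))
print(l2_gradient(w, 0.1))
```

The L2 gradient is proportional to each weight, so large weights are pulled back hardest. This produces the shrinkage effect described above.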

8. Assumptions of the Model

Logistic regression makes several assumptions that, when violated, degrade performance:

Assumption	Description
Linear decision boundary	Features and log-odds must be linearly related
Independence of observations	Samples should not be correlated (e.g., time series violates this)
Little or no multicollinearity	Highly correlated features inflate variance and destabilize coefficients
Large sample size	MLE is asymptotically consistent — small samples give unreliable estimates
No extreme outliers	Outliers can disproportionately influence the decision boundary
Binary (or categorical) dependent var	The target is a class label, not a continuous value; ordered categories call for ordinal logistic regression

9. Evaluation Metrics

Accuracy alone is insufficient — especially with imbalanced classes.

Confusion Matrix

              Predicted Positive   Predicted Negative
Actual Positive     TP                   FN
Actual Negative     FP                   TN

Derived Metrics

Accuracy    = (TP + TN) / (TP + TN + FP + FN)
Precision   = TP / (TP + FP)               # Of all positives predicted, how many were correct?
Recall      = TP / (TP + FN)               # Of all actual positives, how many did we catch?
F1-Score    = 2 · (Precision · Recall) / (Precision + Recall)
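The formulas above translate directly into code. The sketch below uses a hypothetical helper, `classification_metrics`, and made-up confusion-matrix counts. Returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions dictate.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # 0.0 by convention when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for a reasonably good classifier.
print(classification_metrics(tp=8, fp=2, fn=4, tn=86))

# A model that always predicts "negative" on 1% positives:
# accuracy looks excellent while recall and F1 are zero.
print(classification_metrics(tp=0, fp=0, fn=10, tn=990))
```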

ROC-AUC

The Receiver Operating Characteristic curve plots True Positive Rate vs. False Positive Rate at all thresholds. The Area Under the Curve (AUC):

AUC = 1.0 → Perfect classifier
AUC = 0.5 → Random guessing
AUC < 0.5 → Worse than random
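AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way by brute force over all pairs. The function name `roc_auc` is illustrative, and the quadratic loop is meant for small examples only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs correct -> 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # perfect ranking -> 1.0
print(roc_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))   # fully inverted -> 0.0
```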

Log Loss

Measures the quality of probability estimates (not just the hard class labels):

Log Loss = −(1/m) Σ [ yᵢ·log(p̂ᵢ) + (1−yᵢ)·log(1−p̂ᵢ) ]

Lower is better. A model with log loss = 0 is perfect.

10. Advantages

✅ Probabilistic Output

Returns probabilities, not just labels, and these are often reasonably well calibrated. This matters in medicine, finance, and risk assessment, where the model's confidence is as important as its prediction.

✅ Highly Interpretable

Coefficients directly encode the relationship between features and the log-odds of the outcome:

log(P/(1−P)) = w₀ + w₁x₁ + ...

Each wⱼ tells you: "A one-unit increase in xⱼ multiplies the odds by e^(wⱼ)," holding the other features fixed.
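The odds interpretation can be checked numerically. The sketch below reuses the weights from the heart-disease example in Section 4 and compares the odds for two patients who differ by one year of age. The helper name `odds` is hypothetical.

```python
import math

# Weights copied from the worked example in Section 4.
w0, w_age, w_chol = -8.5, 0.07, 0.03

def odds(age, cholesterol):
    """Odds P / (1 - P) of the positive class for one patient."""
    z = w0 + w_age * age + w_chol * cholesterol
    p = 1.0 / (1.0 + math.exp(-z))
    return p / (1.0 - p)

ratio = odds(46, 230) / odds(45, 230)   # one extra year of age, cholesterol fixed
print(ratio, math.exp(w_age))            # both are approximately 1.0725
```

With these weights, each additional year of age multiplies the odds of heart disease by about 1.07, regardless of the starting age or cholesterol value.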

✅ No Distributional Assumption on Features

Unlike Linear Discriminant Analysis (LDA), logistic regression does not assume features are normally distributed.

✅ Computationally Efficient

Training is fast even on large datasets. Convexity guarantees convergence to the global optimum.

✅ Robust to Small Datasets

Often performs well with limited data compared to complex models, because it has few parameters to estimate. Very small samples still give unstable coefficient estimates.

✅ Excellent Baseline

Always use logistic regression as a baseline before trying complex models. If a neural network only marginally beats it, the added complexity may not be worth it.

✅ Handles Regularization Naturally

L1/L2 penalties are trivially added and well-studied.

✅ Scales Well

Works well with stochastic gradient descent on very large datasets (online learning).

11. Drawbacks & Limitations

❌ Linearity Constraint

The most fundamental limitation. Logistic regression cannot learn non-linear decision boundaries without explicit feature engineering (polynomial features, interaction terms, etc.).

If your data looks like two concentric rings, logistic regression will fail — it can only draw a straight line.

❌ Feature Engineering Required

To capture complex patterns, you must manually create new features (e.g., x₁², x₁·x₂). This requires domain expertise and is labor-intensive.
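A tiny illustration of why an engineered feature helps, using XOR-style data rather than the concentric rings mentioned above. It is a sketch under stated assumptions: the four points, the coarse weight grid, and the hand-picked weights are all invented for this example, and a brute-force search over weights stands in for training.

```python
from itertools import product

POINTS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
LABELS = [0, 1, 1, 0]  # positive exactly when the two signs differ

def linear_features(x1, x2):
    return (1.0, x1, x2)                 # bias, x1, x2

def with_interaction(x1, x2):
    return (1.0, x1, x2, x1 * x2)        # adds the engineered feature x1*x2

def accuracy(weights, featurize):
    """Accuracy of the rule 'predict 1 when w^T phi(x) >= 0' on the four points."""
    correct = 0
    for (x1, x2), label in zip(POINTS, LABELS):
        score = sum(w * f for w, f in zip(weights, featurize(x1, x2)))
        correct += int((1 if score >= 0 else 0) == label)
    return correct / len(POINTS)

# Search a coarse grid of weights for the purely linear model.
GRID = [i / 2 for i in range(-4, 5)]     # -2.0, -1.5, ..., 2.0
best_linear = max(accuracy(w, linear_features) for w in product(GRID, repeat=3))

# One hand-picked weight on the interaction term is enough.
with_product = accuracy((0.0, 0.0, 0.0, -1.0), with_interaction)

print(best_linear, with_product)         # 0.75 1.0
```

No line separates these four points, so the linear model tops out at three out of four correct. In the expanded feature space the classes become separable by a hyperplane.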

❌ Poor with Many Irrelevant Features

Performance degrades without feature selection or strong regularization when many features are noise.

❌ Multicollinearity Sensitivity

Highly correlated features cause coefficients to become unstable and hard to interpret.

❌ Class Imbalance Sensitivity

With severely imbalanced classes (e.g., 99% negative, 1% positive), the model may simply predict "negative" for everything and still achieve 99% accuracy. Common remedies:

Class weighting
Resampling (SMOTE, oversampling)
Threshold tuning

❌ Not Ideal for Complex Relationships

In domains with rich, high-dimensional, non-linear interactions (images, text, audio), deep learning typically outperforms logistic regression by a wide margin.

❌ Assumes No Perfect Separation

If classes are perfectly linearly separable, Maximum Likelihood Estimation fails to converge (coefficients go to ±∞). This is called complete separation and requires regularization to fix.
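Complete separation is easy to reproduce on a toy problem. The sketch below is a deliberately simplified one-feature model with no intercept, trained with plain gradient descent. The data, the learning rate, and the function name `fit_weight` are all assumptions made for illustration.

```python
import math

XS = [-2.0, -1.0, 1.0, 2.0]
YS = [0, 0, 1, 1]            # perfectly separated at x = 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(num_steps, lam=0.0, alpha=0.5):
    """Gradient descent on cross-entropy plus an optional L2 term lam * w^2."""
    w = 0.0
    for _ in range(num_steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(XS, YS)) / len(XS)
        grad += 2.0 * lam * w            # gradient of the L2 penalty
        w -= alpha * grad
    return w

# Without regularization the weight keeps growing with more training.
print(fit_weight(100), fit_weight(1000), fit_weight(10000))

# With an L2 penalty it settles at a finite value.
print(fit_weight(1000, lam=0.1), fit_weight(10000, lam=0.1))
```

In the unregularized case every step still reduces the loss, because a larger weight pushes all four predictions closer to their labels. There is therefore no finite minimizer for the optimizer to reach.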

12. Logistic Regression vs. Linear Regression

Property	Linear Regression	Logistic Regression
Task	Regression	Classification
Output	Unbounded real value	Probability in (0, 1)
Activation	None (identity)	Sigmoid
Loss Function	MSE	Binary Cross-Entropy
Solution	Closed-form or GD	Gradient Descent (no closed form)
Interpretation	Direct effect on output	Effect on log-odds
Assumptions	Linearity, normality of errors	Linear log-odds, independence

13. Logistic Regression vs. Other Classifiers

Classifier	Non-linear	Probabilistic	Interpretable	Scalable	Overfitting Risk
Logistic Regression	❌	✅	✅	✅	Low
SVM	✅ (kernel)	❌ (margins)	⚠️	⚠️	Medium
Decision Tree	✅	⚠️	✅	✅	High
Random Forest	✅	✅	❌	✅	Low
Naive Bayes	❌	✅	✅	✅	Low
Neural Network	✅	✅	❌	✅	Very High
k-NN	✅	⚠️	⚠️	❌	Medium

Note: Logistic Regression is often the sweet spot of interpretability + performance for problems whose classes are approximately linearly separable.

14. Practical Tips & Gotchas

Feature Scaling Is Essential

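Regularized logistic regression is not scale-invariant, and gradient-based solvers converge slowly on unscaled inputs. Standardize features, fitting the scaler on the training data only: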

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Without scaling, the penalty shrinks coefficients unevenly (features measured in small units need large weights and are penalized more), and gradient descent converges slowly.

Handling Imbalanced Data

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')

Or use class_weight={0: 1, 1: 10} to manually upweight the minority class.

Adding Non-linearity

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)

This creates features like x₁², x₁·x₂, enabling curved decision boundaries.

Choosing the Regularization Hyperparameter

Use cross-validation over a grid:

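from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), params, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)   # Best setting is then available as grid.best_params_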

Note: In sklearn, C = 1/λ — higher C means less regularization.

Watch for Complete Separation

Symptoms: extremely large coefficients and convergence warnings from the solver. Fix: strengthen the regularization. In sklearn, L2 regularization is already applied by default, so lower C.

Solver Selection (sklearn)

Solver	Best For
`lbfgs`	Small/medium datasets, L2
`saga`	Large datasets, L1 or Elastic Net
`liblinear`	Small datasets, L1
`newton-cg`	Large dense data, L2

15. When to Use It

Use Logistic Regression when:

You need interpretability — regulators, doctors, or stakeholders need to understand predictions
Your classes are approximately linearly separable
You need a probability estimate, not just a hard label
You have a limited dataset and want to avoid overfitting
You need a fast baseline to compare against complex models
You're doing online learning or streaming data
Features are mostly relevant and not too correlated

Do NOT use Logistic Regression when:

Data has complex non-linear structure (use tree-based models or neural networks)
You have raw images, text, or audio without feature engineering
You have extremely high-dimensional sparse data with complex interactions
Accuracy is paramount and interpretability doesn't matter (use ensembles)

Summary

┌─────────────────────────────────────────────────────────┐
│              LOGISTIC REGRESSION AT A GLANCE            │
├─────────────────────────────────────────────────────────┤
│  CORE IDEA     Linear model + sigmoid → probability     │
│  TRAINING      Minimize cross-entropy via gradient desc  │
│  DECISION      Hyperplane: wᵀx = 0                      │
│  OUTPUT        P(y=1|x) ∈ (0, 1)                        │
│  STRENGTHS     Interpretable, fast, probabilistic        │
│  WEAKNESSES    Linear boundary, needs feature eng.       │
│  BEST FOR      Binary classification, baseline models   │
└─────────────────────────────────────────────────────────┘

Logistic Regression is not just a stepping stone to "real" ML — it is a production-grade algorithm used in credit scoring, medical diagnosis, ad-click prediction, and fraud detection at massive scale. Understanding it deeply means understanding the building blocks of neural networks, probabilistic modeling, and maximum likelihood estimation. Master this, and the rest of machine learning becomes clearer.
