Logistic Regression
Logistic Regression — Deep Analysis
1. What Is Logistic Regression?
Logistic Regression is a supervised machine learning algorithm used for classification tasks, despite the word "regression" in its name. The name is historical: it uses a regression-like linear combination of inputs, but passes the result through a non-linear function to produce a probability between 0 and 1.
It answers questions like:
- Is this email spam or not spam?
- Will this customer churn or stay?
- Is this tumor malignant or benign?
Key identity: Logistic Regression is a discriminative, probabilistic, linear classifier.
| Property | Value |
|---|---|
| Type | Supervised Learning |
| Task | Classification (binary or multiclass) |
| Output | Probability → class label |
| Decision surface | Linear |
| Parametric | Yes |
2. The Core Problem It Solves
Linear regression predicts a continuous value, such as house prices. Classification, by contrast, needs a bounded output: a value between 0 and 1 that can be interpreted as a probability.
If you naively apply linear regression to a binary label (0 or 1):
ŷ = w₀ + w₁x₁ + w₂x₂ + ...
The output can be any real number (e.g., −4.7 or 13.2), which is meaningless as a probability. You also risk:
- Predictions outside [0, 1]
- A model that is highly sensitive to outliers
- Violation of the core probabilistic interpretation
Logistic Regression solves this by squashing the linear output into the (0, 1) range using the sigmoid function.
3. Mathematical Foundation
3.1 The Sigmoid Function
The sigmoid (logistic) function is:
σ(z) = 1 / (1 + e^(−z))
Where z is the linear combination of inputs:
z = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
  = wᵀx (in vector notation, with x₀ = 1 so that w₀ is absorbed into w)
Properties of σ(z):
| Input z | σ(z) value | Interpretation |
|---|---|---|
| z → +∞ | σ(z) → 1 | Confident positive class |
| z = 0 | σ(z) = 0.5 | Maximum uncertainty |
| z → −∞ | σ(z) → 0 | Confident negative class |
The curve is S-shaped, continuous, and differentiable — critical properties for gradient-based optimization.
Derivative of the sigmoid (needed to derive the gradient of the loss):
σ'(z) = σ(z) · (1 − σ(z))
This elegant self-referential derivative makes the math clean and efficient.
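The derivative identity can be checked numerically. The following is a minimal sketch using only the standard library; the function names `sigmoid_grad` and `numeric_grad` are illustrative choices, and the finite-difference step `h` is an assumption, not a recommended setting.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so that math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) is at most 1
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(z: float, h: float = 1e-6) -> float:
    """Central finite difference, used only as a sanity check."""
    return (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

for z in (-3.0, 0.0, 2.5):
    print(z, sigmoid(z), sigmoid_grad(z), numeric_grad(z))
```

The analytic and numeric columns should agree to many decimal places, and the derivative peaks at z = 0, where σ(z) = 0.5.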
3.2 The Decision Boundary
The model outputs a probability. To make a class prediction, we apply a threshold (typically 0.5):
ŷ = 1 if σ(wᵀx) ≥ 0.5
ŷ = 0 if σ(wᵀx) < 0.5
Since σ(z) = 0.5 when z = 0, the decision boundary is where:
wᵀx = 0
This is a hyperplane in the feature space — a line in 2D, a plane in 3D, and so on. This is why logistic regression is a linear classifier: it can only separate classes with a straight line/plane.
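To illustrate that thresholding the probability at 0.5 is the same as checking the sign of wᵀx, here is a small sketch. The weights and points are made-up values for illustration, and each point carries a leading 1 so that the first weight acts as the bias.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """Linear score w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def predict_by_probability(w, x, threshold=0.5):
    return 1 if sigmoid(score(w, x)) >= threshold else 0

def predict_by_score(w, x):
    # Same decision without evaluating the sigmoid at all.
    return 1 if score(w, x) >= 0.0 else 0

w = [-1.0, 2.0, -0.5]  # hypothetical weights: bias, w1, w2
points = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.5, 0.0],   # lies exactly on the boundary (score = 0)
    [1.0, 2.0, 4.0],
    [1.0, 0.2, 1.0],
]

labels_prob = [predict_by_probability(w, x) for x in points]
labels_score = [predict_by_score(w, x) for x in points]
print(labels_prob, labels_score)
```

Both rules give the same labels here. Moving the threshold away from 0.5 shifts the hyperplane in parallel but keeps it linear.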
3.3 Probability Interpretation
The full probabilistic model reads:
P(y=1 | x; w) = σ(wᵀx)
P(y=0 | x; w) = 1 − σ(wᵀx)
Compactly, for label y ∈ {0, 1}:
P(y | x; w) = σ(wᵀx)^y · (1 − σ(wᵀx))^(1−y)
This is the Bernoulli likelihood — the statistical engine that powers the loss function.
4. How It Works — Step by Step
Here's the full forward pass, walking through a concrete example.
Given: A patient's age (x₁ = 45) and cholesterol (x₂ = 230). Predict heart disease (1) or not (0).
Step 1 — Compute the linear score:
z = w₀ + w₁·45 + w₂·230
= −8.5 + 0.07·45 + 0.03·230
= −8.5 + 3.15 + 6.9
= 1.55
Step 2 — Apply sigmoid:
σ(1.55) = 1 / (1 + e^(−1.55))
= 1 / (1 + 0.212)
≈ 0.825
Step 3 — Interpret as probability:
P(heart disease | patient data) ≈ 82.5%
Step 4 — Apply threshold:
0.825 ≥ 0.5 → Predict: Heart Disease = YES
The threshold is a hyperparameter you can tune (e.g., set to 0.3 for high-recall scenarios in medical diagnosis).
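The four steps above can be reproduced in a few lines. This sketch reuses the weights and patient values from the worked example; the helper name `classify` is an illustrative choice.

```python
import math

# Weights and inputs taken from the worked example above.
w0, w1, w2 = -8.5, 0.07, 0.03
age, cholesterol = 45, 230

z = w0 + w1 * age + w2 * cholesterol   # Step 1: linear score
p = 1.0 / (1.0 + math.exp(-z))         # Step 2: sigmoid

def classify(prob: float, threshold: float = 0.5) -> int:
    """Steps 3 and 4: read the probability, then apply a threshold."""
    return 1 if prob >= threshold else 0

print(f"z = {z:.2f}, P(heart disease) = {p:.3f}")
print("threshold 0.5 ->", classify(p))
print("threshold 0.3 ->", classify(p, threshold=0.3))
```

Lowering the threshold never turns a positive prediction negative, so it can only raise recall, at the cost of more false positives.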
5. How It Is Trained
5.1 Loss Function: Binary Cross-Entropy
Mean Squared Error (MSE) is a poor fit for logistic regression because:
- The sigmoid makes the error surface non-convex under MSE
- Gradient descent can get stuck in local minima
- MSE doesn't align with the probabilistic interpretation
Instead, we use Binary Cross-Entropy Loss (also called Log Loss):
L(w) = −(1/m) Σᵢ [ yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) ]
Where:
- m = number of training samples
- yᵢ = true label for sample i
- ŷᵢ = predicted probability for sample i
Intuition of each term:
| True label yᵢ | Active term | Behavior |
|---|---|---|
| 1 | −log(ŷᵢ) | Penalizes predicting low prob for a positive |
| 0 | −log(1 − ŷᵢ) | Penalizes predicting high prob for a negative |
Why log? The logarithm converts the product of probabilities (from the Bernoulli likelihood) into a sum — a numerically stable, convex function. The result is a convex loss surface with a single global minimum.
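The step from the Bernoulli likelihood to the loss can be written out explicitly. This derivation assumes the m training samples are independent; the symbol ℓ(w) for the likelihood is introduced here only to keep it distinct from the loss L(w).

```latex
\begin{aligned}
\ell(w) &= \prod_{i=1}^{m} \hat{y}_i^{\,y_i}\,(1-\hat{y}_i)^{1-y_i},
\qquad \hat{y}_i = \sigma(w^\top x_i) \\
\log \ell(w) &= \sum_{i=1}^{m} \big[\, y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \,\big] \\
L(w) &= -\frac{1}{m}\,\log \ell(w)
      = -\frac{1}{m}\sum_{i=1}^{m} \big[\, y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \,\big]
\end{aligned}
```

Minimizing L(w) is therefore the same as maximizing the likelihood; the factor 1/m only rescales the objective.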
Cross-entropy per sample:
If y=1 and ŷ=0.9 → Loss = −log(0.9) ≈ 0.105 ✅ Low penalty
If y=1 and ŷ=0.1 → Loss = −log(0.1) ≈ 2.303 ❌ High penalty
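The two per-sample values above can be reproduced with a short helper. This is a sketch; the clipping constant `eps` is an assumption added to avoid log(0) and is not part of the formula itself.

```python
import math

def bce(y: int, p: float, eps: float = 1e-15) -> float:
    """Binary cross-entropy for one sample with label y and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_bce(labels, probs) -> float:
    """Average loss over a batch, matching L(w) above."""
    return sum(bce(y, p) for y, p in zip(labels, probs)) / len(labels)

print(round(bce(1, 0.9), 3))  # confident and correct: small loss
print(round(bce(1, 0.1), 3))  # confident and wrong: large loss
print(round(mean_bce([1, 1], [0.9, 0.1]), 3))
```

The loss grows without bound as a confident prediction becomes more wrong, which is why a single badly mispredicted sample can dominate the average.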
5.2 Gradient Descent
Since there's no closed-form solution, we use iterative optimization. The gradient of the cross-entropy loss with respect to weights is:
∂L/∂w = (1/m) · Xᵀ · (ŷ − y)
Where (ŷ − y) is the vector of prediction errors. Notice this has the same form as linear regression's gradient — one of the beautiful symmetries of these models.
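A common way to gain confidence in a gradient formula is to compare it with finite differences. The sketch below does that for the expression above, using plain Python lists; the data, the weights, and the step size `h` are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def loss(w, X, y):
    """Mean binary cross-entropy."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def gradient(w, X, y):
    """(1/m) * X^T (y_hat - y), written with explicit loops."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = predict(w, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / len(X)
    return grad

def numeric_gradient(w, X, y, h=1e-6):
    """Central differences, one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        grad.append((loss(w_plus, X, y) - loss(w_minus, X, y)) / (2.0 * h))
    return grad

# Toy data: first column is the constant 1 for the bias term.
X = [[1.0, 0.5, 1.2], [1.0, -1.5, 0.3], [1.0, 2.0, -0.7], [1.0, 0.1, 0.4]]
y = [1, 0, 1, 0]
w = [0.1, -0.2, 0.3]

analytic = gradient(w, X, y)
numeric = numeric_gradient(w, X, y)
print(analytic)
print(numeric)
```

If the two vectors disagree by more than rounding noise, the gradient code (or the loss) has a bug.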
Three Variants:
| Variant | Update Frequency | Pros | Cons |
|---|---|---|---|
| Batch Gradient Descent | Once per full dataset | Stable, exact gradient | Slow on large datasets |
| Stochastic Gradient Descent | Once per sample | Fast, online learning | Noisy, oscillating loss |
| Mini-Batch GD | Once per batch | Best of both worlds | Requires tuning batch size |
5.3 The Update Rule
At each iteration, weights are updated:
w := w − α · ∂L/∂w
Where α (alpha) is the learning rate — a critical hyperparameter:
- Too high: Overshoots the minimum, diverges
- Too low: Converges too slowly
- Just right: Smooth convergence to the global minimum
Full training loop pseudocode:
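import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumes X is an (m, n) array with a leading column of ones, y is an (m,) array
# of 0/1 labels, and alpha and num_epochs are already defined.
m, n = X.shape
w = np.zeros(n)                           # initialize weights (zeros or small random values)
for epoch in range(num_epochs):
    z = X @ w                             # linear combination
    y_hat = sigmoid(z)                    # apply sigmoid
    error = y_hat - y                     # prediction error
    gradient = (1 / m) * X.T @ error      # gradient of the cross-entropy loss
    w = w - alpha * gradient              # update weights
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # track loss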
Convergence is detected when |L(wₜ₊₁) − L(wₜ)| < ε for some small tolerance ε.
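For a version that runs end to end, here is a sketch of batch gradient descent with the tolerance-based stopping rule described above. It uses only the standard library; the toy dataset, learning rate, and tolerance are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def cross_entropy(w, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def train(X, y, alpha=0.5, max_epochs=5000, tol=1e-8):
    """Batch gradient descent; stops when the loss changes by less than tol."""
    m = len(X)
    w = [0.0] * len(X[0])
    history = [cross_entropy(w, X, y)]
    for _ in range(max_epochs):
        errors = [predict_proba(w, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                for j in range(len(w))]
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
        history.append(cross_entropy(w, X, y))
        if abs(history[-2] - history[-1]) < tol:
            break
    return w, history

# Toy 1-D data (hypothetical). The first column is the bias input x0 = 1.
# The classes overlap, so the weights stay finite.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

w, history = train(X, y)
print("weights:", w)
print("epochs run:", len(history) - 1)
print("loss: %.4f -> %.4f" % (history[0], history[-1]))
```

With NumPy, the explicit loops collapse into the matrix form used in the loop above; the logic is otherwise the same.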
Advanced optimizers like Adam, RMSProp, and L-BFGS can dramatically speed up convergence compared to vanilla gradient descent.
6. Multiclass Extension
Standard logistic regression handles binary problems. For K > 2 classes, two strategies exist:
One-vs-Rest (OvR)
Train K separate binary classifiers, one per class against all others. Assign the class with the highest predicted probability.
Classifier 1: "Is it Class A?" (vs. B, C, D)
Classifier 2: "Is it Class B?" (vs. A, C, D)
...
Limitation: Probabilities from each classifier don't sum to 1 and can overlap.
Softmax (Multinomial Logistic Regression)
A natural generalization using the softmax function:
P(y=k | x) = e^(wₖᵀx) / Σⱼ e^(wⱼᵀx)
All class probabilities sum to exactly 1. This is the principled approach and is used in neural networks' output layers.
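A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result; the example scores are arbitrary.

```python
import math

def softmax(scores):
    """Convert K real-valued scores into K probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and the second score fixed at 0, softmax reduces to the sigmoid.
z = 1.55
two_class = softmax([z, 0.0])
print(two_class[0], 1.0 / (1.0 + math.exp(-z)))
```

The two-class check shows why softmax is described as a generalization: binary logistic regression is the special case K = 2.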
7. Regularization
Logistic regression is prone to overfitting, especially with many features or correlated predictors. Regularization adds a penalty term to the loss:
L2 Regularization (Ridge)
L_reg(w) = L(w) + λ · Σ wⱼ²
- Shrinks all weights toward zero
- Never sets them exactly to zero
- Handles multicollinearity well
L1 Regularization (Lasso)
L_reg(w) = L(w) + λ · Σ |wⱼ|
- Can drive weights exactly to zero → automatic feature selection
- Creates sparse models
- Useful when you suspect many features are irrelevant
Elastic Net
L_reg(w) = L(w) + λ₁ · Σ|wⱼ| + λ₂ · Σwⱼ²
Combines both penalties: L1's sparsity and L2's stability with correlated features.
λ is the regularization strength:
- λ = 0: No regularization (risk of overfitting)
- λ → ∞: All weights → 0 (risk of underfitting)
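The three penalties can be sketched as plain functions. One assumption here goes beyond the formulas above: `weights` is taken to exclude the bias w₀, which is commonly left unpenalized. The function names are illustrative.

```python
def l1_penalty(weights, lam):
    """Lasso term: lam * sum(|w_j|)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge term: lam * sum(w_j^2)."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam1, lam2):
    """Weighted combination of the L1 and L2 terms."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

def regularized_loss(base_loss, weights, lam1=0.0, lam2=0.0):
    """Data loss plus penalty; lam1 = lam2 = 0 recovers the plain loss."""
    return base_loss + elastic_net_penalty(weights, lam1, lam2)

weights = [3.0, -4.0, 0.0]  # hypothetical non-bias weights
print(l1_penalty(weights, 0.1))
print(l2_penalty(weights, 0.1))
print(regularized_loss(0.5, weights, lam1=0.1, lam2=0.1))
```

Because the L2 term squares each weight, it punishes a few large weights far more than many small ones, whereas the L1 term charges the same price per unit of magnitude.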
8. Assumptions of the Model
Logistic regression makes several assumptions that, when violated, degrade performance:
| Assumption | Description |
|---|---|
| Linear decision boundary | Features and log-odds must be linearly related |
| Independence of observations | Samples should not be correlated (e.g., time series violates this) |
| Little or no multicollinearity | Highly correlated features inflate variance and destabilize coefficients |
| Large sample size | MLE is asymptotically consistent — small samples give unreliable estimates |
| No extreme outliers | Outliers can disproportionately influence the decision boundary |
| Categorical dependent variable | The output is a class label (binary in the standard model), not a continuous value |
9. Evaluation Metrics
Accuracy alone is insufficient — especially with imbalanced classes.
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Derived Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) # Of all positives predicted, how many were correct?
Recall = TP / (TP + FN) # Of all actual positives, how many did we catch?
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
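These formulas translate directly into code. The sketch below computes them from the four confusion-matrix counts; the counts are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something the formulas dictate.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example precision (0.8) is higher than recall (about 0.667): the model is fairly trustworthy when it says "positive" but misses a third of the actual positives.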
ROC-AUC
The Receiver Operating Characteristic curve plots True Positive Rate vs. False Positive Rate at all thresholds. The Area Under the Curve (AUC):
- AUC = 1.0 → Perfect classifier
- AUC = 0.5 → Random guessing
- AUC < 0.5 → Worse than random
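AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way with a simple O(P·N) loop, which is fine for illustration but not how a production library would do it; the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

Because only the ordering of scores matters, AUC is unchanged by any monotonic rescaling of the predicted probabilities and does not depend on the classification threshold.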
Log Loss
Measures the quality of probability estimates (not just the hard class labels):
Log Loss = −(1/m) Σ [ yᵢ·log(p̂ᵢ) + (1−yᵢ)·log(1−p̂ᵢ) ]
Lower is better. A model with log loss = 0 is perfect.
10. Advantages
✅ Probabilistic Output
Returns probabilities, not just labels, and these are often reasonably well calibrated. This is critical in medicine, finance, and risk assessment, where the model's confidence matters.
✅ Highly Interpretable
Coefficients directly encode the relationship between features and the log-odds of the outcome:
log(P/(1−P)) = w₀ + w₁x₁ + ...
Each wⱼ tells you: "A one-unit increase in xⱼ, with the other features held fixed, multiplies the odds by e^(wⱼ)."
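The odds-ratio reading of a coefficient can be verified numerically. This sketch reuses the weights from the heart-disease example in Section 4 and compares the odds for two patients who differ only by one year of age; the helper names are illustrative.

```python
import math

# Weights from the worked example in Section 4.
w0, w1, w2 = -8.5, 0.07, 0.03

def prob(age: float, cholesterol: float) -> float:
    z = w0 + w1 * age + w2 * cholesterol
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    return p / (1.0 - p)

# Same cholesterol, age differs by exactly one unit.
ratio = odds(prob(46, 230)) / odds(prob(45, 230))

print(ratio)         # odds ratio measured from the model's predictions
print(math.exp(w1))  # odds ratio implied by the coefficient
```

Both lines print about 1.07: each extra year of age multiplies the odds by roughly 7%, regardless of the starting age or cholesterol value. The effect on the probability itself is not constant, since it depends on where on the sigmoid the patient sits.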
✅ No Distributional Assumption on Features
Unlike Linear Discriminant Analysis (LDA), logistic regression does not assume features are normally distributed.
✅ Computationally Efficient
Training is fast even on large datasets. Convexity guarantees convergence to the global optimum.
✅ Robust to Small Datasets
With few parameters to estimate, it often performs well on limited data compared to complex models, though very small samples still yield unstable coefficient estimates.
✅ Excellent Baseline
Always use logistic regression as a baseline before trying complex models. If a neural network only marginally beats it, the added complexity may not be worth it.
✅ Handles Regularization Naturally
L1/L2 penalties are trivially added and well-studied.
✅ Scales Well
Works well with stochastic gradient descent on very large datasets (online learning).
11. Drawbacks & Limitations
❌ Linearity Constraint
The most fundamental limitation. Logistic regression cannot learn non-linear decision boundaries without explicit feature engineering (polynomial features, interaction terms, etc.).
If your data looks like two concentric rings, logistic regression will fail — it can only draw a straight line.
❌ Feature Engineering Required
To capture complex patterns, you must manually create new features (e.g., x₁², x₁·x₂). This requires domain expertise and is labor-intensive.
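The concentric-rings case shows what such feature engineering buys. In the sketch below, adding the single feature x₁² + x₂² makes the two rings separable by a linear rule. The ring radii and the weights are hand-picked for illustration, not learned.

```python
import math

def ring(radius: float, n_points: int):
    """Evenly spaced points on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n_points),
             radius * math.sin(2 * math.pi * k / n_points))
            for k in range(n_points)]

inner = ring(1.0, 12)   # class 0
outer = ring(3.0, 12)   # class 1

def with_radius_feature(point):
    """Augment (x1, x2) with the engineered feature x1^2 + x2^2."""
    x1, x2 = point
    return [1.0, x1, x2, x1 * x1 + x2 * x2]

def predict(w, features):
    z = sum(wj * fj for wj, fj in zip(w, features))
    return 1 if z >= 0 else 0

# Hypothetical weights: z = (x1^2 + x2^2) - 4, i.e. a circle of radius 2.
w = [-4.0, 0.0, 0.0, 1.0]

inner_preds = [predict(w, with_radius_feature(p)) for p in inner]
outer_preds = [predict(w, with_radius_feature(p)) for p in outer]
print(inner_preds)
print(outer_preds)
```

No weights on the raw (x₁, x₂) alone can achieve this: both rings are centered at the origin, so the average linear score over each ring equals the bias w₀, which cannot be negative for one ring and positive for the other. The model is still linear in its inputs; the curvature comes entirely from the new feature.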
❌ Poor with Many Irrelevant Features
Performance degrades without feature selection or strong regularization when many features are noise.
❌ Multicollinearity Sensitivity
Highly correlated features cause coefficients to become unstable and hard to interpret.
❌ Class Imbalance Sensitivity
With severely imbalanced classes (e.g., 99% negative, 1% positive), the model may simply predict "negative" for everything and still achieve 99% accuracy. Common remedies:
- Class weighting
- Resampling (SMOTE, oversampling)
- Threshold tuning
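The accuracy trap is easy to reproduce. The sketch below scores an "always negative" predictor on a synthetic 1% positive dataset, then computes class weights using the n_samples / (n_classes × class_count) heuristic that scikit-learn documents for `class_weight='balanced'`. The dataset is fabricated purely for illustration.

```python
# Synthetic labels: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)          # a "model" that always predicts negative

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall = tp / (tp + fn)

# Balanced class weights: rarer classes get proportionally larger weights.
n_samples, n_classes = len(y_true), 2
counts = {c: y_true.count(c) for c in (0, 1)}
balanced_weights = {c: n_samples / (n_classes * counts[c]) for c in counts}

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
print(balanced_weights)
```

Accuracy is 0.99 while recall is 0, which is why recall, precision, or AUC should be reported alongside accuracy. With balanced weights, each positive sample counts about 99 times as much as a negative in the loss, so ignoring the minority class is no longer cheap.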
❌ Not Ideal for Complex Relationships
In domains with rich, high-dimensional, non-linear interactions (images, text, audio), deep learning typically outperforms logistic regression by a wide margin.
❌ Assumes No Perfect Separation
If classes are perfectly linearly separable, Maximum Likelihood Estimation fails to converge (coefficients go to ±∞). This is called complete separation and requires regularization to fix.
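Complete separation can be observed on a tiny example. The sketch below runs gradient descent on four perfectly separable 1-D points using a single weight and no bias (a simplification made here), first without a penalty and then with the L2 term λ·w² from Section 7. The data, learning rate, and λ are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Perfectly separable: every negative x is class 0, every positive x is class 1.
X = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

def fit(lam: float, steps: int, alpha: float = 0.5) -> float:
    """Gradient descent on mean cross-entropy plus lam * w^2."""
    w = 0.0
    for _ in range(steps):
        data_grad = sum((sigmoid(w * x) - t) * x for x, t in zip(X, y)) / len(X)
        w -= alpha * (data_grad + 2.0 * lam * w)
    return w

# Without regularization the weight keeps growing with more training.
w_100, w_1000, w_10000 = fit(0.0, 100), fit(0.0, 1000), fit(0.0, 10000)
print(w_100, w_1000, w_10000)

# With an L2 penalty the weight settles at a finite value.
w_reg_1000, w_reg_10000 = fit(0.1, 1000), fit(0.1, 10000)
print(w_reg_1000, w_reg_10000)
```

The unregularized weight never stops increasing because a larger weight always pushes the predicted probabilities closer to 0 and 1, lowering the loss a little further. The penalty adds a cost that eventually outweighs that gain.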
12. Logistic Regression vs. Linear Regression
| Property | Linear Regression | Logistic Regression |
|---|---|---|
| Task | Regression | Classification |
| Output | Unbounded real value | Probability in (0, 1) |
| Activation | None (identity) | Sigmoid |
| Loss Function | MSE | Binary Cross-Entropy |
| Solution | Closed-form or GD | Gradient Descent (no closed form) |
| Interpretation | Direct effect on output | Effect on log-odds |
| Assumptions | Linearity, normality of errors | Linear log-odds, independence |
13. Logistic Regression vs. Other Classifiers
| Classifier | Non-linear | Probabilistic | Interpretable | Scalable | Overfitting Risk |
|---|---|---|---|---|---|
| Logistic Regression | ❌ | ✅ | ✅ | ✅ | Low |
| SVM | ✅ (kernel) | ❌ (margins) | ⚠️ | ⚠️ | Medium |
| Decision Tree | ✅ | ⚠️ | ✅ | ✅ | High |
| Random Forest | ✅ | ✅ | ❌ | ✅ | Low |
| Naive Bayes | ❌ | ✅ | ✅ | ✅ | Low |
| Neural Network | ✅ | ✅ | ❌ | ✅ | Very High |
| k-NN | ✅ | ⚠️ | ⚠️ | ❌ | Medium |
Note: Logistic Regression is often the sweet spot of interpretability + performance for problems where the classes are approximately linearly separable.
14. Practical Tips & Gotchas
Feature Scaling Is Essential
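Regularized logistic regression is not scale-invariant, and gradient-based solvers converge slowly on unscaled features. Standardize features, fitting the scaler on the training set only: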
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Without scaling, features with large ranges dominate the weights, and gradient descent converges slowly.
Handling Imbalanced Data
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
Or use class_weight={0: 1, 1: 10} to manually upweight the minority class.
Adding Non-linearity
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)
This creates features like x₁², x₁·x₂, enabling curved decision boundaries.
Choosing the Regularization Hyperparameter
Use cross-validation over a grid:
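from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}  # candidate inverse regularization strengths
grid = GridSearchCV(LogisticRegression(), params, cv=5, scoring='roc_auc')  # 5-fold CV, ranked by ROC-AUC
grid.fit(X_train, y_train)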
Note: In sklearn, C = 1/λ — higher C means less regularization.
Watch for Complete Separation
Symptoms: Extremely large coefficients, warning messages from the solver. Fix: sklearn's LogisticRegression already applies L2 regularization by default (penalty='l2'), so strengthen it by lowering C.
Solver Selection (sklearn)
| Solver | Best For |
|---|---|
| lbfgs | Small/medium datasets, L2 |
| saga | Large datasets, L1 or Elastic Net |
| liblinear | Small datasets, L1 |
| newton-cg | Large dense data, L2 |
15. When to Use It
Use Logistic Regression when:
- You need interpretability — regulators, doctors, or stakeholders need to understand predictions
- Your classes are approximately linearly separable
- You need a probability estimate, not just a hard label
- You have a limited dataset and want to avoid overfitting
- You need a fast baseline to compare against complex models
- You're doing online learning or streaming data
- Features are mostly relevant and not too correlated
Do NOT use Logistic Regression when:
- Data has complex non-linear structure (use tree-based models or neural networks)
- You have raw images, text, or audio without feature engineering
- You have extremely high-dimensional sparse data with complex interactions
- Accuracy is paramount and interpretability doesn't matter (use ensembles)
Summary
┌─────────────────────────────────────────────────────────┐
│             LOGISTIC REGRESSION AT A GLANCE             │
├─────────────────────────────────────────────────────────┤
│ CORE IDEA  Linear model + sigmoid → probability         │
│ TRAINING   Minimize cross-entropy via gradient descent  │
│ DECISION   Hyperplane: wᵀx = 0                          │
│ OUTPUT     P(y=1|x) ∈ (0, 1)                            │
│ STRENGTHS  Interpretable, fast, probabilistic           │
│ WEAKNESSES Linear boundary, needs feature engineering   │
│ BEST FOR   Binary classification, baseline models       │
└─────────────────────────────────────────────────────────┘
Logistic Regression is not just a stepping stone to "real" ML — it is a production-grade algorithm used in credit scoring, medical diagnosis, ad-click prediction, and fraud detection at massive scale. Understanding it deeply means understanding the building blocks of neural networks, probabilistic modeling, and maximum likelihood estimation. Master this, and the rest of machine learning becomes clearer.