Machine Learning — Logistic Regression
Logistic Regression (Detailed Version)
Definition
Logistic Regression is a Supervised Machine Learning algorithm used for classification problems, where the output belongs to a category or class.
Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts probabilities and then converts them into classes.
Examples:
- Spam detection
- Disease prediction
- Loan approval prediction
- Customer churn prediction
- Pass/Fail prediction
Objective of Logistic Regression
The objective of Logistic Regression is:
To estimate the probability that an observation belongs to a particular class.
Example:
Input: Student information
Output: Pass or Fail
Instead of directly predicting:
Pass
it predicts:
Probability of passing = 0.85
Then:
If probability > threshold
↓
Predict Pass
Else
↓
Predict Fail
Why can't Linear Regression be used for Classification?
Suppose:
Study Hours → Pass/Fail
Linear Regression may predict values such as:
1.5, -0.8, 2.4
But probabilities must satisfy:
0 ≤ Probability ≤ 1
Because its output is unbounded, Linear Regression is not suitable for predicting probabilities.
Logistic Regression uses the Sigmoid Function to convert outputs into probabilities.
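As a rough illustration of the problem, the sketch below fits an ordinary least-squares line to a small pass/fail dataset. The data and the helper names (`fit_line`, `linear_prediction`) are assumptions made for this example only.

```python
# Toy data assumed for illustration: study hours and pass (1) / fail (0).
hours = [1, 2, 3, 4, 5]
passed = [0, 0, 0, 1, 1]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(hours, passed)

def linear_prediction(x):
    """Raw output of the fitted line; nothing keeps it inside [0, 1]."""
    return intercept + slope * x

for h in (1, 3, 6):
    print(h, round(linear_prediction(h), 2))
```

On this toy data the line gives a negative value at 1 hour and a value above 1 at 6 hours, so its output cannot be read as a probability.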
Sigmoid Function
The sigmoid function converts any value into the range:
0 to 1
The logistic function is central to Logistic Regression:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Where:
- σ(z) = Probability
- e = Euler's number (≈ 2.718), the base of the natural logarithm
- z = Linear combination of the input features
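Linear part (θ₀ is the intercept; θ₁ … θₙ are the coefficients of the features x₁ … xₙ):
$$z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
Combined model:
$$P(Y=1) = \sigma(z) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)}}$$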
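The sigmoid and the combined model can be sketched in a few lines of Python. The function names and the coefficient values below are illustrative assumptions, not part of any particular library.

```python
import math

def sigmoid(z):
    """Map any real number z into the range 0 to 1."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_probability(features, weights, bias):
    """P(Y=1) for one observation: the sigmoid of the linear part z."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration.
probability = predict_probability([4.0], weights=[1.2], bias=-4.0)
print(round(probability, 2))
```

Splitting the computation on the sign of z is one common way to avoid overflow for very negative inputs; a direct `1 / (1 + math.exp(-z))` is enough for moderate values.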
Understanding Sigmoid Output
z | Sigmoid Output
-5 | 0.006
-2 | 0.12
0 | 0.50
2 | 0.88
5 | 0.99
Interpretation:
Probability near 0 → Class 0
Probability near 1 → Class 1
Working of Logistic Regression
Step 1: Collect Data
Example:
Study Hours | Pass
1 | 0
2 | 0
3 | 0
4 | 1
5 | 1
Where:
0 → Fail
1 → Pass
Step 2: Preprocess Data
Tasks:
- Handle missing values
- Remove duplicates
- Scale numerical values
- Encode categorical variables
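The four tasks above might look like this in plain Python. The records, column meanings, and the choice of mean imputation and standardization are assumptions made for the sake of the example; other strategies are equally valid.

```python
# Hypothetical raw records: (study_hours, attendance_level, passed).
rows = [
    (1.0, "low", 0),
    (2.0, "low", 0),
    (None, "high", 1),   # missing study hours
    (4.0, "high", 1),
    (4.0, "high", 1),    # exact duplicate of the previous row
    (5.0, "medium", 1),
]

# 1. Remove exact duplicates while keeping the original order.
unique_rows = list(dict.fromkeys(rows))

# 2. Fill missing study hours with the mean of the observed values.
observed = [r[0] for r in unique_rows if r[0] is not None]
mean_hours = sum(observed) / len(observed)
filled = [(mean_hours if h is None else h, level, y) for h, level, y in unique_rows]

# 3. Standardize study hours to zero mean and unit variance.
hours = [r[0] for r in filled]
mu = sum(hours) / len(hours)
sigma = (sum((h - mu) ** 2 for h in hours) / len(hours)) ** 0.5
scaled = [(h - mu) / sigma for h in hours]

# 4. One-hot encode the categorical column.
levels = sorted({r[1] for r in filled})
one_hot = [[1 if r[1] == level else 0 for level in levels] for r in filled]

features = [[s] + enc for s, enc in zip(scaled, one_hot)]
labels = [r[2] for r in filled]
```

In practice the mean and standard deviation used for filling and scaling are usually computed on the training portion only and then reused on the test portion, so that no test information leaks into training.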
Step 3: Split Dataset
Typical split:
Training = 80%
Testing = 20%
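A minimal sketch of such a split, assuming the data is held in two parallel lists. The function name, the `test_fraction` parameter, and the fixed seed are illustrative choices.

```python
import random

def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then hold out `test_fraction` of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Ten made-up observations: one feature each, label 1 when the feature is 5 or more.
features = [[float(h)] for h in range(10)]
labels = [1 if h >= 5 else 0 for h in range(10)]
x_train, x_test, y_train, y_test = train_test_split(features, labels)
print(len(x_train), len(x_test))  # 8 2
```

Shuffling before splitting matters when the rows are stored in a meaningful order, as they are in this made-up dataset.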
Step 4: Train Model
The algorithm learns the coefficients that best fit the training data:
Input Features
↓
Find coefficients
↓
Calculate probability
Step 5: Predict Probability
Example:
Study Hours=4
Model predicts:
P(pass)=0.85
Step 6: Apply Threshold
Usually:
Threshold=0.5
Decision:
If P>0.5
↓
Class=1
Else
↓
Class=0
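The decision rule is a one-line comparison. The function name `classify` is an assumption for this sketch.

```python
def classify(probability, threshold=0.5):
    """Return class 1 when the probability exceeds the threshold, else class 0."""
    return 1 if probability > threshold else 0

print(classify(0.85))        # 1
print(classify(0.85, 0.9))   # 0
```

Raising the threshold makes the model more conservative about predicting class 1, which tends to reduce false positives at the cost of more false negatives.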
Decision Boundary
The threshold creates a decision boundary.
Example:
Probability
  1 |                *****
    |           ****
0.5 |--------------------
    |      ****
    | ****
  0 |_____________________
Above 0.5 → Class 1
Below 0.5 → Class 0
Types of Logistic Regression
1. Binary Logistic Regression
Two classes only.
Examples:
- Spam/Not Spam
- Pass/Fail
- Fraud/Not Fraud
2. Multinomial Logistic Regression
More than two classes.
Examples:
- Predicting an animal type: Cat, Dog, or Bird
3. Ordinal Logistic Regression
Classes have order.
Examples:
- Rating: Poor, Average, Good, Excellent
Cost Function in Logistic Regression
Linear Regression uses MSE.
Logistic Regression uses Log Loss (Cross Entropy Loss).
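$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$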
Where:
- y = Actual value
- ŷ = Predicted probability
- m = Total observations
Goal:
Minimize Log Loss
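A small sketch of the log loss computation follows. The clipping constant `eps` is an assumption added here to keep the logarithm finite; it is not part of the mathematical definition.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Made-up predictions: the same two labels, scored well and scored badly.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
print(round(confident_right, 3), round(confident_wrong, 3))
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what pushes the model toward well-calibrated probabilities.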
Gradient Descent
Parameters are updated using:
$$\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$$
Where:
- θ = Parameter
- α = Learning rate
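The update rule can be sketched for a single feature as below. For log loss with a sigmoid output, the gradient with respect to each parameter is the average of (ŷ − y) times the corresponding input. The learning rate, epoch count, and toy data are assumptions picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, learning_rate=0.1, epochs=5000):
    """Batch gradient descent for one feature; returns (theta0, theta1)."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Prediction error (y_hat - y) for every observation.
        errors = [sigmoid(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Gradient of the log loss with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Step against the gradient, scaled by the learning rate.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Toy data assumed for illustration: study hours and pass (1) / fail (0).
hours = [1, 2, 3, 4, 5]
passed = [0, 0, 0, 1, 1]
theta0, theta1 = train(hours, passed)
predictions = [1 if sigmoid(theta0 + theta1 * x) > 0.5 else 0 for x in hours]
print(predictions)
```

Because this toy dataset is perfectly separable, the coefficients keep growing slowly the longer training runs; real implementations typically add regularization or a stopping criterion.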
Performance Metrics for Logistic Regression
Since Logistic Regression is a classification algorithm, regression metrics such as RMSE and MAE are not used.
1. Accuracy
Measures the proportion of all observations that were predicted correctly.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Where:
- TP = True Positive
- TN = True Negative
- FP = False Positive
- FN = False Negative
2. Precision
Measures the proportion of predicted positives that are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Interpretation:
Out of predicted positive cases, how many were actually positive?
3. Recall
Measures how many actual positives were identified.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Interpretation:
Out of actual positive cases, how many were correctly found?
4. F1 Score
Harmonic mean of Precision and Recall.
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
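Accuracy, Precision, Recall, and F1 all follow from the four counts TP, TN, FP, and FN. The sketch below computes them for a made-up set of labels; the function names and the zero-division fallbacks are assumptions of this example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))        # (3, 3, 1, 1)
print(classification_metrics(y_true, y_pred))
```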
5. Confusion Matrix
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
6. ROC Curve and AUC
ROC:
True Positive Rate vs False Positive Rate
AUC interpretation:
1.0 → Perfect model
0.5 → Random prediction
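One way to compute AUC without drawing the curve is to use its ranking interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition; the function name and sample scores are assumptions, and the quadratic loop is only suitable for small datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked in the right order."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```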
Assumptions of Logistic Regression
- Binary or categorical target variable
- Independent observations
- Linear relationship between features and log-odds
- Little or no multicollinearity among the features
- Large sample size preferred
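The log-odds assumption can be checked numerically: applying the logit, log(p / (1 − p)), to the model's probability recovers the linear part z, so each one-unit increase in a feature shifts the log-odds by that feature's coefficient. The coefficients below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit: the natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, not fitted to any real data.
theta0, theta1 = -4.0, 1.2
hours = [1, 2, 3, 4, 5]

logits = [log_odds(sigmoid(theta0 + theta1 * h)) for h in hours]
# Difference in log-odds between consecutive hours.
steps = [b - a for a, b in zip(logits, logits[1:])]
print([round(s, 6) for s in steps])  # each step is approximately theta1
```

The probabilities themselves do not change by a constant amount per hour; only the log-odds do.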
Advantages
- Simple and easy to implement
- Fast training
- Outputs probabilities
- Less computational cost
- Works well for linearly separable data
Disadvantages
- Cannot model highly complex relationships
- Sensitive to outliers
- Assumes linearity in log-odds
- Performance decreases with overlapping classes
Real-world Applications
- Email spam detection
- Medical diagnosis
- Credit card fraud detection
- Customer churn prediction
- Loan approval systems
- Sentiment analysis
Complete Workflow
Collect Data
↓
Preprocess Data
↓
Split Data
↓
Train Logistic Regression Model
↓
Predict Probability
↓
Apply Threshold
↓
Evaluate Performance
↓
Optimize Model
One-line summary
Logistic Regression is a supervised classification algorithm that predicts probabilities using the sigmoid function and classifies observations into categories.