Logistic Regression (Detailed Version)

Definition

Logistic Regression is a Supervised Machine Learning algorithm used for classification problems, where the output belongs to a category or class.

Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts probabilities and then converts them into classes.

Examples:

  • Spam detection
  • Disease prediction
  • Loan approval prediction
  • Customer churn prediction
  • Pass/Fail prediction

Objective of Logistic Regression

The objective of Logistic Regression is:

To estimate the probability that an observation belongs to a particular class.

Example:

Input: Student information
Output: Pass or Fail

Instead of directly predicting:

Pass

it predicts:

Probability of passing = 0.85

Then:

If probability > threshold
        ↓
Predict Pass

Else
        ↓
Predict Fail

Why Linear Regression Cannot Be Used for Classification

Suppose:

Study Hours → Pass/Fail

Linear Regression may predict:

1.5
-0.8
2.4

But probabilities should be:

0 ≤ Probability ≤ 1

So Linear Regression is not suitable: its outputs are unbounded and cannot be read as probabilities.

Logistic Regression uses the Sigmoid Function to convert outputs into probabilities.
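To make the problem with Linear Regression concrete, here is a minimal sketch that fits an ordinary least-squares line to 0/1 labels. The five-student data set and the helper name fit_line are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: study hours and pass (1) / fail (0) labels.
hours = [1, 2, 3, 4, 5]
passed = [0, 0, 0, 1, 1]

intercept, slope = fit_line(hours, passed)

# The straight line is unbounded, so some "probabilities" leave [0, 1].
for h in (0, 3, 8):
    print(h, round(intercept + slope * h, 2))
```

For 0 study hours the line gives a negative value and for 8 hours a value above 1, neither of which can be a probability.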

Sigmoid Function

The sigmoid function converts any real value into a value strictly between:

0 and 1

The logistic (sigmoid) function is central to Logistic Regression:

σ(z) = 1 / (1 + e^(-z))

Where:

  • σ(z) = Probability
  • e = Euler's number (≈ 2.718)
  • z = Linear combination of the input features

Linear part:

z = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ

Combined model:

P(Y=1) = 1 / (1 + e^(-(θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ)))

Understanding Sigmoid Output

z      Sigmoid Output
-5     0.007
-2     0.12
0      0.50
2      0.88
5      0.99

Interpretation:

Probability near 0 → Class 0

Probability near 1 → Class 1
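The sigmoid values above can be reproduced with a few lines of Python. This is a sketch using only the standard library; splitting the computation into two branches is an implementation choice (not something the text prescribes) that avoids overflow for very negative z.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula so exp() never receives a large argument.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


for z in (-5, -2, 0, 2, 5):
    print(z, round(sigmoid(z), 3))
```

Large negative inputs give values near 0, large positive inputs give values near 1, and z = 0 gives exactly 0.5.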

Working of Logistic Regression

Step 1: Collect Data

Example:

Study Hours    Pass
1              0
2              0
3              0
4              1
5              1

Where:

0 → Fail

1 → Pass

Step 2: Preprocess Data

Tasks:

  • Handle missing values
  • Remove duplicates
  • Scale numerical values
  • Encode categorical variables
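Three of the preprocessing tasks above can be sketched in plain Python. The function names, the choice of mean imputation, and the sample values are all assumptions for illustration; real projects often rely on a library for these steps.

```python
import statistics


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Encode categories as 0/1 columns, one column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


hours = fill_missing_with_mean([1, 2, None, 4, 5])
scaled_hours = standardize(hours)
encoded, columns = one_hot(["online", "campus", "online"])
```

In practice the mean and standard deviation used for scaling are usually computed on the training split only and then reused on the test split, so that no information leaks from the test data.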

Step 3: Split Dataset

Typical split:

Training = 80%

Testing = 20%
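An 80/20 split can be sketched as a shuffle followed by a slice. The function name, the fixed seed, and the ten placeholder rows are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train_rows, test_rows)."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(rows)          # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))             # placeholder for ten observations
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before slicing matters when the data is stored in a meaningful order, for example sorted by class or by date.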

Step 4: Train Model

The algorithm learns the coefficients that best fit the training data:

Input Features
        ↓
Find coefficients
        ↓
Calculate probability

Step 5: Predict Probability

Example:

Study Hours=4

Model predicts:

P(pass)=0.85
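As a sketch of this step, the function below turns a feature vector and a set of coefficients into a probability. The coefficient values are hypothetical, picked only so that 4 study hours gives roughly the 0.85 used in the example; they are not the result of fitting any real data.

```python
import math


def predict_proba(features, theta):
    """Return P(Y=1) for one observation.

    theta[0] is the intercept; theta[1:] pair up with the features.
    """
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


theta = [-6.27, 2.0]               # hypothetical intercept and coefficient
p_pass = predict_proba([4], theta)  # one feature: study hours = 4
print(round(p_pass, 2))
```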

Step 6: Apply Threshold

Usually:

Threshold=0.5

Decision:

If P>0.5
       ↓
Class=1

Else
       ↓
Class=0
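The decision rule above is a one-line comparison. In this sketch the function name classify is an assumption, and the threshold is kept as a parameter because 0.5 is only the usual default.

```python
def classify(probability, threshold=0.5):
    """Return class 1 if the probability exceeds the threshold, else class 0."""
    return 1 if probability > threshold else 0


print(classify(0.85))                  # above the default threshold
print(classify(0.30))                  # below the default threshold
print(classify(0.85, threshold=0.9))   # a stricter threshold changes the outcome
```

Raising the threshold makes the model more reluctant to predict class 1, which can be useful when false positives are costly.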

Decision Boundary

The threshold creates a decision boundary: the set of inputs for which the predicted probability equals the threshold.

Example:

Probability

1|              *****
 |           ****
0.5--------------------
 |      ****
 |  ****
0|_____________________

Above 0.5 → Class 1

Below 0.5 → Class 0
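For a model with a single feature, the boundary is the feature value at which the predicted probability equals the threshold. Solving sigmoid(θ₀ + θ₁x) = threshold for x gives x = (log(threshold / (1 - threshold)) - θ₀) / θ₁. The sketch below applies this with hypothetical coefficients; the function name is also an assumption.

```python
import math


def boundary_value(theta0, theta1, threshold=0.5):
    """Feature value x at which sigmoid(theta0 + theta1 * x) equals threshold."""
    if theta1 == 0:
        raise ValueError("theta1 must be non-zero for a boundary to exist")
    logit = math.log(threshold / (1.0 - threshold))  # 0.0 when threshold is 0.5
    return (logit - theta0) / theta1


theta0, theta1 = -6.27, 2.0        # hypothetical coefficients, not fitted values
hours_at_boundary = boundary_value(theta0, theta1)
print(round(hours_at_boundary, 3))
```

With these assumed coefficients, students studying more than roughly 3.1 hours fall on the Class 1 side of the boundary.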

Types of Logistic Regression

1. Binary Logistic Regression

Two classes only.

Examples:

  • Spam/Not Spam
  • Pass/Fail
  • Fraud/Not Fraud

2. Multinomial Logistic Regression

More than two classes.

Examples:

  • Predicting an animal type: Cat, Dog, or Bird
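With more than two classes, a common formulation (not spelled out in the text above) replaces the sigmoid with the softmax function, which turns one score per class into probabilities that sum to 1. The scores below are hypothetical.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    shift = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["Cat", "Dog", "Bird"]
scores = [2.0, 1.0, 0.1]                      # hypothetical linear scores, one per class
probs = softmax(scores)
predicted = classes[probs.index(max(probs))]  # class with the highest probability
print(predicted, [round(p, 3) for p in probs])
```

The predicted class is simply the one with the largest probability, so no single threshold is needed.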

3. Ordinal Logistic Regression

Classes have order.

Examples:

  • Predicting a rating: Poor < Average < Good < Excellent

Cost Function in Logistic Regression

Linear Regression uses MSE.

Logistic Regression uses Log Loss (Cross Entropy Loss).

J(θ) = -(1/m) Σᵢ [ yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ) ],  i = 1, ..., m

Where:

  • y = Actual value
  • ŷ = Predicted probability
  • m = Total observations

Goal:

Minimize Log Loss
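A direct translation of Log Loss into Python could look like the sketch below. Clipping the probabilities with a small eps is an added safeguard (an assumption, not part of the definition) so that log(0) is never evaluated.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions give a loss close to 0, while confident wrong predictions are penalized heavily, which is what pushes the model toward well-calibrated probabilities.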

Gradient Descent

Parameters are updated using:

θ := θ - α · ∂J(θ)/∂θ

Where:

  • θ = Parameter
  • α = Learning rate
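The update rule can be sketched for a one-feature model as follows. The code uses the standard gradient of Log Loss, ∂J/∂θⱼ = (1/m) Σᵢ (ŷᵢ - yᵢ) xᵢⱼ, which is not derived in the text above. The data set, learning rate, and number of iterations are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, alpha=0.3, epochs=20000):
    """Fit theta0 (intercept) and theta1 (slope) by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Prediction error (y_hat - y) for every observation.
        errors = [sigmoid(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # theta := theta - alpha * gradient
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical data: study hours and pass (1) / fail (0) labels.
hours = [1, 2, 3, 4, 5]
passed = [0, 0, 0, 1, 1]

theta0, theta1 = train(hours, passed)
print(round(theta0, 2), round(theta1, 2))
```

Because this small data set is perfectly separable, the coefficients keep growing the longer training runs. Libraries typically add regularization to keep them bounded.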

Performance Metrics for Logistic Regression

Since Logistic Regression is a classification algorithm, regression metrics such as RMSE and MAE are not used.

1. Accuracy

Measures the proportion of all observations that were predicted correctly.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative

2. Precision

Measures the proportion of predicted positives that are actually positive.

Precision = TP / (TP + FP)

Interpretation:

Out of predicted positive cases,
how many were actually positive?

3. Recall

Measures the proportion of actual positives that were correctly identified.

Recall = TP / (TP + FN)

Interpretation:

Out of actual positive cases,
how many were correctly found?

4. F1 Score

Harmonic mean of Precision and Recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
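Accuracy, Precision, Recall, and F1 can all be computed from the four counts TP, TN, FP, and FN. The sketch below counts them from lists of actual and predicted labels. The labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 as a dictionary."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical actual and predicted labels for eight observations.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, tn, fp, fn)
print(tp, tn, fp, fn, metrics)
```

In this example the model finds three of the four actual positives (Recall 0.75), but two of its five positive predictions are wrong (Precision 0.6).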

5. Confusion Matrix


                   Predicted Positive    Predicted Negative
Actual Positive    TP                    FN
Actual Negative    FP                    TN

6. ROC Curve and AUC

ROC:

True Positive Rate vs False Positive Rate, plotted across all classification thresholds

AUC interpretation:

1.0 → Perfect model

0.5 → Random prediction
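AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair and counting ties as half. This pairwise approach is chosen for clarity rather than speed, and the scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(pos) * len(neg))


perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
uninformative = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
partial = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(perfect, uninformative, partial)
```

Unlike Accuracy, this measure does not depend on any particular threshold, because it looks only at how the scores rank the observations.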

Assumptions of Logistic Regression

  1. Binary or categorical target variable
  2. Independent observations
  3. Linear relationship between features and log-odds
  4. Little or no multicollinearity among the features
  5. Large sample size preferred

Advantages

  1. Simple and easy to implement
  2. Fast training
  3. Outputs probabilities
  4. Low computational cost
  5. Works well when the classes are approximately linearly separable

Disadvantages

  1. Cannot model highly complex relationships
  2. Sensitive to outliers
  3. Assumes linearity in log-odds
  4. Performance decreases with overlapping classes

Real-world Applications

  • Email spam detection
  • Medical diagnosis
  • Credit card fraud detection
  • Customer churn prediction
  • Loan approval systems
  • Sentiment analysis

Complete Workflow

Collect Data
      ↓
Preprocess Data
      ↓
Split Data
      ↓
Train Logistic Regression Model
      ↓
Predict Probability
      ↓
Apply Threshold
      ↓
Evaluate Performance
      ↓
Optimize Model

One-line summary

Logistic Regression is a supervised classification algorithm that predicts probabilities using the sigmoid function and classifies observations into categories.