Random Forest (Detailed Version)

Definition

Random Forest is a supervised machine learning algorithm that combines the predictions of many Decision Trees, which usually gives more accurate and more stable results than any single tree.

It is called Random Forest because:

  • Random → Uses random subsets of data and features
  • Forest → Collection of many decision trees

The main idea is:

Instead of depending on one decision tree, use multiple trees and combine their results to improve accuracy and reduce overfitting.

Example use cases:

  • Fraud detection
  • Disease prediction
  • Customer churn prediction
  • Loan approval
  • House price prediction

Why Random Forest?

Decision Trees have a major problem:

Single Decision Tree
        ↓
Can easily overfit
        ↓
Poor generalization

Random Forest solves this by:

Many Trees
      ↓
Combine predictions
      ↓
More accurate results

Basic Working of Random Forest

Suppose we have a dataset:

Study Hours   Attendance   Pass
2             60           No
3             65           No
6             85           Yes
8             90           Yes

Random Forest creates many trees:

                 Dataset
                     |
     -----------------------------------
     |             |            |      |
   Tree 1       Tree 2      Tree 3   Tree 4
     |             |            |      |
   Pass         Pass         Fail   Pass
     |             |            |      |
     -----------------------------------
                     |
              Majority Voting
                     |
                  Pass

Final prediction:

Pass

Working of Random Forest (Step by Step)

Step 1: Create multiple random samples

Random Forest creates several datasets from the original dataset.

Method used:

Bootstrap Sampling

Example:

Original dataset:

A B C D E F

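Random samples (drawn with replacement, same size as the original, so some records repeat and others are left out):

Sample 1: A C C D F F
Sample 2: B B C E E F
Sample 3: A A B D D E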

Step 2: Build multiple decision trees

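Each tree is trained on:

  • Different records (its own bootstrap sample)
  • Different features (a random subset, which the standard algorithm draws again at every split)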

Example:

Tree 1:

Uses:
Study Hours
Attendance

Tree 2:

Uses:
Age
Attendance

Tree 3:

Uses:
Study Hours
Marks

Step 3: Train trees independently

Each tree creates its own decision rules.

Example:

Tree 1:

Study Hours >5
      ↓
Pass

Tree 2:

Attendance >70
      ↓
Pass
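
The two rules above can be written as small functions and applied to the example dataset. This is only a sketch: the functions are hand-written stand-ins for trained trees, the names are made up for illustration, and the "Fail" branch is an assumption because the diagram only shows the "Pass" side.

```python
# Hypothetical hand-written rules standing in for two trained trees.
def tree_1(student):
    # Rule from the notes: Study Hours > 5 -> Pass (the Fail branch is assumed)
    return "Pass" if student["study_hours"] > 5 else "Fail"

def tree_2(student):
    # Rule from the notes: Attendance > 70 -> Pass (the Fail branch is assumed)
    return "Pass" if student["attendance"] > 70 else "Fail"

# The four students from the example dataset.
students = [
    {"study_hours": 2, "attendance": 60},
    {"study_hours": 3, "attendance": 65},
    {"study_hours": 6, "attendance": 85},
    {"study_hours": 8, "attendance": 90},
]

# Each tree predicts independently of the other.
predictions = [(tree_1(s), tree_2(s)) for s in students]
for student, prediction in zip(students, predictions):
    print(student, prediction)
```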

Step 4: Combine predictions

Classification:

Uses Majority Voting

Example:

Tree1 → Pass
Tree2 → Pass
Tree3 → Fail
Tree4 → Pass
Tree5 → Pass

Final prediction:

Pass
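
Majority voting can be sketched in a few lines. The function name majority_vote is made up for illustration, and the votes are the five from the example above.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that the largest number of trees voted for."""
    label, _count = Counter(votes).most_common(1)[0]
    return label

tree_votes = ["Pass", "Pass", "Fail", "Pass", "Pass"]
final_prediction = majority_vote(tree_votes)
print(final_prediction)  # Pass (4 votes against 1)
```

With an even number of trees a tie is possible, so a real implementation needs a tie-breaking rule. This sketch simply returns whichever tied label appeared first.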

Regression:

Uses Average Prediction

Example:

Tree1 → 50000
Tree2 → 55000
Tree3 → 52000

Prediction:

(50000 + 55000 + 52000) / 3

= 157000 / 3

≈ 52333
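
The same averaging, as a quick check in code. The variable names are made up for illustration.

```python
from statistics import mean

# Predictions of the three trees from the example.
tree_predictions = [50000, 55000, 52000]

forest_prediction = mean(tree_predictions)
print(round(forest_prediction, 2))  # 52333.33
```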

Why Random Forest performs better

Single Tree:

High variance
Sensitive to data changes

Random Forest:

Average of many trees
Less variance
More stable
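
A small simulation can illustrate why averaging lowers variance. It is a simplified sketch, not a real forest: each "tree" is modelled as the true value plus independent random noise, and all constants are made up. Real trees are correlated with each other, so the reduction in practice is smaller than what this shows.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 50.0   # hypothetical quantity every "tree" tries to predict
NOISE_STD = 10.0    # hypothetical error spread of a single tree
N_TREES = 25
N_TRIALS = 2000

def single_tree_prediction():
    # Assumption: each tree's error is independent Gaussian noise.
    return rng.gauss(TRUE_VALUE, NOISE_STD)

def forest_prediction(n_trees):
    # The "forest" averages the predictions of n_trees noisy trees.
    return mean([single_tree_prediction() for _ in range(n_trees)])

single_variance = pvariance([single_tree_prediction() for _ in range(N_TRIALS)])
forest_variance = pvariance([forest_prediction(N_TREES) for _ in range(N_TRIALS)])

print(f"single tree variance: {single_variance:.1f}")
print(f"forest variance:      {forest_variance:.1f}")
```

Under the independence assumption, the variance of the average is about the single-tree variance divided by the number of trees.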

Important Concepts in Random Forest

1. Bootstrap Sampling

Random sampling with replacement.

Example:

Original:

A B C D E

Possible sample:

A B B D E

Notice:

B repeated
C missing
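
A minimal sketch of bootstrap sampling with the standard library. The function name bootstrap_sample is made up for illustration, and the exact sample depends on the random seed.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) items at random, with replacement."""
    return rng.choices(records, k=len(records))

rng = random.Random(42)  # fixed seed so the run is repeatable
original = ["A", "B", "C", "D", "E"]

sample = bootstrap_sample(original, rng)
# Records that were never drawn for this sample.
missing = [record for record in original if record not in sample]

print("sample: ", sample)
print("missing:", missing)
```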

2. Feature Randomness

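Random Forest considers only a random subset of the features, drawn again at each split (see max_features), so different trees end up relying on different features.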

Suppose:

Dataset features:

Age
Salary
Experience
Attendance
Marks

Tree 1 may use:

Age
Salary

Tree 2:

Experience
Marks

Purpose:

Increase diversity among trees
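
Drawing a random feature subset can be sketched as follows. The function name candidate_features is made up for illustration, and the default subset size (the square root of the number of features, rounded down) is an assumption here, chosen because it is a common default for classification.

```python
import math
import random

features = ["Age", "Salary", "Experience", "Attendance", "Marks"]

def candidate_features(all_features, rng, k=None):
    """Pick k distinct features at random; default k = floor(sqrt(n))."""
    if k is None:
        k = max(1, math.isqrt(len(all_features)))
    return rng.sample(all_features, k)

rng = random.Random(0)  # fixed seed so the run is repeatable
subsets = [candidate_features(features, rng) for _ in range(3)]
for i, subset in enumerate(subsets, start=1):
    print(f"Split {i} considers: {subset}")
```

With 5 features the default gives 2 features per draw, which matches the size of the subsets in the example above.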

Hyperparameters in Random Forest

1. Number of Trees

n_estimators

Example:

n_estimators=100

Meaning:

Build 100 trees

2. Maximum Depth

max_depth

Purpose:

Limits how deep each tree can grow, which controls tree size and overfitting

3. Minimum Samples Split

min_samples_split

Purpose:

Minimum number of samples a node must contain before it can be split

4. Maximum Features

max_features

Purpose:

Number of features selected at each split
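
The four hyperparameters can be set together when creating the model. This sketch assumes scikit-learn is the library meant, since the parameter names match its RandomForestClassifier. The values for max_depth, min_samples_split and random_state are arbitrary choices for illustration, and the data is the small Study Hours / Attendance table from earlier.

```python
from sklearn.ensemble import RandomForestClassifier

# Columns: Study Hours, Attendance (the example dataset from these notes).
X = [[2, 60], [3, 65], [6, 85], [8, 90]]
y = ["No", "No", "Yes", "Yes"]

model = RandomForestClassifier(
    n_estimators=100,      # build 100 trees
    max_depth=3,           # no tree grows deeper than 3 levels
    min_samples_split=2,   # a node needs at least 2 samples to be split
    max_features="sqrt",   # consider sqrt(n_features) features at each split
    random_state=0,        # fixed seed so the run is repeatable
)
model.fit(X, y)

print(len(model.estimators_))      # number of trees actually built
print(model.predict([[7, 88]]))    # prediction for a new student
```

Four rows are far too few for a real model. The dataset is only there to make the example runnable.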

Performance Metrics for Classification Random Forest

1. Accuracy

Accuracy=\frac{TP+TN}{TP+TN+FP+FN}

Measures the fraction of all predictions that are correct.

2. Precision

Precision=\frac{TP}{TP+FP}

Measures how many of the predicted positives are actually positive.

3. Recall

Recall=\frac{TP}{TP+FN}

Measures how many of the actual positives the model correctly identifies.

4. F1 Score

F1=2\times\frac{Precision\times Recall}{Precision+Recall}
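
The four formulas can be checked on a small example. The labels below are made up for illustration, and the function names are hypothetical. The sketch first counts TP, FP, FN and TN, then applies the formulas as written above.

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

# Made-up labels: 8 cases, 4 of them actually positive.
y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "Yes", "No", "No", "No", "Yes", "Yes", "Yes"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)  # 3 2 1 2
print(accuracy, precision, recall, round(f1, 4))  # 0.625 0.6 0.75 0.6667
```

The sketch does not guard against division by zero, which happens when the model predicts no positives or the data contains none.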

5. Confusion Matrix


                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN

6. ROC-AUC Score

Measures how well the model ranks positive cases above negative ones, across all classification thresholds.

Interpretation:

1 → Perfect model

0.5 → Random model
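
One way to understand the score: ROC-AUC equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition. The function name and the example scores are made up for illustration, and this pairwise loop is only practical for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]

perfect = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])       # every positive outranks every negative
uninformed = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])    # same score for everyone
mixed = roc_auc(y_true, [0.1, 0.4, 0.35, 0.8])        # one pair ranked the wrong way

print(perfect, uninformed, mixed)  # 1.0 0.5 0.75
```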

Performance Metrics for Regression Random Forest

MAE

MAE=\frac{1}{n}\sum|y_i-\hat y_i|

MSE

MSE=\frac{1}{n}\sum(y_i-\hat y_i)^2

RMSE

RMSE=\sqrt{\frac{1}{n}\sum(y_i-\hat y_i)^2}

R² Score

R^2=1-\frac{SS_{res}}{SS_{tot}}
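
The four regression metrics can be computed directly from their formulas. The actual and predicted values below are made up for illustration, and the function name is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                       # residual sum of squares
    ss_tot = sum((actual - mean_y) ** 2 for actual in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Made-up house prices: actual values and a model's predictions.
y_true = [50000, 55000, 52000, 60000]
y_pred = [51000, 54000, 52000, 58000]

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, round(rmse, 2), round(r2, 4))  # 1000.0 1500000.0 1224.74 0.8943
```

RMSE is in the same units as the target, which usually makes it easier to interpret than MSE.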

Feature Importance

Random Forest can identify which features contribute most.

Example:

Feature        Importance
Study Hours    0.45
Attendance     0.30
Age            0.15
Gender         0.10

Interpretation:

Study Hours contributes most
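
A sketch of reading feature importances from a trained model, assuming scikit-learn. The data is synthetic and made up for illustration: the label depends only on Study Hours and Attendance, while Age and Gender are pure noise. The printed numbers will therefore not match the table above.

```python
import random
from sklearn.ensemble import RandomForestClassifier

rng = random.Random(0)  # fixed seed so the synthetic data is repeatable
feature_names = ["Study Hours", "Attendance", "Age", "Gender"]

# Synthetic students: only study hours and attendance decide the label.
X, y = [], []
for _ in range(300):
    study_hours = rng.uniform(0, 10)
    attendance = rng.uniform(40, 100)
    age = rng.randint(18, 25)
    gender = rng.randint(0, 1)
    X.append([study_hours, attendance, age, gender])
    y.append("Yes" if study_hours > 5 and attendance > 60 else "No")

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one score per feature; the scores sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {score:.2f}")
```

These are impurity-based importances. They show what the model relied on, not that a feature causes the outcome.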

Advantages

  1. Usually more accurate than a single Decision Tree
  2. Reduces overfitting compared to Decision Tree
  3. Can handle missing values in some implementations (others need the gaps filled in first)
  4. Works for classification and regression
  5. Handles large datasets effectively
  6. Provides feature importance

Disadvantages

  1. Slower than Decision Tree
  2. Requires more memory
  3. Less interpretable than a single Decision Tree
  4. Large forests can become computationally expensive

Real-world Applications

  • Fraud detection
  • Medical diagnosis
  • Recommendation systems
  • Customer churn prediction
  • Stock prediction
  • Credit risk analysis

Complete Workflow

Collect Data
      ↓
Create Bootstrap Samples
      ↓
Build Multiple Decision Trees
      ↓
Train Trees Independently
      ↓
Combine Predictions
      ↓
Predict Output
      ↓
Evaluate Performance
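
The whole workflow can be sketched from scratch in plain Python. This is a heavily simplified illustration, not a production implementation: every "tree" is a one-split stump on a single randomly chosen feature, all function names are made up, and the data is the small Study Hours / Attendance table from earlier.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    # Sample rows with replacement, same size as the original.
    return rng.choices(rows, k=len(rows))

def fit_stump(rows, feature_index):
    """Pick the threshold with the fewest errors for the rule
    'value > threshold -> Yes, otherwise No' on one feature."""
    values = sorted({features[feature_index] for features, _ in rows})
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    # -inf means "always Yes", +inf means "always No".
    thresholds = [float("-inf")] + midpoints + [float("inf")]
    best_threshold, best_errors = None, None
    for t in thresholds:
        errors = sum(
            ("Yes" if features[feature_index] > t else "No") != label
            for features, label in rows
        )
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return feature_index, best_threshold

def predict_stump(stump, features):
    feature_index, threshold = stump
    return "Yes" if features[feature_index] > threshold else "No"

def fit_forest(rows, n_trees, rng):
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap(rows, rng)              # random records
        feature_index = rng.randrange(n_features)  # random feature
        forest.append(fit_stump(sample, feature_index))
    return forest

def predict_forest(forest, features):
    # Majority vote over all stumps.
    votes = Counter(predict_stump(stump, features) for stump in forest)
    return votes.most_common(1)[0][0]

# Collect data: (Study Hours, Attendance) -> Pass
data = [((2, 60), "No"), ((3, 65), "No"), ((6, 85), "Yes"), ((8, 90), "Yes")]

rng = random.Random(0)  # fixed seed so the run is repeatable
forest = fit_forest(data, n_trees=25, rng=rng)

# Evaluate performance (on the training data, only to keep the sketch short).
predictions = [predict_forest(forest, features) for features, _ in data]
accuracy = sum(p == label for p, (_, label) in zip(predictions, data)) / len(data)
print(predictions, accuracy)
print(predict_forest(forest, (7, 88)))
```

Accuracy measured on the training data is optimistic. A real evaluation would use records that the forest never saw during training.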

One-line summary

Random Forest is a supervised ensemble learning algorithm that combines multiple decision trees and uses voting or averaging to improve prediction accuracy and reduce overfitting.