Machine Learning — Random Forest
Random Forest (Detailed Version)
Definition
Random Forest is a supervised machine learning algorithm that combines the predictions of multiple decision trees, which usually gives better results than a single tree.
It is called Random Forest because:
- Random → Uses random subsets of data and features
- Forest → Collection of many decision trees
The main idea is:
Instead of depending on one decision tree, use multiple trees and combine their results to improve accuracy and reduce overfitting.
Examples:
- Fraud detection
- Disease prediction
- Customer churn prediction
- Loan approval
- House price prediction
Why Random Forest?
Decision Trees have a major problem:
Single Decision Tree
↓
Can easily overfit
↓
Poor generalization
Random Forest solves this by:
Many Trees
↓
Combine predictions
↓
More accurate results
Basic Working of Random Forest
Suppose we have a dataset:
Study Hours | Attendance | Pass
2           | 60         | No
3           | 65         | No
6           | 85         | Yes
8           | 90         | Yes
Random Forest creates many trees:
                Dataset
                   |
   ----------------------------------
   |          |          |          |
 Tree 1     Tree 2     Tree 3     Tree 4
   |          |          |          |
  Pass       Pass       Fail       Pass
   |          |          |          |
   ----------------------------------
                   |
            Majority Voting
                   |
                  Pass
Final prediction:
Pass
Working of Random Forest (Step by Step)
Step 1: Create multiple random samples
Random Forest creates several datasets from the original dataset.
Method used:
Bootstrap Sampling
Example:
Original dataset:
A B C D E F
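Random samples (each drawn with replacement and the same size as the original, so a record can appear more than once or not at all):
Sample 1: A C C D F F
Sample 2: B B C E E F
Sample 3: A A B D E F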
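The sampling step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper name `bootstrap_sample` and the fixed seed are assumptions made for the example, not part of any particular Random Forest library.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) items uniformly at random, with replacement."""
    return rng.choices(records, k=len(records))

rng = random.Random(42)  # fixed seed so the run is repeatable
dataset = ["A", "B", "C", "D", "E", "F"]

# One bootstrap sample per tree; three trees here for illustration.
samples = [bootstrap_sample(dataset, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"Sample {i}:", " ".join(sample))
```

Each sample has the same length as the original dataset, and because items are drawn with replacement, some records usually repeat while others are left out.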
Step 2: Build multiple decision trees
Each tree receives:
- Different records (its own bootstrap sample)
- Different features (only a random subset of features is considered when choosing splits)
Example:
Tree 1:
Uses: Study Hours, Attendance
Tree 2:
Uses: Age, Attendance
Tree 3:
Uses: Study Hours, Marks
Step 3: Train trees independently
Each tree creates its own decision rules.
Example:
Tree 1:
Study Hours > 5
↓
Pass
Tree 2:
Attendance > 70
↓
Pass
Step 4: Combine predictions
Classification:
Uses Majority Voting
Example:
Tree 1 → Pass
Tree 2 → Pass
Tree 3 → Fail
Tree 4 → Pass
Tree 5 → Pass
Final prediction:
Pass
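Majority voting amounts to counting labels and taking the most frequent one. The sketch below uses the five tree outputs from the example; the function name `majority_vote` is illustrative.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the trees' predictions.

    If two labels tie, Counter returns the one that appeared first.
    """
    label, _count = Counter(predictions).most_common(1)[0]
    return label

tree_predictions = ["Pass", "Pass", "Fail", "Pass", "Pass"]
final_prediction = majority_vote(tree_predictions)
print(final_prediction)  # Pass
```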
Regression:
Uses Average Prediction
Example:
Tree 1 → 50000
Tree 2 → 55000
Tree 3 → 52000
Prediction:
(50000 + 55000 + 52000) / 3 ≈ 52333
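The same averaging, written as a short Python sketch with the three tree outputs from the example:

```python
from statistics import mean

tree_predictions = [50000, 55000, 52000]

# The forest's regression output is the mean of the individual tree outputs.
final_prediction = mean(tree_predictions)
print(round(final_prediction))  # 52333
```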
Why Random Forest performs better
Single Tree:
- High variance
- Sensitive to data changes
Random Forest:
- Average of many trees
- Less variance
- More stable
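A small simulation can make the variance argument concrete. It is a simplified sketch, not a real forest: each "tree" is modelled as the true value plus independent random noise, and all constants are hypothetical. Real trees trained on overlapping data are correlated, so the reduction in practice is smaller than what this idealised setup shows.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
TRUE_VALUE = 50.0   # hypothetical quantity every "tree" tries to predict
NOISE_STD = 10.0    # hypothetical spread of a single tree's error
N_TREES = 100
N_TRIALS = 2000

def noisy_tree_prediction():
    # Stand-in for one tree: the true value plus independent noise.
    return rng.gauss(TRUE_VALUE, NOISE_STD)

# Predictions from one tree at a time vs. the average of N_TREES trees.
single = [noisy_tree_prediction() for _ in range(N_TRIALS)]
forest = [mean(noisy_tree_prediction() for _ in range(N_TREES))
          for _ in range(N_TRIALS)]

var_single = pvariance(single)
var_forest = pvariance(forest)
print(f"single tree variance: {var_single:.2f}")
print(f"forest variance:      {var_forest:.2f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of trees.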
Important Concepts in Random Forest
1. Bootstrap Sampling
Random sampling with replacement.
Example:
Original:
A B C D E
Possible sample:
A B B D E
Notice:
- B is repeated
- C is missing
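How many records end up missing from a bootstrap sample can be checked empirically. For a dataset of n records, the expected share of distinct records in one sample is 1 - (1 - 1/n)^n, which approaches about 0.632 as n grows. The sketch below compares that value with one random draw; the dataset size and seed are arbitrary choices for the example.

```python
import random

rng = random.Random(1)
n = 1000
records = list(range(n))

sample = rng.choices(records, k=n)       # bootstrap: sampling with replacement
unique_fraction = len(set(sample)) / n   # share of distinct records drawn
expected = 1 - (1 - 1 / n) ** n          # theoretical expectation, about 0.632

print(round(unique_fraction, 3), round(expected, 3))
```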
2. Feature Randomness
Random Forest considers only a random subset of features when building each tree. In most implementations a new subset is drawn at every split.
Suppose:
Dataset features:
Age, Salary, Experience, Attendance, Marks
Tree 1 may use:
Age, Salary
Tree 2:
Experience, Marks
Purpose:
Increase diversity among trees
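Feature randomness can be sketched as drawing a few feature names without replacement. The example assumes the common convention of re-drawing the subset at each split and using roughly the square root of the number of features for classification; the helper name `candidate_features` is illustrative.

```python
import math
import random

rng = random.Random(7)
features = ["Age", "Salary", "Experience", "Attendance", "Marks"]

# Assumed subset size: about sqrt(number of features), at least 1.
k = max(1, int(math.sqrt(len(features))))  # 2 for five features

def candidate_features():
    """Pick k distinct features to consider for one split."""
    return rng.sample(features, k)

for split in range(1, 4):
    print(f"Split {split} considers:", candidate_features())
```

Because each split sees a different subset, no single strong feature can dominate every tree, which is what makes the trees differ from one another.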
Hyperparameters in Random Forest
1. Number of Trees
n_estimators
Example:
n_estimators=100
Meaning:
Build 100 trees
2. Maximum Depth
max_depth
Purpose:
Limits how deep each tree can grow; shallower trees are less likely to overfit
3. Minimum Samples Split
min_samples_split
Purpose:
Minimum number of samples a node must contain before it can be split
4. Maximum Features
max_features
Purpose:
Number of features selected at each split
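The four hyperparameters above match the parameter names of scikit-learn's `RandomForestClassifier`, so the sketch below assumes that library is installed. The dataset is synthetic and the specific values (depth 5, split size 4) are arbitrary illustrations, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(
    n_estimators=100,      # build 100 trees
    max_depth=5,           # limit how deep each tree can grow
    min_samples_split=4,   # a node needs at least 4 samples to be split
    max_features="sqrt",   # features considered at each split
    random_state=0,        # repeatable results
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # mean accuracy on held-out data
print(f"Test accuracy: {accuracy:.2f}")
```

In practice these values are usually tuned with cross-validation rather than set by hand.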
Performance Metrics for Classification Random Forest
1. Accuracy
Accuracy=\frac{TP+TN}{TP+TN+FP+FN}
Measures the fraction of all predictions that are correct.
2. Precision
Precision=\frac{TP}{TP+FP}
Measures how many of the predicted positives are actually positive.
3. Recall
Recall=\frac{TP}{TP+FN}
Measures how many of the actual positives the model identifies.
4. F1 Score
F1=2\times\frac{Precision\times Recall}{Precision+Recall}
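The four formulas can be computed directly from the TP, TN, FP, and FN counts. The sketch below first derives those counts from true and predicted labels; the labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred, positive="Pass"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for ten students.
y_true = ["Pass"] * 4 + ["Fail"] * 6
y_pred = ["Pass", "Pass", "Pass", "Fail", "Pass"] + ["Fail"] * 5

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)   # (3, 5, 1, 1)
print(metrics)
```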
5. Confusion Matrix
                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN
6. ROC-AUC Score
Measures how well the model ranks positive examples above negative ones across all classification thresholds.
Interpretation:
- 1 → Perfect model
- 0.5 → Random model
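One way to see where the values 1 and 0.5 come from is to compute AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise definition on hypothetical scores; it is meant for illustration and is too slow for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]

# Every positive is scored above every negative.
perfect = roc_auc(y_true, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
# The same score for everyone carries no information.
uninformative = roc_auc(y_true, [0.5] * 6)

print(perfect, uninformative)  # 1.0 0.5
```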
Performance Metrics for Regression Random Forest
MAE
MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat y_i|
MSE
MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2
RMSE
RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2}
R² Score
R^2=1-\frac{SS_{res}}{SS_{tot}}
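The regression metrics translate directly into code. In the sketch below, SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean; the house prices are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical true and predicted house prices.
y_true = [50000, 55000, 52000, 60000]
y_pred = [51000, 54000, 52000, 58000]

results = regression_metrics(y_true, y_pred)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Note that R² is undefined when all true values are identical, because SS_tot is then zero; this sketch does not guard against that case.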
Feature Importance
Random Forest can identify which features contribute most.
Example:
Feature     | Importance
Study Hours | 0.45
Attendance  | 0.30
Age         | 0.15
Gender      | 0.10
Interpretation:
Study Hours contributes most
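With scikit-learn (assumed to be installed), a fitted forest exposes these scores through its `feature_importances_` attribute, which sums to 1. The data below is synthetic and hypothetical: the label depends only on Study Hours and Attendance, so those two features should rank highest. The resulting numbers will not match the illustrative table above.

```python
import random
from sklearn.ensemble import RandomForestClassifier

rng = random.Random(0)
feature_names = ["Study Hours", "Attendance", "Age", "Gender"]

# Hypothetical students: Age and Gender are pure noise by construction.
X, y = [], []
for _ in range(400):
    hours = rng.uniform(0, 10)
    attendance = rng.uniform(50, 100)
    age = rng.randint(18, 25)
    gender = rng.randint(0, 1)
    X.append([hours, attendance, age, gender])
    y.append(int(hours > 5 and attendance > 70))  # 1 = Pass, 0 = Fail

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.2f}")
```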
Advantages
- Often achieves high accuracy with little tuning
- Reduces overfitting compared to Decision Tree
- Can tolerate missing values in some implementations (others require imputation first)
- Works for classification and regression
- Handles large datasets effectively
- Provides feature importance
Disadvantages
- Slower than Decision Tree
- Requires more memory
- Less interpretable than a single Decision Tree
- Large forests can become computationally expensive
Real-world Applications
- Fraud detection
- Medical diagnosis
- Recommendation systems
- Customer churn prediction
- Stock prediction
- Credit risk analysis
Complete Workflow
Collect Data
↓
Create Bootstrap Samples
↓
Build Multiple Decision Trees
↓
Train Trees Independently
↓
Combine Predictions
↓
Predict Output
↓
Evaluate Performance
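The workflow can be sketched end to end with the standard library alone. To stay short, this example makes several simplifying assumptions: each "tree" is a one-split decision stump, one random feature is chosen per stump rather than per split, and accuracy is measured on the training data. The first four rows mirror the Study Hours / Attendance table from earlier (with "No" written as "Fail"); the remaining rows and all function names are hypothetical.

```python
import random
from collections import Counter

# Toy data: (study_hours, attendance) -> label.
DATA = [((2, 60), "Fail"), ((3, 65), "Fail"), ((6, 85), "Pass"), ((8, 90), "Pass"),
        ((1, 55), "Fail"), ((7, 80), "Pass"), ((4, 70), "Fail"), ((9, 95), "Pass")]

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(rows, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    overall = majority([lab for _, lab in rows], "Fail")
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = majority([lab for x, lab in rows if x[feature] <= threshold], overall)
        right = majority([lab for x, lab in rows if x[feature] > threshold], overall)
        errors = sum((left if x[feature] <= threshold else right) != lab
                     for x, lab in rows)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left, right)
    return best[1:]  # (feature, threshold, left_label, right_label)

def predict_stump(stump, x):
    feature, threshold, left, right = stump
    return left if x[feature] <= threshold else right

def train_forest(rows, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(rows, k=len(rows))    # bootstrap sample
        feature = rng.randrange(len(rows[0][0]))   # random feature for this stump
        forest.append(train_stump(sample, feature))
    return forest

def predict_forest(forest, x):
    """Combine the stumps' predictions by majority vote."""
    return majority([predict_stump(s, x) for s in forest], "Fail")

rng = random.Random(0)
forest = train_forest(DATA, n_trees=25, rng=rng)
accuracy = sum(predict_forest(forest, x) == lab for x, lab in DATA) / len(DATA)
print(f"Training accuracy: {accuracy:.2f}")
print(predict_forest(forest, (9, 95)))
```

A real evaluation would hold out test data instead of scoring on the rows used for training.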
One-line summary
Random Forest is a supervised ensemble learning algorithm that combines multiple decision trees and uses voting or averaging to improve prediction accuracy and reduce overfitting.