Machine Learning — K-means Clustering
K-Means Clustering
Definition
K-Means Clustering is an Unsupervised Machine Learning algorithm used to divide data into K number of clustersbased on similarity.
The objective is:
To group similar data points into clusters while keeping dissimilar data points in separate clusters.
"K" represents:
Number of clusters
Examples:
- Customer segmentation
- Image compression
- Student grouping
- Recommendation systems
- Market analysis
Why K-Means?
Suppose a shopping company has customer data:
CustomerAgeSpendingA22500B24600C505000D554500
Without labels:
Young Customer High Spender
K-Means automatically identifies groups:
Cluster 1 → Young Customers Cluster 2 → High-Spending Customers
Basic Working of K-Means
Input Data
↓
Choose K clusters
↓
Initialize centroids
↓
Assign data points
↓
Update centroids
↓
Repeat until convergence
Important Terminologies
1. Cluster
A group of similar data points.
Example:
Cluster 1: Students Cluster 2: Professionals
2. Centroid
Centroid is the center point of a cluster.
Example:
Cluster
● ● ●
X
● ●
Where:
X = Centroid
3. Distance Measure
K-Means calculates similarity using distance.
Most commonly:
Euclidean Distance
genui{"math_block_widget_always_prefetch_v2":{"content":"d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}"}}
Smaller distance:
Higher similarity
Step-by-Step Working of K-Means
Suppose we have data points:
PointsA(2,2)B(3,2)C(8,8)D(9,8)
Choose:
K=2
Step 1: Select number of clusters (K)
Choose:
K=2
Meaning:
Create 2 clusters
Step 2: Initialize centroids randomly
Suppose:
Centroid1=(2,2) Centroid2=(8,8)
Step 3: Assign points to nearest centroid
Distance calculations:
A → Centroid1 B → Centroid1 C → Centroid2 D → Centroid2
Clusters:
Cluster1: A,B Cluster2: C,D
Step 4: Recalculate centroids
New centroid formula:
Centroid=\frac{\sum x_i}{n}
Cluster 1:
(2+3)/2 =2.5
Cluster 2:
(8+9)/2 =8.5
Step 5: Repeat until convergence
Continue:
Assign
↓
Update centroid
↓
Assign again
Stop when:
Centroids stop changing
Objective Function of K-Means
K-Means minimizes Within Cluster Sum of Squares (WCSS).
Formula:
WCSS=\sum\sum||x_i-c_j||^2
Where:
- xi = Data point
- cj = Cluster centroid
Goal:
Minimize WCSS
Choosing the Optimal Value of K
One major challenge:
How many clusters should be used?
Common method:
Elbow Method
Process:
- Run K-Means for multiple K values
K=1,2,3,4,5...
- Compute WCSS
- Plot graph
WCSS ^ | |\ | \ | \ | \ | \__ | \__ |______________> K
The point where bend occurs:
Elbow Point
Optimal K:
K=3
Assumptions of K-Means
- Clusters are approximately circular
- Similar cluster sizes
- Distance indicates similarity
- Data is numerical
Performance Evaluation Metrics
1. Inertia (WCSS)
Measures compactness of clusters.
Formula:
Inertia=\sum(x_i-c_i)^2
Interpretation:
Lower Inertia
↓
Better clustering
2. Silhouette Score
Measures separation and compactness.
S=\frac{b-a}{max(a,b)}
Where:
- a = Average distance within cluster
- b = Average distance to nearest cluster
Range:
-1 → Poor 0 → Overlapping 1 → Perfect
3. Davies-Bouldin Index
Measures similarity between clusters.
Interpretation:
Lower value
↓
Better clusters
Advantages
- Simple to implement
- Fast computation
- Efficient on large datasets
- Easy interpretation
Disadvantages
- Must specify K beforehand
- Sensitive to outliers
- Different initialization may give different results
- Works poorly with irregular cluster shapes
Real-world Applications
Business
- Customer segmentation
Banking
- Fraud pattern detection
Healthcare
- Disease grouping
Image Processing
- Image compression
Education
- Student performance grouping
Complete Workflow
Collect Data
↓
Choose K
↓
Initialize Centroids
↓
Assign Data Points
↓
Update Centroids
↓
Repeat Process
↓
Generate Clusters
↓
Evaluate Clusters
One-line summary
K-Means is an unsupervised clustering algorithm that divides data into K clusters by minimizing the distance between data points and cluster centroids.