K-Means Clustering

Definition

K-Means Clustering is an Unsupervised Machine Learning algorithm used to divide data into K number of clustersbased on similarity.

The objective is:

To group similar data points into clusters while keeping dissimilar data points in separate clusters.

"K" represents:

Number of clusters

Examples:

  • Customer segmentation
  • Image compression
  • Student grouping
  • Recommendation systems
  • Market analysis

Why K-Means?

Suppose a shopping company has customer data:

CustomerAgeSpendingA22500B24600C505000D554500

Without labels:

Young Customer
High Spender

K-Means automatically identifies groups:

Cluster 1 → Young Customers

Cluster 2 → High-Spending Customers

Basic Working of K-Means

Input Data
      ↓
Choose K clusters
      ↓
Initialize centroids
      ↓
Assign data points
      ↓
Update centroids
      ↓
Repeat until convergence

Important Terminologies

1. Cluster

A group of similar data points.

Example:

Cluster 1:
Students

Cluster 2:
Professionals

2. Centroid

Centroid is the center point of a cluster.

Example:

Cluster

  ● ● ●
    X
  ● ●

Where:

X = Centroid

3. Distance Measure

K-Means calculates similarity using distance.

Most commonly:

Euclidean Distance

genui{"math_block_widget_always_prefetch_v2":{"content":"d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}"}}

Smaller distance:

Higher similarity

Step-by-Step Working of K-Means

Suppose we have data points:

PointsA(2,2)B(3,2)C(8,8)D(9,8)

Choose:

K=2

Step 1: Select number of clusters (K)

Choose:

K=2

Meaning:

Create 2 clusters

Step 2: Initialize centroids randomly

Suppose:

Centroid1=(2,2)

Centroid2=(8,8)

Step 3: Assign points to nearest centroid

Distance calculations:

A → Centroid1

B → Centroid1

C → Centroid2

D → Centroid2

Clusters:

Cluster1:
A,B

Cluster2:
C,D

Step 4: Recalculate centroids

New centroid formula:

Centroid=\frac{\sum x_i}{n}

Cluster 1:

(2+3)/2

=2.5

Cluster 2:

(8+9)/2

=8.5

Step 5: Repeat until convergence

Continue:

Assign
      ↓
Update centroid
      ↓
Assign again

Stop when:

Centroids stop changing

Objective Function of K-Means

K-Means minimizes Within Cluster Sum of Squares (WCSS).

Formula:

WCSS=\sum\sum||x_i-c_j||^2

Where:

  • xi = Data point
  • cj = Cluster centroid

Goal:

Minimize WCSS

Choosing the Optimal Value of K

One major challenge:

How many clusters should be used?

Common method:

Elbow Method

Process:

  1. Run K-Means for multiple K values
K=1,2,3,4,5...
  1. Compute WCSS
  2. Plot graph
WCSS
 ^
 |
 |\
 | \
 |  \
 |   \
 |    \__
 |        \__
 |______________> K

The point where bend occurs:

Elbow Point

Optimal K:

K=3

Assumptions of K-Means

  1. Clusters are approximately circular
  2. Similar cluster sizes
  3. Distance indicates similarity
  4. Data is numerical

Performance Evaluation Metrics

1. Inertia (WCSS)

Measures compactness of clusters.

Formula:

Inertia=\sum(x_i-c_i)^2

Interpretation:

Lower Inertia
      ↓
Better clustering

2. Silhouette Score

Measures separation and compactness.

S=\frac{b-a}{max(a,b)}

Where:

  • a = Average distance within cluster
  • b = Average distance to nearest cluster

Range:

-1 → Poor

0 → Overlapping

1 → Perfect

3. Davies-Bouldin Index

Measures similarity between clusters.

Interpretation:

Lower value
      ↓
Better clusters

Advantages

  1. Simple to implement
  2. Fast computation
  3. Efficient on large datasets
  4. Easy interpretation

Disadvantages

  1. Must specify K beforehand
  2. Sensitive to outliers
  3. Different initialization may give different results
  4. Works poorly with irregular cluster shapes

Real-world Applications

Business

  • Customer segmentation

Banking

  • Fraud pattern detection

Healthcare

  • Disease grouping

Image Processing

  • Image compression

Education

  • Student performance grouping

Complete Workflow

Collect Data
      ↓
Choose K
      ↓
Initialize Centroids
      ↓
Assign Data Points
      ↓
Update Centroids
      ↓
Repeat Process
      ↓
Generate Clusters
      ↓
Evaluate Clusters

One-line summary

K-Means is an unsupervised clustering algorithm that divides data into K clusters by minimizing the distance between data points and cluster centroids.