← Clustering & Clustering Techniques Hierarchical Clustering →

K-Means Clustering

Definition

K-Means Clustering is an Unsupervised Machine Learning algorithm used to divide data into K number of clustersbased on similarity.

The objective is:

To group similar data points into clusters while keeping dissimilar data points in separate clusters.

"K" represents:

Number of clusters

Examples:

Customer segmentation
Image compression
Student grouping
Recommendation systems
Market analysis

Why K-Means?

Suppose a shopping company has customer data:

CustomerAgeSpendingA22500B24600C505000D554500

Without labels:

Young Customer
High Spender

K-Means automatically identifies groups:

Cluster 1 → Young Customers

Cluster 2 → High-Spending Customers

Basic Working of K-Means

Input Data
      ↓
Choose K clusters
      ↓
Initialize centroids
      ↓
Assign data points
      ↓
Update centroids
      ↓
Repeat until convergence

Important Terminologies

1. Cluster

A group of similar data points.

Example:

Cluster 1:
Students

Cluster 2:
Professionals

2. Centroid

Centroid is the center point of a cluster.

Example:

Cluster

  ● ● ●
    X
  ● ●

Where:

X = Centroid

3. Distance Measure

K-Means calculates similarity using distance.

Most commonly:

Euclidean Distance

genui{"math_block_widget_always_prefetch_v2":{"content":"d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}"}}

Smaller distance:

Higher similarity

Step-by-Step Working of K-Means

Suppose we have data points:

PointsA(2,2)B(3,2)C(8,8)D(9,8)

Choose:

K=2

Step 1: Select number of clusters (K)

Choose:

K=2

Meaning:

Create 2 clusters

Step 2: Initialize centroids randomly

Suppose:

Centroid1=(2,2)

Centroid2=(8,8)

Step 3: Assign points to nearest centroid

Distance calculations:

A → Centroid1

B → Centroid1

C → Centroid2

D → Centroid2

Clusters:

Cluster1:
A,B

Cluster2:
C,D

Step 4: Recalculate centroids

New centroid formula:

Centroid=\frac{\sum x_i}{n}

Cluster 1:

(2+3)/2

=2.5

Cluster 2:

(8+9)/2

=8.5

Step 5: Repeat until convergence

Continue:

Assign
      ↓
Update centroid
      ↓
Assign again

Stop when:

Centroids stop changing

Objective Function of K-Means

K-Means minimizes Within Cluster Sum of Squares (WCSS).

Formula:

WCSS=\sum\sum||x_i-c_j||^2

Where:

xi = Data point
cj = Cluster centroid

Goal:

Minimize WCSS

Choosing the Optimal Value of K

One major challenge:

How many clusters should be used?

Common method:

Elbow Method

Process:

Run K-Means for multiple K values

K=1,2,3,4,5...

Compute WCSS
Plot graph

WCSS
 ^
 |
 |\
 | \
 |  \
 |   \
 |    \__
 |        \__
 |______________> K

The point where bend occurs:

Elbow Point

Optimal K:

K=3

Assumptions of K-Means

Clusters are approximately circular
Similar cluster sizes
Distance indicates similarity
Data is numerical

Performance Evaluation Metrics

1. Inertia (WCSS)

Measures compactness of clusters.

Formula:

Inertia=\sum(x_i-c_i)^2

Interpretation:

Lower Inertia
      ↓
Better clustering

2. Silhouette Score

Measures separation and compactness.

S=\frac{b-a}{max(a,b)}

Where:

a = Average distance within cluster
b = Average distance to nearest cluster

Range:

-1 → Poor

0 → Overlapping

1 → Perfect

3. Davies-Bouldin Index

Measures similarity between clusters.

Interpretation:

Lower value
      ↓
Better clusters

Advantages

Simple to implement
Fast computation
Efficient on large datasets
Easy interpretation

Disadvantages

Must specify K beforehand
Sensitive to outliers
Different initialization may give different results
Works poorly with irregular cluster shapes

Real-world Applications

Business

Customer segmentation

Banking

Fraud pattern detection

Healthcare

Disease grouping

Image Processing

Image compression

Education

Student performance grouping

Complete Workflow

Collect Data
      ↓
Choose K
      ↓
Initialize Centroids
      ↓
Assign Data Points
      ↓
Update Centroids
      ↓
Repeat Process
      ↓
Generate Clusters
      ↓
Evaluate Clusters

One-line summary

K-Means is an unsupervised clustering algorithm that divides data into K clusters by minimizing the distance between data points and cluster centroids.

← Previous: Clustering & Clustering Techniques Next: Hierarchical Clustering →

Machine Learning — K-means Clustering

K-Means Clustering

Definition

Why K-Means?

Basic Working of K-Means

Important Terminologies

1. Cluster

2. Centroid

3. Distance Measure

Euclidean Distance

Step-by-Step Working of K-Means

Step 1: Select number of clusters (K)

Step 2: Initialize centroids randomly

Step 3: Assign points to nearest centroid

Step 4: Recalculate centroids

Step 5: Repeat until convergence

Objective Function of K-Means

Choosing the Optimal Value of K

Elbow Method

Assumptions of K-Means

Performance Evaluation Metrics

1. Inertia (WCSS)

2. Silhouette Score

3. Davies-Bouldin Index

Advantages

Disadvantages

Real-world Applications

Business

Banking

Healthcare

Image Processing

Education

Complete Workflow

One-line summary