What is DBSCAN?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Instead of forcing clusters to be “round” (like k-means), it builds clusters by finding regions of high point density and expanding outward from them. Points that don't belong to any dense region are labeled as noise / outliers. [1][3]
What is it used for?
- Arbitrary-shape clusters: long streaks, rings, blobs, multi-lobed shapes (not just spherical clusters). [1]
- Outlier detection: “noise points” are explicitly labeled (often -1 in libraries). [3]
- No need to pre-pick “k” clusters: you don't specify the number of clusters in advance. [3]
One caveat: when clusters differ in density, a single eps can be too small for the sparse cluster (its points become “noise”) or too large, so that it merges clusters that should be separate. [3][1]
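To make the eps trade-off concrete, here is a small sketch on made-up 1-D data. The point values, the threshold `MIN_PTS = 3`, and the helper names are assumptions chosen for illustration. It does not run full DBSCAN; it only checks the two conditions that decide the outcome: whether a cluster contains any core point, and whether the gap between two clusters is within eps.

```python
# Toy 1-D data (made up for illustration): two dense clusters and one sparse one.
dense_a = [0.0, 0.1, 0.2, 0.3, 0.4]
dense_b = [1.2, 1.3, 1.4, 1.5, 1.6]
sparse_c = [10.0, 11.0, 12.0, 13.0, 14.0]
points = dense_a + dense_b + sparse_c
MIN_PTS = 3  # assumed threshold; the point itself is counted


def neighbor_count(p, eps):
    """Number of points within eps of p, including p itself."""
    return sum(1 for q in points if abs(p - q) <= eps)


def has_core_point(cluster, eps):
    """True if at least one point of the cluster is a core point."""
    return any(neighbor_count(p, eps) >= MIN_PTS for p in cluster)


def gap(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(abs(p - q) for p in a for q in b)


small_eps, large_eps = 0.25, 1.0

# Small eps: the dense clusters stay separate, but the sparse cluster has
# no core point, so every one of its points would be labeled noise.
dense_separate = gap(dense_a, dense_b) > small_eps
sparse_lost = not has_core_point(sparse_c, small_eps)

# Large eps: the sparse cluster now has core points, but every dense point
# is core and the gap between the dense clusters is within eps, so the two
# dense clusters chain together into one.
sparse_kept = has_core_point(sparse_c, large_eps)
all_dense_core = all(
    neighbor_count(p, large_eps) >= MIN_PTS for p in dense_a + dense_b
)
dense_merged = all_dense_core and gap(dense_a, dense_b) <= large_eps

print(dense_separate, sparse_lost, sparse_kept, dense_merged)
```

On this data no single eps keeps the sparse cluster while also keeping the two dense clusters apart, which is the situation described above.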
How many dimensions can DBSCAN handle?
DBSCAN is not limited to 2D plots. It works on data with any number of features
(dimensions) as long as you can define a distance / similarity measure and do neighborhood queries
(“find all points within eps of this point”). [3]
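As a rough illustration of that neighborhood query, the sketch below uses a brute-force scan with a pluggable distance function. The names `region_query`, `euclidean`, and `hamming`, and the sample data, are hypothetical and chosen for this example; real libraries typically use spatial indexes instead of a full scan.

```python
import math


def region_query(data, index, eps, distance):
    """Return indices of all items within eps of data[index] (itself included)."""
    return [j for j, other in enumerate(data) if distance(data[index], other) <= eps]


def euclidean(a, b):
    """Straight-line distance; works for any number of dimensions."""
    return math.dist(a, b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Five-dimensional numeric points: same query, just more features.
vectors = [(0, 0, 0, 0, 0), (1, 0, 0, 0, 0), (5, 5, 5, 5, 5)]
near_first = region_query(vectors, 0, eps=1.5, distance=euclidean)
print(near_first)  # [0, 1]

# Non-numeric data also works, given a suitable distance.
words = ["karolin", "kathrin", "kerstin", "zzzzzzz"]
near_word = region_query(words, 0, eps=3, distance=hamming)
print(near_word)  # [0, 1, 2]
```

Nothing in the query depends on the data being 2D, or even numeric; only the distance function changes.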
However, choosing eps becomes harder in high-dimensional data
because distances can lose contrast (many points start to look similarly far apart),
and that makes parameterization tricky. [2]
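The loss of contrast can be seen in a quick experiment. The sketch below assumes uniformly random points in the unit cube and measures, for one query point, how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast`, the sample size, and the seed are arbitrary choices for illustration, and other data distributions can behave differently.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to the rest."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = pts[0], pts[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


# The ratio shrinks as the dimension grows: nearest and farthest neighbors
# end up at almost the same distance, so a threshold like eps separates little.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

When this ratio is small, a slightly smaller eps captures almost no neighbors and a slightly larger one captures almost all of them, which is why tuning gets harder.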
Key concepts (the vocabulary you'll see everywhere)
- eps-neighborhood: all points within distance eps of a point. [3]
- Core point: has at least minPts points in its eps-neighborhood (most implementations, including this one, count the point itself). [1][3]
- Border point: is within eps of a core point, but does not itself have enough neighbors to be core. [1]
- Noise point: not density-reachable from any core region (doesn't belong to any cluster). [1]
- Density-reachable / density-connected: formal definitions that explain how clusters are “grown” through chains of core points (and attached border points). [1]
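The vocabulary above maps fairly directly onto code. What follows is a minimal brute-force sketch, not a reference implementation: the function name `dbscan`, its parameters `eps` and `min_pts`, and the toy points are chosen for illustration, it assumes Euclidean distance, and it counts the point itself toward `min_pts`. Clusters are grown by following chains of core points; border points are attached but never expanded.

```python
import math
from collections import deque

NOISE = -1  # the label many libraries use for noise points


def dbscan(points, eps, min_pts):
    """Return (labels, is_core): labels are 0, 1, ... per cluster, NOISE otherwise."""
    n = len(points)
    # eps-neighborhood of every point (each includes the point itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n  # None means "not assigned yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Start a new cluster at an unassigned core point and grow it.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster  # q is density-reachable from i
                    if is_core[q]:
                        queue.append(q)  # only core points keep expanding
        cluster += 1

    # Points never reached from any core point are noise.
    return [NOISE if lab is None else lab for lab in labels], is_core


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),  # dense square: core points
    (2.0, 1.0),    # border: within eps of (1, 1) but has too few neighbors
    (10.0, 10.0),  # isolated: noise
]
labels, is_core = dbscan(points, eps=1.1, min_pts=3)
print(labels)   # [0, 0, 0, 0, 0, -1]
print(is_core)  # [True, True, True, True, False, False]
```

The all-pairs neighborhood computation is quadratic in the number of points, which is fine for a toy example but is the part a practical implementation would replace with a spatial index.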