DBSCAN Visualizer

// DBSCAN • 2D/3D Viz • Rust WASM • Clustering Metrics


Understanding Cluster Quality Metrics

Silhouette Score

Measures how similar each point is to its own cluster compared to other clusters. A point's silhouette value ranges from -1 to +1, where +1 means it's well-matched to its cluster and far from neighboring clusters.

Range: -1 to +1 • Perfect: +1 • Higher is better

Davies-Bouldin Index

Measures the average "similarity" between each cluster and its most similar one. Similarity here is the ratio of within-cluster distances to between-cluster distances. Lower values indicate better separation.

Range: 0 to ∞ • Perfect: 0 • Lower is better

Calinski-Harabasz Index

Also called the Variance Ratio Criterion. Measures the ratio of between-cluster dispersion to within-cluster dispersion. Higher values suggest dense, well-separated clusters. Best used for comparing different parameter choices.

Range: 0 to ∞ • Higher is better • Compare relatively
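
The three metrics above ship with scikit-learn, so they are easy to reproduce on any labeled result. A minimal sketch, assuming DBSCAN's convention of labeling noise as -1 and that at least two clusters were found:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.metrics import (
        silhouette_score,
        davies_bouldin_score,
        calinski_harabasz_score,
    )

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)
    labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

    # Score only clustered points; noise (-1) is excluded by convention.
    mask = labels != -1
    print("Silhouette:       ", silhouette_score(X[mask], labels[mask]))
    print("Davies-Bouldin:   ", davies_bouldin_score(X[mask], labels[mask]))
    print("Calinski-Harabasz:", calinski_harabasz_score(X[mask], labels[mask]))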

DBCV (Density-Based Clustering Validation)

Specifically designed for density-based clustering like DBSCAN. Considers the density structure of clusters rather than just distances. Uses mutual reachability distances to assess cluster quality.

Range: -1 to +1 • Perfect: +1 • Higher is better

Noise Ratio

The fraction of points classified as noise. Some noise is normal for real-world data (1-10%). Very low noise might mean ε is too large; very high noise might mean ε is too small or min_pts too high.

Range: 0% to 100% • Ideal: 1-10% for most data

Size CV (Coefficient of Variation)

Measures how balanced your cluster sizes are. Low values mean clusters are similarly sized. High values indicate one dominant cluster or many tiny fragments, which might warrant parameter adjustment.

Range: 0 to ∞ • Perfect: 0 • Lower is better
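
Both of these distribution diagnostics are one-liners given the label vector. A quick NumPy sketch (toy labels, with -1 marking noise):

    import numpy as np

    labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, -1, -1])  # toy label vector

    noise_ratio = np.mean(labels == -1)          # fraction of points labeled noise

    sizes = np.bincount(labels[labels != -1])    # cluster sizes, noise excluded
    size_cv = sizes.std() / sizes.mean()         # coefficient of variation

    print(f"noise ratio: {noise_ratio:.1%}, size CV: {size_cv:.2f}")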

💡 Tip: No single metric tells the whole story. Use these together with visual inspection. Different metrics may disagree! That's normal and reflects different aspects of clustering quality.

DBSCAN: what it is, what it's used for, and why it's different

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Instead of forcing clusters to be “round” (like k-means), it builds clusters by finding regions of high point density and expanding outward from them. Points that don't belong to any dense region are labeled as noise / outliers. [1][3]

What is it used for?

  • Arbitrary-shape clusters: long streaks, rings, blobs, multi-lobed shapes (not just spherical clusters). [1]
  • Outlier detection: “noise points” are explicitly labeled (often -1 in libraries). [3]
  • No need to pre-pick “k” clusters: you don't specify the number of clusters in advance. [3]

Important limitation: DBSCAN works best when clusters have similar density. If one cluster is very dense and another very sparse, a single global eps can be too small for the sparse cluster (it all becomes “noise”) or so large that it merges clusters that should stay separate. [3][1]

How many dimensions can DBSCAN handle?

DBSCAN is not limited to 2D plots. It works on data with any number of features (dimensions) as long as you can define a distance / similarity measure and do neighborhood queries (“find all points within eps of this point”). [3]

However, choosing eps becomes harder in high-dimensional data because distances can lose contrast (many points start to look similarly far apart), and that makes parameterization tricky. [2]

Key concepts (the vocabulary you'll see everywhere)

  • eps-neighborhood: all points within distance eps of a point. [3]
  • Core point: has at least minPts points in its eps-neighborhood (most implementations, including this one, count the point itself). [1][3]
  • Border point: is within eps of a core point, but does not itself have enough neighbors to be core. [1]
  • Noise point: not density-reachable from any core region (doesn't belong to any cluster). [1]
  • Density-reachable / density-connected: formal definitions that explain how clusters are “grown” through chains of core points (and attached border points). [1]
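
These definitions map directly onto code. A sketch that classifies every point as core, border, or noise using scikit-learn radius queries (which, like this tool, count the point itself as a neighbor):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
    eps, min_pts = 0.15, 5

    # Indices of every point's eps-neighborhood (includes the point itself).
    nbrs = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nbrs.radius_neighbors(X, return_distance=False)

    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])
    # Border: not core, but within eps of at least one core point.
    is_border = ~is_core & np.array([is_core[nb].any() for nb in neighborhoods])
    is_noise = ~is_core & ~is_border

    print(is_core.sum(), "core |", is_border.sum(), "border |", is_noise.sum(), "noise")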

How DBSCAN works + picking good parameters (k-distance “elbow” for eps)

Short walkthrough of the algorithm

DBSCAN can be understood as: “find dense seeds, then flood-fill density-connected points.” The original paper describes expanding a cluster starting from a seed point and repeatedly querying neighborhoods to grow the cluster. [1]

  1. Visit an unvisited point p and compute its neighbors within eps. [1][3]
  2. If p has fewer than minPts neighbors, label it noise (it may later become a border point if reached by a nearby core). [1]
  3. If p is a core point, start a new cluster and put its neighbors into a “seed list/queue”. [1]
  4. Expand the cluster: pop a point from the seed list; if it is core, add its neighbors to the seed list. Continue until the seed list is empty (cluster cannot expand further). [1]
  5. Repeat until all points are visited. [1]

Why it finds arbitrary shapes: the cluster grows through chains of density-reachable core points. This is why a ring can be found as one cluster even though it is not “round” in the k-means sense. [1]
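
The walkthrough above translates into surprisingly little code. A minimal, unoptimized Python rendering of the flood-fill (brute-force neighborhood queries; a real implementation, including this tool, would use a spatial index such as a KD-tree):

    import numpy as np

    def dbscan(X, eps, min_pts):
        n = len(X)
        labels = np.full(n, -1)            # -1 = noise until proven otherwise
        visited = np.zeros(n, dtype=bool)
        cluster = 0

        def region_query(i):
            # Brute-force eps-neighborhood; includes point i itself.
            return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

        for p in range(n):
            if visited[p]:
                continue
            visited[p] = True
            neighbors = region_query(p)
            if len(neighbors) < min_pts:
                continue                   # noise for now; may become border later
            labels[p] = cluster            # p is core: start a new cluster
            seeds = list(neighbors)        # seed list for expansion
            while seeds:
                q = seeds.pop()
                if labels[q] == -1:
                    labels[q] = cluster    # claim q (border point or new core)
                if visited[q]:
                    continue
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    seeds.extend(q_neighbors)  # q is core: keep growing through it
            cluster += 1
        return labels

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal((0, 0), 0.1, (100, 2)), rng.normal((1, 1), 0.1, (100, 2))])
    print(np.unique(dbscan(X, eps=0.2, min_pts=5)))  # two separated blobs -> labels 0 and 1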

The two parameters that matter: minPts and eps

Most DBSCAN implementations expose:

  • minPts (a.k.a. min_samples): how many points must be inside the eps-neighborhood for a point to count as core. Larger values demand denser clusters; smaller values allow sparser clusters. [3]
  • eps: the maximum distance at which two points are considered neighbors (used for all neighborhood queries). This is the most sensitive parameter; smaller values typically yield more (and smaller) clusters. [3]
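
The sensitivity of eps is easy to see empirically. A small sketch sweeping eps on the classic two-moons dataset:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

    # Smaller eps -> more (and smaller) clusters plus more noise;
    # larger eps -> fewer clusters that eventually merge into one.
    for eps in (0.05, 0.15, 0.40):
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        n_clusters = labels.max() + 1      # labels are 0..k-1, noise is -1
        n_noise = (labels == -1).sum()
        print(f"eps={eps:.2f}: {n_clusters} clusters, {n_noise} noise points")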

Picking minPts (practical guidance)

  • In many datasets, minPts can stay near a small value (e.g., 4 is a commonly cited default for 2D data). [2]
  • A commonly suggested heuristic is minPts = 2 * dim (twice the dataset dimensionality). If you have more noise, huge datasets, high dimensions, or duplicates, increasing minPts can help. [2]

Picking eps using a k-distance plot (the “elbow” idea)

A standard heuristic is to compute a k-distance plot: for each point, compute the distance to its k-th nearest neighbor, then sort these distances and plot them. The original DBSCAN paper discusses choosing eps using k-nearest-neighbor distances (for 2D, it describes using the distance to the 4th nearest neighbor as a heuristic). [2]

In DBSCAN “parameter heuristics” discussions, a common mapping is: k corresponds to minPts = k + 1 (because range queries include the point itself), so if you pick minPts, you often use k = minPts - 1 in the k-distance plot. [2]

k-distance plot recipe (high level)

1) Choose minPts (or try a few values)
2) For each point i:
     di = distance_to_kth_nearest_neighbor(i, k=minPts-1)
3) Sort {di} from largest to smallest (or smallest to largest - just be consistent)
4) Look for the “knee/elbow”:
     - left side: steep region (very isolated points / noise)
     - right side: flatter region (dense interiors)
5) Pick eps near the knee (often slightly below it)

Reality check: sometimes there is no clear “valley / knee / elbow.” In that case, you're effectively choosing a trade-off. Experience often favors the lower end of the plausible range (smaller eps) to avoid merging clusters. [2]
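
The recipe takes only a few lines with scikit-learn and matplotlib. A sketch, using k = minPts - 1 because the query point itself comes back as its own nearest neighbor:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_moons
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)
    min_pts = 4                  # the 2 * dim heuristic for 2D data
    k = min_pts - 1

    # Column -1 holds each point's distance to its k-th nearest neighbor
    # (column 0 is the distance to itself, which is 0).
    distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    k_dist = np.sort(distances[:, -1])

    plt.plot(k_dist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.title("k-distance plot: pick eps near the knee")
    plt.show()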

When DBSCAN becomes annoying

  • High-dimensional data: selecting eps becomes difficult as distance contrast degrades. Alternatives like OPTICS / HDBSCAN* remove the need to choose a single global eps (though high dimensionality is still hard in general). [2]
  • Mixed densities: one global eps can cause sparse clusters to disappear or dense clusters to merge. [1][3]

References

  1. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD '96. (DBSCAN original paper)
  2. Schubert, E., Sander, J., Ester, M., Kriegel, H.-P., & Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM TODS, 42(3). (Parameter heuristics, k-distance plots, high-dimensional caveats)
  3. scikit-learn documentation: sklearn.cluster.DBSCAN (eps/min_samples definitions, behavior notes, references).

Real-world applications: medicine, imaging, engineering, and science

DBSCAN's ability to find arbitrarily shaped clusters and explicitly identify outliers makes it valuable across many domains. Below are representative applications organized by field.

Medicine & Healthcare

  • Disease outbreak detection: Identifying spatial clusters of disease cases (e.g., COVID-19 hotspots, cancer clusters) where the shape of the outbreak region is unknown and isolated cases should be flagged as noise.
  • Patient stratification: Grouping patients by clinical biomarkers (lab values, vitals) to discover phenotypes or subtypes of a disease without assuming how many groups exist.
  • Anomaly detection in ECG/EEG signals: After feature extraction, DBSCAN can identify abnormal heartbeat or brainwave patterns as noise points distinct from normal clusters.
  • Genomics: Clustering gene expression profiles or single-cell RNA-seq data to identify cell types or states, where cluster shapes in high-dimensional space are non-spherical.

Image Processing & Computer Vision

  • Image segmentation: Grouping pixels by color/texture similarity in LAB or RGB space to segment objects without predefined region shapes.
  • Object detection preprocessing: Clustering keypoints or feature descriptors (e.g., SIFT, ORB) to identify distinct objects in a scene.
  • LiDAR point cloud processing: Segmenting 3D point clouds from autonomous vehicles or drones to identify objects (cars, pedestrians, buildings) with irregular shapes.
  • Medical imaging: Detecting tumors or lesions in CT/MRI scans by clustering voxel intensities, where tumor boundaries are irregular.
  • Satellite imagery: Identifying land-use regions, urban sprawl patterns, or deforestation areas from multispectral image features.

Engineering & Manufacturing

  • Predictive maintenance: Clustering sensor readings (vibration, temperature, pressure) from industrial equipment to detect anomalous operating states before failure.
  • Quality control: Identifying defective products on an assembly line by clustering measurement data; outliers represent manufacturing defects.
  • Network intrusion detection: Clustering network traffic patterns to identify normal behavior clusters and flag anomalous (potentially malicious) traffic as noise.
  • Structural health monitoring: Analyzing strain gauge or accelerometer data from bridges, buildings, or aircraft to detect damage patterns.
  • Semiconductor manufacturing: Clustering wafer test data to identify process drifts or equipment issues affecting chip yield.

Scientific Research

  • Astronomy: Identifying galaxy clusters, star clusters, or cosmic structures from survey data where cluster shapes are highly irregular.
  • Particle physics: Clustering particle tracks or energy deposits in detectors to reconstruct collision events and identify anomalous signatures.
  • Climate science: Grouping weather stations or grid cells by climate patterns to identify climate zones or detect unusual weather events.
  • Ecology: Clustering animal GPS tracks to identify home ranges, migration corridors, or unusual movement patterns indicating distress.
  • Chemistry/Materials science: Clustering molecular simulations or spectroscopy data to identify distinct molecular conformations or material phases.

Business & Social Science

  • Customer segmentation: Grouping customers by purchasing behavior or demographics without assuming the number of segments.
  • Fraud detection: Identifying unusual transaction patterns in banking or insurance as outliers from normal behavior clusters.
  • Social network analysis: Detecting communities in social graphs or identifying bot accounts as anomalies in user behavior feature space.
  • Urban planning: Clustering GPS or mobile phone data to identify activity centers, commuting patterns, or underserved areas.

Why DBSCAN for these applications?
  • No need to specify the number of clusters in advance
  • Can find clusters of arbitrary shape (not just spherical)
  • Built-in outlier detection (noise points)
  • Robust to outliers: they don't distort cluster centers
  • Deterministic results (unlike k-means with random initialization)

When to consider alternatives:
  • Varying density clusters: Use HDBSCAN or OPTICS instead
  • Very high dimensions: Consider dimensionality reduction first (PCA, UMAP)
  • Massive datasets: Use approximate methods or spatial indexing (this tool uses a KD-tree)
  • Streaming data: Consider incremental variants like DenStream

Technical Design Overview: Architecture & Implementation

This visualizer combines Rust/WebAssembly for high-performance clustering with JavaScript for UI orchestration and HTML5 Canvas for rendering. Clustering results are validated against sklearn.cluster.DBSCAN via Python backchecks.

Architecture

The application runs entirely in the browser across three layers: the UI layer (HTML/CSS) handles user input, visualization, and theming. The JavaScript orchestrator manages CSV parsing, data preparation, animation state, and coordinates calls to WebAssembly. The Rust/WASM compute core performs all the heavy lifting including KD-tree construction, DBSCAN clustering, core/border classification, and all metric calculations. Data passes between JavaScript and WASM via shared Float32Array buffers.

Separately, a Python validation script using scikit-learn verifies that WASM outputs match sklearn's DBSCAN exactly.

Technology Stack

Layer        | Technology              | Purpose
-------------|-------------------------|------------------------------------------------
UI           | HTML5, CSS3, Canvas API | User input, visualization, theming
Orchestrator | JavaScript (ES6+)       | Data parsing, state management, WASM calls
Compute Core | Rust -> WebAssembly     | Clustering, metrics, spatial indexing (KD-tree)
Validation   | Python + scikit-learn   | Correctness verification against sklearn

Performance

Rust/WASM provides roughly a 10× speedup over pure JavaScript for clustering operations. On a 30,000-point dataset, JavaScript takes ~6,200 ms while Rust/WASM completes in ~600 ms. Take the comparison with a grain of salt: I put far more optimization effort into the Rust code than into the JavaScript baseline.

Data Flow

CSV files are parsed in JavaScript, where users select numeric columns for clustering. The selected features are flattened into a row-major Float32Array and passed across the WASM boundary. The Rust core handles missing values (drop, impute, or fail), optionally standardizes features via z-score normalization, builds a KD-tree for efficient neighbor queries, runs DBSCAN, and returns cluster labels plus quality metrics back to JavaScript for rendering on the canvas.
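
For illustration, the preprocessing described above can be mirrored in a few lines of Python. This is only a sketch of the described behavior, not the actual Rust API, and the policy names ("drop", "impute", "fail") are placeholders:

    import numpy as np

    def prepare(X, missing="drop", standardize=True):
        """Sketch of the described pipeline: NaN policy, then optional z-score."""
        X = np.asarray(X, dtype=np.float32)            # row-major f32, as sent to WASM
        nan_rows = np.isnan(X).any(axis=1)
        if missing == "drop":
            X = X[~nan_rows]                           # drop incomplete rows
        elif missing == "impute":
            col_means = np.nanmean(X, axis=0)
            X = np.where(np.isnan(X), col_means, X)    # fill NaNs with column means
        elif nan_rows.any():
            raise ValueError("NaN values present")     # "fail" policy
        if standardize:
            # z-score computed on the rows actually used, matching the tool.
            X = (X - X.mean(axis=0)) / X.std(axis=0)
        return X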

Key Implementation Details

  • KD-Tree spatial index: average-case O(log n) per neighborhood query, i.e., O(n log n) across all points, vs O(n^2) for brute force.
  • f64 distance calculations: 64-bit float precision internally achieves 100% cluster assignment match with sklearn.
  • Missing value policies: Drop rows, impute with column mean, or fail on NaN.
  • Standardization: Optional z-score normalization computed on used rows only.
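
For intuition about what the KD-tree buys, here is the same all-neighborhoods pattern with SciPy's KD-tree (not this tool's Rust implementation):

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.default_rng(0).random((30_000, 2))
    tree = cKDTree(X)                        # O(n log n) construction

    # Every eps-neighborhood in one call; each list includes the point itself.
    neighborhoods = tree.query_ball_point(X, r=0.01)
    print("avg neighborhood size:", np.mean([len(nb) for nb in neighborhoods]))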

Clustering Quality Metrics

Computed in the WASM core:

  • Silhouette Score: Cluster cohesion and separation (-1 to +1)
  • Davies-Bouldin Index: Average cluster similarity (lower = better)
  • Calinski-Harabasz Index: Between/within cluster dispersion ratio (higher = better)
  • DBCV: Density-Based Clustering Validation, designed for DBSCAN
  • Noise Ratio & Cluster Size CV: Distribution diagnostics

Validation: Outputs are verified against sklearn.cluster.DBSCAN using permutation-invariant comparison. Expected: ARI >0.99, AMI >0.96, exact match rate >0.98.
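
ARI and AMI are exactly the permutation-invariant scores scikit-learn ships for this. A sketch of such a check, where wasm_labels is a placeholder for labels exported from the WASM core:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

    X = np.random.default_rng(0).random((1_000, 2))
    sk_labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)
    wasm_labels = sk_labels.copy()           # placeholder for the WASM output

    print("ARI:", adjusted_rand_score(sk_labels, wasm_labels))
    print("AMI:", adjusted_mutual_info_score(sk_labels, wasm_labels))
    # Exact match assumes identical label numbering (ARI/AMI do not).
    print("exact match rate:", np.mean(sk_labels == wasm_labels))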

Tradeoffs

Decision                 | Benefit                 | Cost
-------------------------|-------------------------|-----------------------------
f32 storage, f64 compute | Memory efficiency       | Cast overhead (~negligible)
Sampled Silhouette       | Fast for large datasets | Slight variance
Single-threaded WASM     | Wider browser support   | No parallelism
KD-tree (not ball-tree)  | Simpler implementation  | Less optimal for high dims

Scalability

Points   | Time     | Recommendation
---------|----------|--------------------------
<10K     | <100ms   | Full metrics
10K-50K  | 100ms-1s | Full metrics
50K-100K | 1-5s     | Consider disabling DBCV
>100K    | >5s      | Consider chunking

Browser Compatibility: Requires WebAssembly (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+).

Version 2.1.1 - Feel free to find me on LinkedIn or email me at [email protected] if you'd like to discuss further.