DBSCAN Visualizer

// DBSCAN • 2D/3D Viz • Rust WASM • Clustering Metrics


Understanding Cluster Quality Metrics

Silhouette Score

Measures how similar each point is to its own cluster compared to other clusters. A point's silhouette value ranges from -1 to +1, where +1 means it's well-matched to its cluster and far from neighboring clusters.

Range: -1 to +1 • Perfect: +1 • Higher is better

Davies-Bouldin Index

Measures the average "similarity" between each cluster and its most similar one. Similarity here is the ratio of within-cluster distances to between-cluster distances. Lower values indicate better separation.

Range: 0 to ∞ • Perfect: 0 • Lower is better

Calinski-Harabasz Index

Also called the Variance Ratio Criterion. Measures the ratio of between-cluster dispersion to within-cluster dispersion. Higher values suggest dense, well-separated clusters. Best used for comparing different parameter choices.

Range: 0 to ∞ • Higher is better • Compare relatively
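
The three metrics above ship with scikit-learn, so they are easy to reproduce on any labeled result. A minimal sketch, assuming DBSCAN's convention of labeling noise as -1 and that at least two clusters were found:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.metrics import (
        silhouette_score,
        davies_bouldin_score,
        calinski_harabasz_score,
    )

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)
    labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

    # Score only clustered points; noise (-1) is excluded by convention.
    mask = labels != -1
    print("Silhouette:       ", silhouette_score(X[mask], labels[mask]))
    print("Davies-Bouldin:   ", davies_bouldin_score(X[mask], labels[mask]))
    print("Calinski-Harabasz:", calinski_harabasz_score(X[mask], labels[mask]))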

DBCV (Density-Based Clustering Validation)

Specifically designed for density-based clustering like DBSCAN. Considers the density structure of clusters rather than just distances. Uses mutual reachability distances to assess cluster quality.

Range: -1 to +1 • Perfect: +1 • Higher is better

Noise Ratio

The fraction of points classified as noise. Some noise is normal for real-world data (1-10%). Very low noise might mean ε is too large; very high noise might mean ε is too small or min_pts too high.

Range: 0% to 100% • Ideal: 1-10% for most data

Size CV (Coefficient of Variation)

Measures how balanced your cluster sizes are. Low values mean clusters are similarly sized. High values indicate one dominant cluster or many tiny fragments, which might warrant parameter adjustment.

Range: 0 to ∞ • Perfect: 0 • Lower is better
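
Both of these distribution diagnostics are one-liners given the label vector. A quick NumPy sketch (toy labels, with -1 marking noise):

    import numpy as np

    labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, -1, -1])  # toy label vector

    noise_ratio = np.mean(labels == -1)          # fraction of points labeled noise

    sizes = np.bincount(labels[labels != -1])    # cluster sizes, noise excluded
    size_cv = sizes.std() / sizes.mean()         # coefficient of variation

    print(f"noise ratio: {noise_ratio:.1%}, size CV: {size_cv:.2f}")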

💡 Tip: No single metric tells the whole story. Use these together with visual inspection. Different metrics may disagree! That's normal and reflects different aspects of clustering quality.

DBSCAN: what it is, what it's used for, and why it's different

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Instead of forcing clusters to be “round” (like k-means), it builds clusters by finding regions of high point density and expanding outward from them. Points that don't belong to any dense region are labeled as noise / outliers. [1][3]

What is it used for?

  • Arbitrary-shape clusters: long streaks, rings, blobs, multi-lobed shapes (not just spherical clusters). [1]
  • Outlier detection: “noise points” are explicitly labeled (often -1 in libraries). [3]
  • No need to pre-pick “k” clusters: you don't specify the number of clusters in advance. [3]

Important limitation: DBSCAN works best when clusters have similar density. If one cluster is very dense and another very sparse, a single global eps can be too small for the sparse cluster (it all becomes “noise”) or so large that it merges clusters that should stay separate. [3][1]

How many dimensions can DBSCAN handle?

DBSCAN is not limited to 2D plots. It works on data with any number of features (dimensions) as long as you can define a distance / similarity measure and do neighborhood queries (“find all points within eps of this point”). [3]

However, choosing eps becomes harder in high-dimensional data because distances can lose contrast (many points start to look similarly far apart), and that makes parameterization tricky. [2]

Key concepts (the vocabulary you'll see everywhere)

  • eps-neighborhood: all points within distance eps of a point. [3]
  • Core point: has at least minPts points in its eps-neighborhood (most implementations, including this one, count the point itself). [1][3]
  • Border point: is within eps of a core point, but does not itself have enough neighbors to be core. [1]
  • Noise point: not density-reachable from any core region (doesn't belong to any cluster). [1]
  • Density-reachable / density-connected: formal definitions that explain how clusters are “grown” through chains of core points (and attached border points). [1]
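
These definitions map directly onto code. A sketch that classifies every point as core, border, or noise using scikit-learn radius queries (which, like this tool, count the point itself as a neighbor):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
    eps, min_pts = 0.15, 5

    # Indices of every point's eps-neighborhood (includes the point itself).
    nbrs = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nbrs.radius_neighbors(X, return_distance=False)

    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])
    # Border: not core, but within eps of at least one core point.
    is_border = ~is_core & np.array([is_core[nb].any() for nb in neighborhoods])
    is_noise = ~is_core & ~is_border

    print(is_core.sum(), "core |", is_border.sum(), "border |", is_noise.sum(), "noise")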

How DBSCAN works + picking good parameters (k-distance “elbow” for eps)

Short walkthrough of the algorithm

DBSCAN can be understood as: “find dense seeds, then flood-fill density-connected points.” The original paper describes expanding a cluster starting from a seed point and repeatedly querying neighborhoods to grow the cluster. [1]

  1. Visit an unvisited point p and compute its neighbors within eps. [1][3]
  2. If p has fewer than minPts neighbors, label it noise (it may later become a border point if reached by a nearby core). [1]
  3. If p is a core point, start a new cluster and put its neighbors into a “seed list/queue”. [1]
  4. Expand the cluster: pop a point from the seed list; if it is core, add its neighbors to the seed list. Continue until the seed list is empty (cluster cannot expand further). [1]
  5. Repeat until all points are visited. [1]

Why it finds arbitrary shapes: the cluster grows through chains of density-reachable core points. This is why a ring can be found as one cluster even though it is not “round” in the k-means sense. [1]
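
The walkthrough above translates into surprisingly little code. A minimal, unoptimized Python rendering of the flood-fill (brute-force neighborhood queries; a real implementation, including this tool, would use a spatial index such as a KD-tree):

    import numpy as np

    def dbscan(X, eps, min_pts):
        n = len(X)
        labels = np.full(n, -1)            # -1 = noise until proven otherwise
        visited = np.zeros(n, dtype=bool)
        cluster = 0

        def region_query(i):
            # Brute-force eps-neighborhood; includes point i itself.
            return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

        for p in range(n):
            if visited[p]:
                continue
            visited[p] = True
            neighbors = region_query(p)
            if len(neighbors) < min_pts:
                continue                   # noise for now; may become border later
            labels[p] = cluster            # p is core: start a new cluster
            seeds = list(neighbors)        # seed list for expansion
            while seeds:
                q = seeds.pop()
                if labels[q] == -1:
                    labels[q] = cluster    # claim q (border point or new core)
                if visited[q]:
                    continue
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    seeds.extend(q_neighbors)  # q is core: keep growing through it
            cluster += 1
        return labels

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal((0, 0), 0.1, (100, 2)), rng.normal((1, 1), 0.1, (100, 2))])
    print(np.unique(dbscan(X, eps=0.2, min_pts=5)))  # two separated blobs -> labels 0 and 1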

The two parameters that matter: minPts and eps

Most DBSCAN implementations expose:

  • minPts (a.k.a. min_samples): how many points must be inside the eps-neighborhood for a point to count as core. Larger values demand denser clusters; smaller values allow sparser clusters. [3]
  • eps: the maximum distance at which two points are considered neighbors (used for all neighborhood queries). This is the most sensitive parameter; smaller values typically yield more (and smaller) clusters. [3]
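
The sensitivity of eps is easy to see empirically. A small sketch sweeping eps on the classic two-moons dataset:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

    # Smaller eps -> more (and smaller) clusters plus more noise;
    # larger eps -> fewer clusters that eventually merge into one.
    for eps in (0.05, 0.15, 0.40):
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        n_clusters = labels.max() + 1      # labels are 0..k-1, noise is -1
        n_noise = (labels == -1).sum()
        print(f"eps={eps:.2f}: {n_clusters} clusters, {n_noise} noise points")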

Picking minPts (practical guidance)

  • In many datasets, minPts can stay near a small value (e.g., 4 is a commonly cited default for 2D data). [2]
  • A commonly suggested heuristic is minPts = 2 * dim (twice the dataset dimensionality). If you have more noise, huge datasets, high dimensions, or duplicates, increasing minPts can help. [2]

Picking eps using a k-distance plot (the “elbow” idea)

A standard heuristic is to compute a k-distance plot: for each point, compute the distance to its k-th nearest neighbor, then sort these distances and plot them. The original DBSCAN paper discusses choosing eps using k-nearest-neighbor distances (for 2D, it describes using the distance to the 4th nearest neighbor as a heuristic). [2]

In DBSCAN “parameter heuristics” discussions, a common mapping is: k corresponds to minPts = k + 1 (because range queries include the point itself), so if you pick minPts, you often use k = minPts - 1 in the k-distance plot. [2]

k-distance plot recipe (high level)

1) Choose minPts (or try a few values)
2) For each point i:
     di = distance_to_kth_nearest_neighbor(i, k=minPts-1)
3) Sort {di} from largest to smallest (or smallest to largest - just be consistent)
4) Look for the “knee/elbow”:
     - left side: steep region (very isolated points / noise)
     - right side: flatter region (dense interiors)
5) Pick eps near the knee (often slightly below it)

Reality check: sometimes there is no clear “valley / knee / elbow.” In that case, you're effectively choosing a trade-off. Experience often favors the lower end of the plausible range (smaller eps) to avoid merging clusters. [2]
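
The recipe takes only a few lines with scikit-learn and matplotlib. A sketch, using k = minPts - 1 because the query point itself comes back as its own nearest neighbor:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_moons
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)
    min_pts = 4                  # the 2 * dim heuristic for 2D data
    k = min_pts - 1

    # Column -1 holds each point's distance to its k-th nearest neighbor
    # (column 0 is the distance to itself, which is 0).
    distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    k_dist = np.sort(distances[:, -1])

    plt.plot(k_dist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.title("k-distance plot: pick eps near the knee")
    plt.show()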

When DBSCAN becomes annoying

  • High-dimensional data: selecting eps becomes difficult as distance contrast degrades. Alternatives like OPTICS / HDBSCAN* remove the need to choose a single global eps (though high dimensionality is still hard in general). [2]
  • Mixed densities: one global eps can cause sparse clusters to disappear or dense clusters to merge. [1][3]

References

  1. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD '96. (DBSCAN original paper)
  2. Schubert, E., Sander, J., Ester, M., Kriegel, H.-P., & Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM TODS, 42(3). (Parameter heuristics, k-distance plots, high-dimensional caveats)
  3. scikit-learn documentation: sklearn.cluster.DBSCAN (eps/min_samples definitions, behavior notes, references).

Real-world applications: medicine, imaging, engineering, and science

DBSCAN's ability to find arbitrarily shaped clusters and explicitly identify outliers makes it valuable across many domains. Below are representative applications organized by field.

Medicine & Healthcare

  • Disease outbreak detection: Identifying spatial clusters of disease cases (e.g., COVID-19 hotspots, cancer clusters) where the shape of the outbreak region is unknown and isolated cases should be flagged as noise.
  • Patient stratification: Grouping patients by clinical biomarkers (lab values, vitals) to discover phenotypes or subtypes of a disease without assuming how many groups exist.
  • Anomaly detection in ECG/EEG signals: After feature extraction, DBSCAN can identify abnormal heartbeat or brainwave patterns as noise points distinct from normal clusters.
  • Genomics: Clustering gene expression profiles or single-cell RNA-seq data to identify cell types or states, where cluster shapes in high-dimensional space are non-spherical.

Image Processing & Computer Vision

  • Image segmentation: Grouping pixels by color/texture similarity in LAB or RGB space to segment objects without predefined region shapes.
  • Object detection preprocessing: Clustering keypoints or feature descriptors (e.g., SIFT, ORB) to identify distinct objects in a scene.
  • LiDAR point cloud processing: Segmenting 3D point clouds from autonomous vehicles or drones to identify objects (cars, pedestrians, buildings) with irregular shapes.
  • Medical imaging: Detecting tumors or lesions in CT/MRI scans by clustering voxel intensities, where tumor boundaries are irregular.
  • Satellite imagery: Identifying land-use regions, urban sprawl patterns, or deforestation areas from multispectral image features.

Engineering & Manufacturing

  • Predictive maintenance: Clustering sensor readings (vibration, temperature, pressure) from industrial equipment to detect anomalous operating states before failure.
  • Quality control: Identifying defective products on an assembly line by clustering measurement data; outliers represent manufacturing defects.
  • Network intrusion detection: Clustering network traffic patterns to identify normal behavior clusters and flag anomalous (potentially malicious) traffic as noise.
  • Structural health monitoring: Analyzing strain gauge or accelerometer data from bridges, buildings, or aircraft to detect damage patterns.
  • Semiconductor manufacturing: Clustering wafer test data to identify process drifts or equipment issues affecting chip yield.

Scientific Research

  • Astronomy: Identifying galaxy clusters, star clusters, or cosmic structures from survey data where cluster shapes are highly irregular.
  • Particle physics: Clustering particle tracks or energy deposits in detectors to reconstruct collision events and identify anomalous signatures.
  • Climate science: Grouping weather stations or grid cells by climate patterns to identify climate zones or detect unusual weather events.
  • Ecology: Clustering animal GPS tracks to identify home ranges, migration corridors, or unusual movement patterns indicating distress.
  • Chemistry/Materials science: Clustering molecular simulations or spectroscopy data to identify distinct molecular conformations or material phases.

Business & Social Science

  • Customer segmentation: Grouping customers by purchasing behavior or demographics without assuming the number of segments.
  • Fraud detection: Identifying unusual transaction patterns in banking or insurance as outliers from normal behavior clusters.
  • Social network analysis: Detecting communities in social graphs or identifying bot accounts as anomalies in user behavior feature space.
  • Urban planning: Clustering GPS or mobile phone data to identify activity centers, commuting patterns, or underserved areas.

Why DBSCAN for these applications?
  • No need to specify the number of clusters in advance
  • Can find clusters of arbitrary shape (not just spherical)
  • Built-in outlier detection (noise points)
  • Robust to outliers: they don't distort cluster centers
  • Deterministic results (unlike k-means with random initialization)

When to consider alternatives:
  • Varying density clusters: Use HDBSCAN or OPTICS instead
  • Very high dimensions: Consider dimensionality reduction first (PCA, UMAP)
  • Massive datasets: Use approximate methods or spatial indexing (this tool uses a KD-tree)
  • Streaming data: Consider incremental variants like DenStream

Technical Design Overview: Architecture & Implementation

This visualizer combines Rust/WebAssembly for high-performance clustering with JavaScript for UI orchestration and HTML5 Canvas for rendering. Clustering results are validated against sklearn.cluster.DBSCAN via Python backchecks.

Architecture

The application runs entirely in the browser across three layers: the UI layer (HTML/CSS) handles user input, visualization, and theming. The JavaScript orchestrator manages CSV parsing, data preparation, animation state, and coordinates calls to WebAssembly. The Rust/WASM compute core performs all the heavy lifting including KD-tree construction, DBSCAN clustering, core/border classification, and all metric calculations. Data passes between JavaScript and WASM via shared Float32Array buffers.

Separately, a Python validation script using scikit-learn verifies that WASM outputs match sklearn's DBSCAN exactly.

Technology Stack

Layer        | Technology              | Purpose
-------------|-------------------------|------------------------------------------------
UI           | HTML5, CSS3, Canvas API | User input, visualization, theming
Orchestrator | JavaScript (ES6+)       | Data parsing, state management, WASM calls
Compute Core | Rust -> WebAssembly     | Clustering, metrics, spatial indexing (KD-tree)
Validation   | Python + scikit-learn   | Correctness verification against sklearn

Performance

Rust/WASM provides roughly a 10× speedup over pure JavaScript for clustering operations. On a 30,000-point dataset, JavaScript takes ~6,200 ms while Rust/WASM completes in ~600 ms. Take the comparison with a grain of salt: I put far more optimization effort into the Rust code than into the JavaScript baseline.

Data Flow

CSV files are parsed in JavaScript, where users select numeric columns for clustering. The selected features are flattened into a row-major Float32Array and passed across the WASM boundary. The Rust core handles missing values (drop, impute, or fail), optionally standardizes features via z-score normalization, builds a KD-tree for efficient neighbor queries, runs DBSCAN, and returns cluster labels plus quality metrics back to JavaScript for rendering on the canvas.
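
For illustration, the preprocessing described above can be mirrored in a few lines of Python. This is only a sketch of the described behavior, not the actual Rust API, and the policy names ("drop", "impute", "fail") are placeholders:

    import numpy as np

    def prepare(X, missing="drop", standardize=True):
        """Sketch of the described pipeline: NaN policy, then optional z-score."""
        X = np.asarray(X, dtype=np.float32)            # row-major f32, as sent to WASM
        nan_rows = np.isnan(X).any(axis=1)
        if missing == "drop":
            X = X[~nan_rows]                           # drop incomplete rows
        elif missing == "impute":
            col_means = np.nanmean(X, axis=0)
            X = np.where(np.isnan(X), col_means, X)    # fill NaNs with column means
        elif nan_rows.any():
            raise ValueError("NaN values present")     # "fail" policy
        if standardize:
            # z-score computed on the rows actually used, matching the tool.
            X = (X - X.mean(axis=0)) / X.std(axis=0)
        return X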

Key Implementation Details

  • KD-Tree spatial index: average-case O(log n) per neighborhood query, i.e., O(n log n) across all points, vs O(n^2) for brute force.
  • f64 distance calculations: 64-bit float precision internally achieves 100% cluster assignment match with sklearn.
  • Missing value policies: Drop rows, impute with column mean, or fail on NaN.
  • Standardization: Optional z-score normalization computed on used rows only.
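
For intuition about what the KD-tree buys, here is the same all-neighborhoods pattern with SciPy's KD-tree (not this tool's Rust implementation):

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.default_rng(0).random((30_000, 2))
    tree = cKDTree(X)                        # O(n log n) construction

    # Every eps-neighborhood in one call; each list includes the point itself.
    neighborhoods = tree.query_ball_point(X, r=0.01)
    print("avg neighborhood size:", np.mean([len(nb) for nb in neighborhoods]))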

Clustering Quality Metrics

Computed in the WASM core:

  • Silhouette Score: Cluster cohesion and separation (-1 to +1)
  • Davies-Bouldin Index: Average cluster similarity (lower = better)
  • Calinski-Harabasz Index: Between/within cluster dispersion ratio (higher = better)
  • DBCV: Density-Based Clustering Validation, designed for DBSCAN
  • Noise Ratio & Cluster Size CV: Distribution diagnostics

Validation: Outputs are verified against sklearn.cluster.DBSCAN using permutation-invariant comparison. Expected: ARI >0.99, AMI >0.96, exact match rate >0.98.
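
ARI and AMI are exactly the permutation-invariant scores scikit-learn ships for this. A sketch of such a check, where wasm_labels is a placeholder for labels exported from the WASM core:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

    X = np.random.default_rng(0).random((1_000, 2))
    sk_labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)
    wasm_labels = sk_labels.copy()           # placeholder for the WASM output

    print("ARI:", adjusted_rand_score(sk_labels, wasm_labels))
    print("AMI:", adjusted_mutual_info_score(sk_labels, wasm_labels))
    # Exact match assumes identical label numbering (ARI/AMI do not).
    print("exact match rate:", np.mean(sk_labels == wasm_labels))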

Tradeoffs

Decision                 | Benefit                 | Cost
-------------------------|-------------------------|-----------------------------
f32 storage, f64 compute | Memory efficiency       | Cast overhead (~negligible)
Sampled Silhouette       | Fast for large datasets | Slight variance
Single-threaded WASM     | Wider browser support   | No parallelism
KD-tree (not ball-tree)  | Simpler implementation  | Less optimal for high dims

Scalability

Points   | Time     | Recommendation
---------|----------|--------------------------
<10K     | <100ms   | Full metrics
10K-50K  | 100ms-1s | Full metrics
50K-100K | 1-5s     | Consider disabling DBCV
>100K    | >5s      | Consider chunking

Browser Compatibility: Requires WebAssembly (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+).

Version 2.1.1 - Feel free to find me on LinkedIn or email me at [email protected] if you'd like to discuss further.