Research Paper

GRASP: Graph-Routed Adaptive Spectral Partitioning

An Autonomous Topology-Aware Clustering System with Oracle-Guided Routing

GRASP is a parameter-free clustering pipeline that constructs a topological fingerprint of any dataset, estimates the natural cluster count automatically, and dispatches to the most suitable specialist algorithm — without requiring labels, validation data, or user-specified hyperparameters.

26 benchmark datasets
140 ms for 50K points
6 specialist algorithms

Problem

No single clustering algorithm is universally optimal — the No Free Lunch Theorem formalises this. The optimal choice depends critically on the geometric and statistical properties of the dataset, which are precisely unknown in unsupervised settings. Existing approaches either require user-specified hyperparameters or rely on heuristic ensemble methods.

Approach

GRASP presents a parameter-free autonomous pipeline that identifies the input distribution's geometric structure, estimates k via the normalised spectral gap, and delegates the final assignment to one of six specialists (KMeans++, TopSOM, GNG-DenSom, WardAgglom, KnnSpectral, or RandNystrom), chosen by an oracle that maps topological and Euclidean diagnostic signals to the algorithm most likely to achieve a Bayes-optimal partitioning.

Results

Evaluated on 26 benchmark datasets spanning non-convex shapes, SIPU S-sets and A-sets, UCI real-world data, and high-dimensional settings, GRASP achieves ARI 0.378 on interlocking rings against DBSCAN's near-zero ARI, passes all 49 unit tests, and processes 50,000 points within 140 ms, including full GNG training and Nyström embedding.

Contribution

A topology-prior partitioning principle, geometry-adaptive oracle routing, and selective spectral deployment — three design principles that together produce an interpretable, parameter-free system whose oracle converges to the Bayes-optimal specialist as n → ∞.

The fundamental difficulty in clustering is that the optimal algorithm depends on geometric and statistical properties of the dataset that are precisely unknown in the unsupervised setting. In practice, practitioners execute several algorithms in succession, selecting among them via visual inspection or labelled validation data — neither of which is available in genuinely unsupervised analysis.

The Normalised Spectral Gap

A topology-faithful graph makes cluster structure computationally accessible. The normalised spectral gap δ_k = (λ_{k+1} − λ_k) / max(λ_{k+1}, ε) is large when the graph decomposes cleanly into well-separated subgraphs, and small when cluster boundaries are topologically entangled. GRASP uses this gap both to estimate k and to calibrate the oracle's routing decisions.
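
As a concrete illustration of the eigengap rule δ_k = (λ_{k+1} − λ_k) / max(λ_{k+1}, ε) defined in Stage 4, the selection can be run on a toy Laplacian spectrum. A minimal Python sketch (the ε floor of 10⁻⁶ and the helper name are illustrative choices, not the implementation's):

```python
EPS = 1e-6  # assumed floor: keeps near-zero numerical gaps from dominating

def estimate_k(eigvals):
    """k* = argmax_k (lam_{k+1} - lam_k) / max(lam_{k+1}, EPS), 1-based."""
    gaps = [(eigvals[i + 1] - eigvals[i]) / max(eigvals[i + 1], EPS)
            for i in range(len(eigvals) - 1)]
    return 1 + max(range(len(gaps)), key=gaps.__getitem__)

# Three near-zero eigenvalues, then a jump: the graph has three components.
spectrum = [0.0, 2e-8, 5e-8, 0.71, 0.83, 0.95]
print(estimate_k(spectrum))  # -> 3
```

The ε floor matters: the first few eigenvalues of a normalised Laplacian are numerically near zero, so without the floor their mutual ratios would produce spurious large gaps.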

GRASP's three design principles separate it from existing methods. First, topology-prior partitioning constructs a topology map before committing to any partitioning strategy, at cost O(nm) independent of k. Second, geometry-adaptive routing selects the specialist on the basis of diagnostics extracted from the topology map — not heuristics external to the pipeline. Third, selective spectral deployment applies full spectral computation only when topological evidence indicates Euclidean methods are insufficient, keeping the typical cost at O(nm).

Full spectral clustering incurs O(n²) memory for the kernel matrix and O(n³) for eigendecomposition — prohibitive at n ≥ 10,000. GRASP's Nyström extension with GNG landmarks reduces this to O(nmk) with data-proportional coverage, enabling spectral-quality clustering at linear cost.


Stage 1 — Convex-Bypass Probe

On a random subsample X̂ of size n_s = 800, GRASP runs r = 2 independent KMeans++ restarts with k = k_hint and computes the mean silhouette ŝ of the best labelling. If ŝ > 0.55, all n points are assigned by nearest-centroid and the pipeline terminates at cost O(n_s·k·d). This fast convex bypass resolves 16 of 26 benchmark datasets, completing in 2–22 ms.
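
The probe's logic can be sketched with a from-scratch KMeans++ and silhouette; the toy blobs, seed, and helper names are illustrative, not the paper's implementation:

```python
import random, math

random.seed(0)

def dist(a, b):
    return math.dist(a, b)

def kmeans(points, k, iters=20):
    # KMeans++ seeding: first centre uniform, later centres weighted by D^2.
    centres = [random.choice(points)]
    while len(centres) < k:
        w = [min(dist(p, c) ** 2 for c in centres) for p in points]
        centres.append(random.choices(points, weights=w)[0])
    for _ in range(iters):  # Lloyd iterations
        labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres

def mean_silhouette(points, labels):
    k = max(labels) + 1
    scores = []
    for i, p in enumerate(points):
        same = [q for q, l in zip(points, labels) if l == labels[i]]
        a = sum(dist(p, q) for q in same) / max(len(same) - 1, 1)
        b = min(sum(dist(p, q) for q, l in zip(points, labels) if l == j)
                / labels.count(j)
                for j in range(k) if j != labels[i] and labels.count(j) > 0)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated blobs: the probe should fire (s-hat > 0.55) and bypass.
blob = lambda cx, cy: [(cx + random.gauss(0, .1), cy + random.gauss(0, .1))
                       for _ in range(100)]
X = blob(0, 0) + blob(5, 5)
best = max((kmeans(X, 2) for _ in range(2)),           # r = 2 restarts
           key=lambda lc: mean_silhouette(X, lc[0]))
s_hat = mean_silhouette(X, best[0])
print(s_hat > 0.55)  # convex bypass fires; assign all points by nearest centroid
```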

Stage 2 — Growing Neural Gas Topology

GRASP trains a GNG on X using min(n, 8000) points for T = 50 epochs, at cost O(nmT). GNG inserts new neurons at locations of maximal quantisation error, yielding a topology map whose connectivity emerges from the data. The Competitive Hebbian Learning theorem guarantees that the GNG edge set is a subgraph of the Delaunay triangulation of the neuron positions, providing a formal topological-fidelity guarantee. Edges older than d_max are pruned every λ steps.
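
The GNG update loop can be sketched as follows. All parameters (learning rates, edge-age limit, insertion interval) are illustrative defaults rather than GRASP's settings, and the sketch omits refinements such as deleting isolated neurons:

```python
import random, math, itertools

random.seed(1)

class GNG:
    """Minimal Growing Neural Gas sketch (parameters are illustrative)."""
    def __init__(self, eps_b=0.2, eps_n=0.006, age_max=50, lam=100, alpha=0.5, d=0.995):
        self.w = [[random.random(), random.random()] for _ in range(2)]  # neurons
        self.err = [0.0, 0.0]                  # accumulated quantisation error
        self.edges = {}                        # frozenset({i, j}) -> age
        self.p = (eps_b, eps_n, age_max, lam, alpha, d)

    def step(self, x, t):
        eps_b, eps_n, age_max, lam, alpha, d = self.p
        # 1. find the two nearest neurons
        order = sorted(range(len(self.w)), key=lambda i: math.dist(x, self.w[i]))
        s1, s2 = order[0], order[1]
        self.err[s1] += math.dist(x, self.w[s1]) ** 2
        # 2. move the winner and its topological neighbours toward x
        for k in range(2):
            self.w[s1][k] += eps_b * (x[k] - self.w[s1][k])
        for e in list(self.edges):
            if s1 in e:
                n = next(iter(e - {s1}))
                for k in range(2):
                    self.w[n][k] += eps_n * (x[k] - self.w[n][k])
                self.edges[e] += 1             # age the winner's edges
        # 3. connect the two winners (Competitive Hebbian Learning)
        self.edges[frozenset({s1, s2})] = 0
        # 4. prune edges older than age_max
        self.edges = {e: a for e, a in self.edges.items() if a <= age_max}
        # 5. every lam steps, insert a neuron where quantisation error is largest
        if t % lam == 0:
            q = max(range(len(self.w)), key=self.err.__getitem__)
            nbrs = [next(iter(e - {q})) for e in self.edges if q in e]
            f = max(nbrs, key=self.err.__getitem__) if nbrs else q
            self.w.append([(a + b) / 2 for a, b in zip(self.w[q], self.w[f])])
            self.err[q] *= alpha; self.err[f] *= alpha
            self.err.append(self.err[q])
            r = len(self.w) - 1
            self.edges.pop(frozenset({q, f}), None)
            self.edges[frozenset({q, r})] = 0
            self.edges[frozenset({f, r})] = 0
        self.err = [e * d for e in self.err]   # global error decay

# Train on two blobs; the graph grows from its 2-neuron seed, 1 neuron per lam steps.
data = [(random.gauss(0, .3), random.gauss(0, .3)) for _ in range(300)] + \
       [(random.gauss(4, .3), random.gauss(4, .3)) for _ in range(300)]
g = GNG()
for t, x in enumerate(itertools.islice(itertools.cycle(data), 2000), 1):
    g.step(x, t)
print(len(g.w), len(g.edges))  # neuron and edge counts after 2000 steps
```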

Stage 3 & 4 — Graph Laplacian & Eigengap k*

The bandwidth σ = median({‖w_i − w_j‖}) is estimated from the GNG edge lengths, and the sparse affinity matrix A_ij = exp(−‖w_i − w_j‖² / 2σ²) is built over the m neurons. The normalised Laplacian is then computed at a cost that depends only on m, not on n, and k* is estimated from the normalised eigengap:

L = I − D^(−1/2) A D^(−1/2),   δ_k = (λ_{k+1} − λ_k) / max(λ_{k+1}, ε),   k* = arg max_k δ_k

The normalisation by max(λ_{k+1}, ε) makes the estimate robust to spectral scale by construction.
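
Stages 3 and 4 can be reproduced end to end on a toy affinity matrix. This is a minimal Python sketch, not the Rust implementation: the Jacobi routine stands in for a proper sparse eigensolver, and the three-clique affinity is illustrative:

```python
import math

def jacobi_eigvals(M, sweeps=50):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(M)
    A = [row[:] for row in M]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-12:
                    continue
                t = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(t), math.sin(t)
                for k in range(n):             # rows p, q  <-  J^T A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):             # cols p, q  <-  A J
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
    return sorted(A[i][i] for i in range(n))

def normalised_laplacian(A):
    """L = I - D^(-1/2) A D^(-1/2)."""
    d = [sum(row) for row in A]
    return [[(1.0 if i == j else 0.0) - A[i][j] / math.sqrt(d[i] * d[j])
             for j in range(len(A))] for i in range(len(A))]

# Toy affinity: three disconnected 3-node cliques -> three zero eigenvalues of L.
n, A = 9, [[0.0] * 9 for _ in range(9)]
for start in (0, 3, 6):
    for i in range(start, start + 3):
        for j in range(start, start + 3):
            A[i][j] = 1.0
lam = jacobi_eigvals(normalised_laplacian(A))
gaps = [(lam[i + 1] - lam[i]) / max(lam[i + 1], 1e-6) for i in range(n - 1)]
k_star = 1 + max(range(n - 1), key=gaps.__getitem__)
print(k_star)  # -> 3
```

The spectrum comes out as three (numerically) zero eigenvalues followed by six eigenvalues at 1, so the normalised gap peaks at k = 3.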

Stage 5 — Nyström Embedding

Given the GNG eigenvectors U ∈ ℝ^(m×k*), each data point x is embedded as

φ_j(x) = Σ_{i=1}^{m} K(x, w_i) U_{ij} / √(D_x · d_i),   j = 1, …, k*,

where d_i is the degree of landmark i and D_x = Σ_i K(x, w_i). Each embedding row φ(x) is ℓ²-normalised. Because GNG neurons concentrate in proportion to data density, they systematically cover all cluster regions, whereas random landmarks may miss thin or low-density structures. The Rust implementation distributes this O(nmk) loop across all cores via Rayon.
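
A minimal sketch of the out-of-sample extension, assuming d_i is the landmark degree and D_x = Σ_i K(x, w_i); the landmarks, the indicator-style eigenvector matrix U, and the degrees below are illustrative toy values, not outputs of the pipeline:

```python
import math

def phi(x, W, U, d, sigma=1.0):
    """Nystrom embedding of x given m landmarks W, landmark eigenvectors
    U (m x k) and landmark degrees d; returns the l2-normalised row."""
    kx = [math.exp(-math.dist(x, w) ** 2 / (2 * sigma ** 2)) for w in W]
    Dx = sum(kx)                                   # degree of the query point
    k = len(U[0])
    row = [sum(kx[i] * U[i][j] / math.sqrt(Dx * d[i]) for i in range(len(W)))
           for j in range(k)]
    norm = math.sqrt(sum(v * v for v in row)) or 1.0
    return [v / norm for v in row]

# Two landmark clusters with indicator-style eigenvectors (toy values).
W = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 5.1)]
U = [[1, 0], [1, 0], [0, 1], [0, 1]]
d = [2.0, 2.0, 2.0, 2.0]                           # assumed landmark degrees
a = phi((0.1, 0.0), W, U, d)
b = phi((5.1, 5.0), W, U, d)
print(a, b)  # the two points embed near different axes of the spectral space
```

In the embedded space, points near the first landmark pair land close to (1, 0) and points near the second pair close to (0, 1), which is exactly the separation the final KMeans step exploits.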

Stage 6 — Six Specialist Sub-Algorithms

GRASP instantiates six specialists, each targeting a different geometric regime:

  • KMeans++ — O(nkd), on original features. Best for approximately Gaussian clusters.
  • TopSOM — O(nm²), Laplacian → Eigengap → Nyström → KMeans. Full spectral pipeline on GNG topology.
  • GNG-DenSom — O(nm), density-watershed segmentation on the GNG graph.
  • WardAgglom — O(n²), Ward linkage on the Nyström embedding; applied only when n ≤ 1000.
  • KnnSpectral — Full k-NN spectral clustering on original features.
  • RandNystrom — O(nm), Nyström with GNG landmarks, KMeans final step.

Stage 7 — Routing Oracle

The oracle maps three diagnostic signals to an algorithm choice: s_ts (TopSOM silhouette), s_km (KMeans silhouette), and the spectral advantage Δ. The routing logic encodes four cases: Case 2.5 (high-confidence Gaussian: ρ > 0.70 and s_km > 0.60 → KMeans); Case 2 (low ARI signals non-convex geometry → KnnSpectral/WardAgglom/TopSOM); Cases 3a/3a′ (explicit spectral advantage → TopSOM for large graphs); Case 3b (silhouette fallback → the specialist with the largest of s_ts, s_km, s_rn).

Δ = s_TopSOM − s_KMeans:   a spectral advantage Δ > 0.20 (with p > 0.05) routes to TopSOM
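
The case analysis can be sketched as a plain decision function. The 0.30 cut-off for "low ARI" in Case 2 is an assumed placeholder (the text says only "low"), and the choice among KnnSpectral/WardAgglom/TopSOM inside Case 2 is simplified to a single specialist:

```python
def route(rho, s_km, s_ts, s_rn, delta):
    """Sketch of the oracle's case analysis; thresholds 0.70/0.60/0.20 are
    from the text, 0.30 is an assumed placeholder."""
    if rho > 0.70 and s_km > 0.60:   # Case 2.5: high-confidence Gaussian
        return "KMeans"
    if delta > 0.20:                 # Cases 3a/3a': explicit spectral advantage
        return "TopSOM"
    if rho < 0.30:                   # Case 2: non-convex geometry (assumed cut-off)
        return "KnnSpectral"
    # Case 3b: silhouette fallback, pick the specialist with the best score
    return max([("TopSOM", s_ts), ("KMeans", s_km), ("RandNystrom", s_rn)],
               key=lambda t: t[1])[0]

print(route(rho=0.9, s_km=0.7, s_ts=0.4, s_rn=0.3, delta=0.0))  # -> KMeans
print(route(rho=0.5, s_km=0.2, s_ts=0.5, s_rn=0.3, delta=0.3))  # -> TopSOM
```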

GRASP was evaluated on 26 benchmark datasets in six categories: non-convex shapes (moons, circles, spirals, interlocking rings, swiss_roll_2d, anisotropic, varied_density; n = 150–1000), SIPU S-sets s1–s4, A-sets a1–a3, UCI real-world (Iris, Wine, Breast Cancer, Digits), and scalability datasets at n = 50,000.

Interlocking Rings — GRASP ARI 0.378

On the interlocking rings benchmark, where cluster boundaries are maximally topologically entangled, DBSCAN scores ARI 0.055, HDBSCAN 0.055, and KnnSpectral 0.234; GRASP achieves ARI 0.378 via TopSOM routing, roughly a 7× improvement over DBSCAN and 60% over KnnSpectral.

Non-convex benchmarks show the clearest advantage: GNG correctly captures the ring and spiral topology, the Laplacian eigengap identifies k = 2 or k = 3, and the oracle routes to TopSOM, which recovers the true partition via spectral embedding. KMeans++ achieves mean ARI 0.188 on these datasets; GRASP achieves 0.376.

A-sets and S-sets (overlapping Gaussian clusters): the convex-bypass probe correctly activates on S4 (tight clusters, silhouette > 0.55) and routes to KMeans on A1 (ρ = 0.862). On A3 (heavily overlapping, k = 50), GRASP routes to KMeans++ and achieves ARI 0.216, versus 0.188 for KMeans++ run standalone.

Scalability: the GNG (trained on at most min(n, 8000) samples), Laplacian, and eigendecomposition stages incur O(m³) operations whose cost does not grow with n beyond that cap. The sole n-dependent step is the Nyström extension at O(nmk); at n = 50,000 this completes in ~140 ms. Standard spectral clustering requires O(n²d) affinity construction and O(n³) eigendecomposition; at n = 10,000 the latter alone is ~10¹² operations, whereas GRASP's O(nm) path is a 500× reduction at comparable quality.

GRASP's three main theoretical results establish conditions under which each component achieves its design goal.

Theorem 2 — Spectral Cluster Recovery

If X is drawn from k distributions with compact supports C_1, …, C_k satisfying min_{j≠l} d(C_j, C_l) ≥ Δ > 0, and the GNG mean edge length satisfies σ < Δ/2, then L has exactly k zero eigenvalues, whose eigenvectors span a subspace in which the k cluster indicator vectors are mutually orthogonal.

Theorem 3 (Nyström error): For the rank-m Nyström approximation K̃ with GNG landmarks, ‖K − K̃‖_F ≤ λ_{m+1}(K) · √(n − m). The eigenvector error for the leading k components is O(λ_{k+1} / (λ_k − λ_{k+1})), which diminishes as the spectral gap increases; this is exactly the regime in which GRASP routes to TopSOM.

Theorem 4 (Delaunay subgraph, Fritzke 2009): The GNG edge set is a subgraph of the Delaunay triangulation of the neuron positions. Hence any topologically separable cluster boundary present in the data's Delaunay structure is also represented in the GNG graph, bounding the Nyström approximation quality from below.

Proposition 5 (Oracle consistency): As n → ∞ with the cluster count fixed, the routing oracle selects the Bayes-optimal specialist with probability approaching 1. The oracle's inputs (the ARI estimate ρ, the silhouette coefficients, and the spectral advantage Δ) all converge by the law of large numbers, and by the eigenvector stability of symmetric operators the routing decision stabilises at the label of the asymptotically optimal specialist.

GRASP demonstrates that autonomous, parameter-free clustering is achievable through principled geometric diagnosis rather than heuristic ensemble voting. The three-principle architecture — topology prior, geometry-adaptive routing, selective spectral deployment — produces a system that is simultaneously faster than full spectral methods, more robust than density methods on complex topology, and more accurate than KMeans++ on non-convex structure.

The fast convex-bypass terminates 16 of 26 benchmarks in under 22ms. For the remaining 10, the full pipeline achieves state-of-the-art accuracy while processing 50,000 points within 140ms — scaling as O(nm) in n, independent of the O(m³) topology computation which depends only on the GNG size m.

Future directions include extending GRASP's routing oracle with learned diagnostic features via a small neural classifier trained on the topology signals, integrating SOM-TSK's multi-start seeding as a replacement for KMeans++ in the KMeans specialist, and evaluating GRASP on single-cell RNA-seq datasets where cluster topology is biologically meaningful and ground truth is available via cell-type markers.
