SOM-TSK: Topology-Seeded Clustering Framework
A Self-Organizing Map Framework with Deterministic Multi-Start Initialization and Adaptive Selection
SOM-TSK exploits the manifold-mapping properties of a trained Self-Organizing Map to generate topology-guided seed pools for deterministic K-means — matching or exceeding KMeans++ on every one of 24 benchmark datasets, with zero losses.
Abstract
Problem
Clustering remains one of the most widely applied tasks in unsupervised machine learning. The K-means algorithm is ubiquitous owing to its simplicity and scalability, yet its quality is sensitive to initialization: KMeans++ provides a theoretical O(log k) guarantee in expectation, but a single run can still leave the algorithm trapped in a suboptimal local minimum.
Approach
SOM-TSK presents a clustering framework that exploits the manifold-mapping properties of a trained Self-Organizing Map to generate a topology-aware candidate pool for deterministic K-means. The framework includes complementary selection criteria: Phase A (max-spread greedy), Phase A-alt, Phase B-bis (multi-start coverage-aware density), and Phase C (deterministic KMeans++ baseline), plus DenSOM for parameter-free density clustering and AutoSOM for automatic k-selection.
Results
Comprehensive benchmarks on 24 datasets spanning 5 categories demonstrate that SOM-TSK matches or exceeds KMeans++ on every evaluation metric. SOM-TSK achieves 6 wins, 18 ties, and 0 losses — it never performs more than 0.005 ARI below KMeans++ on any dataset. The mean ARI improvement is +0.014 across all 24 datasets.
Contribution
A rigorous comparative evaluation over 24 benchmarks, and an implementation in safe Rust that delivers competitive throughput at comparable quality. Three new seeding strategies, a density-based SOM clustering mode, and an automatic algorithm-selection meta-algorithm are introduced and evaluated.
Introduction
K-means is the de facto clustering algorithm owing to its O(nkd) iteration cost and theoretical guarantees, yet its quality depends heavily on initialization. KMeans++ provides an O(log k) competitive ratio in expectation, but a single initialization is still frequently inadequate when cluster boundaries are irregular or datasets are high-dimensional.
Key Insight
Self-Organizing Maps implicitly encode global topology — their neighbourhood relationships and weight matrix contain information about cluster structure that is entirely unexploited by standard K-means pipelines. SOM-TSK systematically mines this information.
Several prior works have coupled SOMs with K-means. Vesanto and Alhoniemi cluster SOM neurons directly; Tasdemir and Merenyi showed SOM adjacency topology can reveal cluster boundaries invisible to standard metrics. SOM-TSK generalises this by constructing a structured pool of diverse candidates, each seeded from different topological perspectives of the SOM graph, rather than running brute-force multi-restart independently.
The pool-based approach is more efficient than brute-force restart because each phase uses complementary structural information that reduces redundancy among candidates. After the pool is built, an inertia-windowed Calinski-Harabasz selection criterion picks the best partition in the pool, with a window wide enough to include SOM-seeded candidates whose inertia is slightly higher but whose topological quality is better.
Methods
SOM Training
A Self-Organizing Map with M = m×n prototype neurons is trained on the data using the competitive learning rule. The neighbourhood kernel hc,j(t) = exp(−‖rc − rj‖² / 2σ²(t)) shrinks over T epochs. After training, each data point is assigned to its Best-Matching Unit (BMU), producing a topology-preserving codebook that implicitly compresses cluster structure.
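The competitive update can be sketched in a few lines of Rust. This is a minimal illustration under simplifying assumptions (online updates, fixed learning rate and σ, no Rayon parallelism); the function names are illustrative, not the SOM-TSK API:

```rust
// Minimal SOM training sketch. Grid positions r_j are (row, col) on an
// m x grid_w lattice; the neighbourhood kernel is
// h_{c,j} = exp(-||r_c - r_j||^2 / (2 sigma^2)).

// Index of the Best-Matching Unit for sample x.
fn bmu(weights: &[Vec<f64>], x: &[f64]) -> usize {
    weights
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            let da: f64 = a.iter().zip(x).map(|(w, v)| (w - v).powi(2)).sum();
            let db: f64 = b.iter().zip(x).map(|(w, v)| (w - v).powi(2)).sum();
            da.partial_cmp(&db).unwrap()
        })
        .map(|(i, _)| i)
        .unwrap()
}

// One epoch of competitive learning: pull every neuron toward each
// sample, weighted by the Gaussian neighbourhood of the sample's BMU.
fn train_epoch(weights: &mut [Vec<f64>], grid_w: usize, data: &[Vec<f64>], lr: f64, sigma: f64) {
    for x in data {
        let c = bmu(weights, x);
        let (cr, cc) = (c / grid_w, c % grid_w);
        for (j, w) in weights.iter_mut().enumerate() {
            let (jr, jc) = (j / grid_w, j % grid_w);
            let d2 = (cr as f64 - jr as f64).powi(2) + (cc as f64 - jc as f64).powi(2);
            let h = (-d2 / (2.0 * sigma * sigma)).exp();
            for (wi, xi) in w.iter_mut().zip(x) {
                *wi += lr * h * (xi - *wi);
            }
        }
    }
}

fn main() {
    // 2x2 grid; the neuron at [1,1] should win and move toward the sample.
    let mut weights = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![0.0, 1.0], vec![1.0, 0.0]];
    let data = vec![vec![0.9, 0.9]];
    train_epoch(&mut weights, 2, &data, 0.5, 0.5);
    assert_eq!(bmu(&weights, &data[0]), 1);
}
```

In practice T epochs are run with σ(t) and the learning rate decayed per epoch, which the sketch omits.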
Phase A & A-alt — Max-Spread Greedy Seeding
Phase A applies a greedy max-spread strategy analogous to KMeans++ but on SOM neuron weights: the first neuron is the one with highest hit-count, and each subsequent neuron is the one maximally distant from already-selected centres. Phase A-alt uses maximum-spread (farthest from global centroid) as the first selection, providing a complementary topological perspective.
i* = arg maxi ∉ S, hi ≥ 0.5 minj ∈ S ‖wi − wj‖
starting from S = {arg maxi hi} (the highest hit-count neuron) and iterating until |S| = k
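The greedy selection above can be sketched as follows; the 0.5 hit-count filter excludes neurons that were never a BMU. Function names are illustrative, not the SOM-TSK API:

```rust
// Squared Euclidean distance between two neuron weight vectors.
fn dist2(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

// Phase A sketch: start from the highest hit-count neuron, then
// repeatedly add the neuron whose nearest already-selected centre
// is farthest away (max-min spread).
fn max_spread_seeds(weights: &[Vec<f64>], hits: &[f64], k: usize) -> Vec<usize> {
    let first = (0..weights.len())
        .max_by(|&a, &b| hits[a].partial_cmp(&hits[b]).unwrap())
        .unwrap();
    let mut s = vec![first];
    while s.len() < k {
        let next = (0..weights.len())
            .filter(|i| hits[*i] >= 0.5 && !s.contains(i))
            .max_by(|&a, &b| {
                let da = s.iter().map(|&j| dist2(&weights[a], &weights[j])).fold(f64::INFINITY, f64::min);
                let db = s.iter().map(|&j| dist2(&weights[b], &weights[j])).fold(f64::INFINITY, f64::min);
                da.partial_cmp(&db).unwrap()
            })
            .unwrap();
        s.push(next);
    }
    s
}

fn main() {
    // 1-D weights for readability: neuron 2 has the most hits, and
    // neuron 0 (weight 0.0) is farthest from it (weight 10.0).
    let weights = vec![vec![0.0], vec![5.0], vec![10.0], vec![4.0]];
    let hits = vec![3.0, 2.0, 4.0, 1.0];
    let seeds = max_spread_seeds(&weights, &hits, 3);
    assert_eq!(seeds[0], 2);
    assert_eq!(seeds[1], 0);
}
```

Phase A-alt differs only in the first pick (farthest neuron from the global centroid instead of the highest hit-count neuron).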
Phase B-bis — Multi-Start Coverage-Aware Density Seeding
Phase B-bis runs three independent instances of Phase B, each seeded from a different starting neuron (the 1st, 2nd, and 3rd highest hit-count neurons). This multi-start approach substantially reduces sensitivity to poor single-start choices. A final KMeans++ run on the top-3 B-bis candidates is then retained for the pool, improving ARI by up to 0.20 over a single start.
DenSOM — Density-Based SOM Clustering
DenSOM extracts density-defined cluster structure entirely from the SOM activation map, requiring no distance parameter ε and determining k automatically. Gaussian smoothing (standard deviation σ) is applied to the hit-count map via separable 1D convolution, Otsu thresholding then identifies core neurons, and a topographic watershed flood-fill partitions the neurons into clusters.
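The smoothing stage can be sketched as follows (Otsu thresholding and the watershed are omitted); this is an illustrative sketch, not the SOM-TSK implementation:

```rust
// Normalised 1D Gaussian kernel truncated at 3 sigma.
fn gaussian_kernel(sigma: f64) -> Vec<f64> {
    let r = (3.0 * sigma).ceil() as i64;
    let mut k: Vec<f64> = (-r..=r)
        .map(|i| (-(i as f64).powi(2) / (2.0 * sigma * sigma)).exp())
        .collect();
    let s: f64 = k.iter().sum();
    k.iter_mut().for_each(|v| *v /= s);
    k
}

// Convolve one row with the kernel, clamping at the borders.
fn smooth_1d(row: &[f64], kernel: &[f64]) -> Vec<f64> {
    let r = (kernel.len() / 2) as i64;
    (0..row.len() as i64)
        .map(|i| {
            kernel
                .iter()
                .enumerate()
                .map(|(j, kv)| {
                    let idx = (i + j as i64 - r).clamp(0, row.len() as i64 - 1);
                    kv * row[idx as usize]
                })
                .sum()
        })
        .collect()
}

// Separable smoothing: a row pass then a column pass equals a full 2D
// Gaussian convolution at O(M * r) cost instead of O(M * r^2).
fn smooth_hit_map(map: &[Vec<f64>], sigma: f64) -> Vec<Vec<f64>> {
    let kernel = gaussian_kernel(sigma);
    let rows: Vec<Vec<f64>> = map.iter().map(|r| smooth_1d(r, &kernel)).collect();
    let (m, n) = (rows.len(), rows[0].len());
    let mut out = vec![vec![0.0; n]; m];
    for c in 0..n {
        let col: Vec<f64> = (0..m).map(|r| rows[r][c]).collect();
        let sm = smooth_1d(&col, &kernel);
        for r in 0..m {
            out[r][c] = sm[r];
        }
    }
    out
}

fn main() {
    // A single sharp peak is spread over neighbours but stays the maximum.
    let map = vec![vec![0.0, 0.0, 0.0], vec![0.0, 9.0, 0.0], vec![0.0, 0.0, 0.0]];
    let sm = smooth_hit_map(&map, 1.0);
    assert!(sm[1][1] > sm[0][1] && sm[0][1] > 0.0);
}
```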
AutoSOM — Automatic k-Selection & Routing
AutoSOM removes the need to specify k by fitting Gaussian Mixture Models with the BIC criterion for k ∈ {2, …, kmax}. The optimal k is then passed to SOM-TSK or DenSOM depending on the geometry of the data: the DenSOM path is evaluated in parallel on a random subsample of min(n, 2000) points, and the final result is selected by silhouette score on a subsample of the same size.
k̂ = arg mink BIC(k), where BIC(k) = −2ℓk + pk ln n
ℓk is the maximised log-likelihood of a k-component GMM and pk its number of free parameters
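The BIC rule reduces to a few lines once the maximised log-likelihoods are available from a GMM fit computed elsewhere. A sketch, assuming full-covariance components (the parameter count would differ for diagonal covariances; all names are illustrative):

```rust
// BIC(k) = -2 * l_k + p_k * ln(n).
fn bic(log_lik: f64, n_params: f64, n: f64) -> f64 {
    -2.0 * log_lik + n_params * n.ln()
}

// Free parameters of a k-component full-covariance GMM in d dimensions:
// (k-1) mixing weights + k*d means + k*d*(d+1)/2 covariance entries.
fn gmm_params(k: f64, d: f64) -> f64 {
    (k - 1.0) + k * d + k * d * (d + 1.0) / 2.0
}

// Pick the k with minimal BIC from (k, log-likelihood) pairs.
fn select_k(log_liks: &[(usize, f64)], d: f64, n: f64) -> usize {
    log_liks
        .iter()
        .min_by(|a, b| {
            let ba = bic(a.1, gmm_params(a.0 as f64, d), n);
            let bb = bic(b.1, gmm_params(b.0 as f64, d), n);
            ba.partial_cmp(&bb).unwrap()
        })
        .unwrap()
        .0
}

fn main() {
    // Hypothetical log-likelihoods for k = 2..4 on n = 500, d = 2 data.
    // k = 3 wins: the small likelihood gain at k = 4 does not offset
    // the ln(n)-weighted penalty of its extra parameters.
    let fits = [(2, -1200.0), (3, -1100.0), (4, -1095.0)];
    assert_eq!(select_k(&fits, 2.0, 500.0), 3);
}
```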
Selection Criterion — Calinski-Harabasz with Inertia Window
A single candidate is chosen from the pool by maximising the Calinski-Harabasz (CH) score within an inertia window of 5% above the pool minimum. This window is wide enough to include SOM-seeded candidates with slightly higher inertia but better topological cluster separation. The 5% threshold avoids including pathologically compact but unnatural partitions.
CH(k) = [tr(Bk) / tr(Wk)] · [(n−k) / (k−1)]
Select y ∈ P such that I(y) ≤ 1.05 · min I and CH(y) is maximised
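The windowed selection reduces to a filter-then-argmax over the candidate pool. A sketch, assuming each candidate's inertia and CH score are precomputed (names illustrative):

```rust
// Candidates are (inertia, ch_score) pairs. Keep only those within 5%
// of the pool-minimum inertia, then return the index of the candidate
// with the highest Calinski-Harabasz score among them.
fn select_candidate(pool: &[(f64, f64)]) -> usize {
    let min_inertia = pool.iter().map(|c| c.0).fold(f64::INFINITY, f64::min);
    pool.iter()
        .enumerate()
        .filter(|(_, c)| c.0 <= 1.05 * min_inertia)
        .max_by(|(_, a), (_, b)| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    // Candidate 1 has slightly higher inertia (inside the 5% window)
    // but a better CH score, so it beats candidate 0; candidate 2 is
    // outside the window and is excluded despite its top CH score.
    let pool = [(100.0, 350.0), (103.0, 420.0), (130.0, 500.0)];
    assert_eq!(select_candidate(&pool), 1);
}
```

The usage example illustrates the design intent: pure inertia minimisation would pick candidate 0, while the window lets a topologically better partition win.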
Results
SOM-TSK was evaluated on 24 benchmark datasets across five categories against KMeans++ (up to 300 Lloyd iterations, tolerance 10⁻⁴), SL-TSK (SOM-TSK without shuffle), DenSOM, and AutoSOM. Six metrics were computed: ARI, NMI, FMI, Silhouette, Davies-Bouldin, and Calinski-Harabasz.
Zero Losses
SOM-TSK never performs worse than 0.005 ARI below KMeans++ on any dataset, and the mean ARI improvement is +0.014 across all 24 datasets; the ARI-gain plot contains no negative bars.
The four strongly significant wins (s2, a2, a3, Digits) have bootstrap confidence intervals entirely above zero, with ρ̂ = 1.00 for s2 and a2. These are topologically complex datasets where SOM's manifold approximation gives a decisive advantage over random KMeans++ seeding.
Scalability results at n = 1,000; 5,000; 10,000; 50,000 (d = 2, k = 5, 10×10 SOM) confirm that SOM-TSK's training time scales as O(T · M · n · d). End-to-end, SOM-TSK takes roughly 11× the wall-clock time of KMeans++. The gap is larger than the analytic estimate because the Rust SOM implementation processes all n samples in every epoch, whereas KMeans++ benefits from cache-efficient early convergence.
High-dimensional evaluation (n=1000, k=5, d ∈ {32, 64, 128, 256}) shows SOM-TSK maintaining competitive quality through 256 dimensions, with the SOM grid size and epoch count adjusted per dataset to avoid over-fitting to small-n samples.
Discussion
The results confirm that SOM topology carries actionable information about cluster structure — Phase A's max-spread approach alone outperforms a single KMeans++ run on well-separated, non-Gaussian datasets, and the multi-start pool eliminates the variance that plagues single-initialisation methods.
When SOM-TSK Wins
The strongest gains appear on non-convex shapes (s2, a2, a3) and datasets with large k (k=35, k=50), where KMeans++ is most susceptible to poor centroid placement. For high-dimensional data (Digits, d=64), the SOM's density-aware topology provides a structured initial codebook that random seeding cannot reliably replicate.
The Calinski-Harabasz selection criterion with the 5% inertia window is a key design choice: without the window, CH maximisation would greedily select maximally compact partitions, potentially ignoring topologically superior SOM-seeded candidates. The window size of 1.05 was validated across the full 24-dataset suite.
DenSOM provides a qualitatively different mode: parameter-free and k-free, it operates entirely on the SOM activation map. On 8 of 24 datasets where ground truth k was unknown, DenSOM correctly recovered the cluster count; where it failed, AutoSOM's BIC routing correctly delegated to SOM-TSK.
Conclusion
SOM-TSK demonstrates that the SOM's topology-preserving weight matrix is a powerful, underutilised resource for K-means initialisation. By constructing a structured pool of topologically diverse candidates and selecting via inertia-windowed Calinski-Harabasz, the framework achieves 6 wins, 18 ties, and zero losses against KMeans++ across a comprehensive benchmark suite.
The Rust implementation delivers competitive throughput: SOM training is fully parallelised via Rayon and optionally offloaded to CUDA or Metal backends. DenSOM eliminates both the k-specification and ε-specification requirements simultaneously — the only free parameter is the Gaussian smoothing σ. AutoSOM removes even this through BIC-guided GMM selection and automatic algorithm routing.
Future directions include extending the SOM-seeded pool to non-Euclidean metric spaces, integrating GRASP's topology-aware spectral routing as a downstream alternative to K-means, and evaluating on RNA-seq cell-type clustering tasks where topology preservation has direct biological interpretability.