Research Paper

Functional Group Analysis of Drug-Like Chemical Space

Using Graph-Level Variational Graph Encoding and Self-Organising Maps

Reading molecules like little maps to see which chemical pieces make a drug feel drug-like.

Graph-level variational encoding, stratified unsupervised clustering, and formal enrichment testing map how functional group composition varies across 249,455 ZINC15 drug-like molecules — with counterfactual QED analysis decomposing scoring artefacts from genuine chemical signals.

Evint Leovonzko, Callixta F. Cahyaningrum, Rachmania Ulwani — Department of Chemistry, Universitas Gadjah Mada

249,455 ZINC15 molecules

22 Functional group types

5,496 Molecules/sec throughput

View Source Code Start Reading

Research Snapshot

Pipeline details, co-authors, and headline metrics behind this cheminformatics study.

249,455 Molecules analysed

18% Lower QE vs. baseline

0.68 Bootstrap ARI stability

Authors Evint Leovonzko, Callixta F. Cahyaningrum, Rachmania Ulwani

Affiliation Department of Chemistry, Universitas Gadjah Mada, Yogyakarta, Indonesia

Pipeline GAT-VGAE (3-layer, 16-dim latent) → QED-stratified autotuned SOM → FDR-corrected enrichment — all in Rust with Burn 0.16.

Concept Overview

Abstract

Context

This study combines graph-level variational encoding, stratified unsupervised clustering, and formal enrichment testing to map how functional group composition varies across drug-like chemical space. Analysis of 249,455 ZINC15 molecules encoded via GAT-VGAE, stratified by QED, and clustered with autotuned SOMs.

Drug hunters sift through millions of molecules to find ones that could become medicines. We wanted a map showing which chemical building blocks tend to live where on that landscape. So we took 249,455 ZINC15 molecules, turned each one into a tiny graph with GAT-VGAE (a neural network that reads molecules), sorted them by QED — a 0-to-1 score for how drug-like a molecule looks — and grouped them using an auto-tuned SOM (Self-Organising Map — an algorithm that lays similar items near each other on a grid).

Finding 1: Counterfactual QED Decomposition

Counterfactual QED analysis decomposes the >22-fold nitro prevalence gradient into 78% entailed (attributable to QED’s structural alert penalty) and 22% empirical (genuine physicochemical disfavour from elevated MW and LogP).

Molecules with a nitro group (a common chemical flag) are 22 times rarer among drug-like molecules than you'd expect by chance. We asked: is that real, or just because QED already punishes nitro on purpose? The answer splits cleanly: 78% comes from QED's built-in nitro penalty, and 22% is genuinely the molecule's fault — nitro-bearing molecules tend to be heavier (higher MW) and oilier (higher LogP), which the body doesn't love.

Finding 2: Phenyl-Depleted Drug-Likeness

A high-QED cluster (n=1,615) shows 7.2-fold phenyl depletion with enrichment for saturated groups — thioethers, tertiary amines, ethers — demonstrating drug-likeness without aromatic dominance, though representing only 2% of high-QED space.

Most drug-like molecules lean on flat benzene rings (phenyl groups). But we found one cluster of 1,615 highly drug-like molecules with 7.2 times fewer of them. Instead, they're full of curvier, three-dimensional pieces — thioethers, tertiary amines, and ethers. The takeaway: you don't need flat aromatic rings to look drug-like. The catch: this cluster is only 2% of the high-QED pool, so the route works but few molecules take it.

Finding 3: Pharmacophore Signatures

Sulfonamide–heterocycle–nitrile pharmacophore signatures recur across all five QED strata with non-random co-occurrence (p«0.001, χ²). Comparison with Morgan fingerprint baselines confirms 72% top-10 enrichment agreement, while VGAE provides 18% lower quantization error.

A three-piece recipe — sulfonamide, heterocycle, and nitrile — keeps showing up together across every drug-likeness band, far more often than chance allows (p«0.001). To check our work, we re-ran the analysis with Morgan fingerprints (a classic, simpler molecule descriptor). They agreed on 72% of the top-10 results, while our graph-based pipeline drew a sharper map (18% lower quantization error — meaning molecules sit closer to where they belong).

Validation

Functional groups identified via substructure matching (22 types, 96.3% agreement with RDKit), not learned by encoder. Bootstrap resampling (1,000 iterations) yields mean cluster stability of ARI = 0.68. Six ablation experiments confirm each pipeline component contributes to final resolution.

We spotted functional groups by direct pattern matching, not by trusting the neural network to learn them (22 types, agreeing with the standard tool RDKit 96.3% of the time). To check the clusters were stable, we re-shuffled the data 1,000 times: the clusters held together with ARI (Adjusted Rand Index — a 0-to-1 agreement score) = 0.68. Six knock-out experiments confirmed every part of the pipeline pulls its weight.

Introduction

Functional groups — hydroxyl, carbonyl, amine, sulfonyl, and hundreds more — are the primary carriers of chemical reactivity, physicochemical properties, and biological activity. Each imparts characteristic properties: carboxylic acids introduce ionisability at physiological pH, amide bonds provide hydrogen bonding, aromatic rings contribute hydrophobicity, halogens modulate lipophilicity and metabolic resistance.

Think of a molecule as a body and functional groups as its working parts — hydroxyl, carbonyl, amine, sulfonyl, and many more. Each part does a specific job. Carboxylic acids carry a charge in the body. Amides stick to water and proteins. Aromatic rings repel water. Halogens (chlorine, fluorine and friends) tweak how oily a molecule is and how fast the body breaks it down. Swap one part and the whole molecule behaves differently.

Drug-likeness — the degree to which a molecule’s properties are consistent with oral bioavailability — serves as a key filtering criterion in early-stage drug discovery. Lipinski’s Rule of Five provides binary pass/fail filtering (MW ≤500, LogP ≤5, HBD ≤5, HBA ≤10), but cannot quantify degree of drug-likeness. The Quantitative Estimate of Drug-likeness (QED) addresses this gap as a continuous 0–1 score integrating eight molecular descriptors, calibrated against 771 FDA-approved oral drugs.

"Drug-likeness" asks a simple question: could this molecule survive being swallowed as a pill? It's an early filter in drug discovery, before anyone spends months in a lab. The classic test is Lipinski's Rule of Five (MW ≤500, LogP ≤5, HBD ≤5, HBA ≤10), but it only gives a yes or no. QED upgrades that to a smooth 0–1 score, blending eight molecular features and tuned against 771 FDA-approved oral drugs — so you can ask how drug-like, not just whether.

Research Gap

Despite the recognised importance of functional groups, systematic large-scale analyses relating functional group composition to quantitative drug-likeness scores remain sparse. Traditional fingerprint methods flatten molecular topology into fixed-length bit vectors, losing the structural context that makes chemistry meaningful.

Everyone agrees functional groups matter, but few studies have linked them, at scale, to drug-likeness scores. Older methods compress a molecule into a row of bits, like turning a building into a barcode. Useful for fast lookup, but the shape and connections that give chemistry its meaning get thrown away.

This study extends a prior analysis by the present authors applying a feed-forward autoencoder and Deep SOM to the same 249,455 molecules. That study found sp² hybridisation correlated negatively with drug-likeness (r = −0.45) while Fsp³ correlated positively (r = +0.26), but was limited by a flat feature vector, deterministic autoencoder, and aggregate atomic statistics rather than functional group analysis.

We've looked at this same set of 249,455 molecules before, with a simpler neural network. That earlier work spotted a pattern: flat, aromatic-style atoms (sp²) hurt drug-likeness (r = −0.45), while 3D, saturated atoms (Fsp³) helped (r = +0.26). But it counted atoms in bulk, not by functional group, and used a flatter representation — so it could see the trend but not say which chemical pieces were driving it.

The present work advances to full molecular graph representations, graph attention network encoding, variational inference for smooth latent manifolds, and a 22-type functional group vocabulary with formal enrichment testing. The primary methodological contribution is counterfactual QED analysis, which disentangles scoring-function artefacts from genuine chemical signals.

This time we treat each molecule as a full graph (atoms as dots, bonds as lines) and use a graph attention network that pays attention to how parts connect. We add a small dose of randomness during training so the model learns a smooth, navigable map. We track 22 specific functional groups and run proper statistical tests on which ones cluster together. The headline new trick is counterfactual QED analysis, which tells you when a pattern is real chemistry versus a quirk of how QED is calculated.

Question 1

Functional Group Landscape

What is the functional group landscape of drug-like chemical space, and how does it relate to known structural trends in approved drugs?

Census Prevalence Distribution

Question 2

FG–Property Relationships

What are the quantitative functional group–property relationships, with effect sizes and confidence intervals, for QED, LogP, and synthetic accessibility?

Correlations QED LogP & SAS

Question 3

Cluster Substructure

Does unsupervised clustering resolve interpretable FG-level substructure? What is the relationship between aromatic character and drug-likeness?

SOM Clustering Aromaticity Enrichment

Question 4

Entailed vs. Empirical

To what extent are QED-stratified patterns artefacts of QED’s construction versus genuine chemical signals? Addressed through counterfactual QED decomposition — the study’s primary methodological contribution.

Counterfactual Decomposition Novel Method

Methodology

Step 1: Dataset & Molecular Graph Construction

249,455 drug-like molecules from ZINC15 (Lipinski-compliant, commercially available), deduplicated by canonical SMILES. Each molecule represented as undirected graph G = (V, E) with 29-dimensional node features (atom type one-hot, degree, charge, hybridisation, aromaticity, ring membership, atomic mass, chirality) and 9-dimensional edge features (bond type, conjugation, ring membership, stereochemistry). Mean graph size: 23.2 atoms, 24.9 bonds; totalling 5.78M atoms and 6.21M bonds.

We pulled 249,455 drug-like molecules from ZINC15, a public catalogue of compounds you can actually buy. Duplicates were removed. Each molecule becomes a graph: dots for atoms, lines for bonds. Every dot carries 29 facts about that atom (what element it is, how many neighbours it has, charge, shape, whether it's in a ring, mass, and chirality). Every line carries 9 facts about the bond (type, conjugation, ring membership, stereochemistry). The average molecule has 23.2 atoms and 24.9 bonds — 5.78M atoms and 6.21M bonds across the dataset.

ZINC15 249,455 Molecules 29-dim Node Features 9-dim Edge Features

Step 2: Functional Group Detection

Substructure pattern matching identifies 22 functional group types across six categories: oxygen-containing (hydroxyl, carboxyl, ester, ether, ketone, aldehyde, epoxide), nitrogen-containing (primary/secondary/tertiary amine, amide, nitro, nitrile, imine), sulfur-containing (thiol, thioether, sulfonyl, sulfoxide), halogens, ring systems (phenyl, heterocycle), and phosphorus (phosphate). Priority-based two-pass algorithm resolves overlapping substructures. Validated at 96.3% agreement with RDKit (range: 91.8% ketone to 99.7% nitro).

We hunt for 22 functional group types by pattern-matching atoms and bonds, sorted into six families: oxygen-based (hydroxyl, carboxyl, ester, ether, ketone, aldehyde, epoxide), nitrogen-based (primary/secondary/tertiary amine, amide, nitro, nitrile, imine), sulfur-based (thiol, thioether, sulfonyl, sulfoxide), halogens, ring systems (phenyl, heterocycle), and phosphorus (phosphate). A two-pass priority algorithm decides who wins when patterns overlap. We checked our hit list against RDKit (the standard cheminformatics tool) and agreed 96.3% of the time — 91.8% on ketones, 99.7% on nitros.

22 FG Types Two-Pass Matching 96.3% RDKit Agreement

Step 3: GAT-VGAE Encoding

Encoder: input projection (29→64) followed by three GAT layers with edge-aware attention, residual connections, and ReLU. Global attention pooling produces graph-level embeddings. Two parallel heads project to mean μ and log-variance (both ∈ ℝ¹&sup6;). Sampling via reparameterisation trick. Decoder reconstructs node features via three linear layers (16→64→128→29). Loss: MSE + βKL with β = 0.001 (selected via grid search to prevent posterior collapse). Trained 100 epochs with Adam, cosine annealing, batch size 128. Validation MSE = 0.0512 (training: 0.0505, Δ = 1.4%).

The model is a neural network that reads a molecule's graph and squeezes it into a short summary. First, atom info is widened from 29 numbers to 64. Then three GAT layers let each atom "look around" at its neighbours and decide which ones matter, with edge information mixed in. A pooling step compresses the whole molecule into one summary, which is then expressed as a mean and a spread (both 16-dimensional). A small decoder tries to rebuild the atoms from that summary — if it can, the summary is good. Loss = MSE + βKL with β = 0.001, picked by grid search to keep the model from collapsing. Trained for 100 epochs with Adam and cosine annealing, batch size 128. Validation MSE = 0.0512 versus 0.0505 on training (Δ = 1.4% — barely any overfitting).

3-Layer GAT 16-dim Latent β = 0.001 MSE = 0.051

Step 4: QED-Stratified SOM Clustering

Molecules divided into five strata via automated valley detection on QED distribution (breakpoints at 0.399, 0.520, 0.694, 0.814). Per-stratum SOM trained on 16-dim latent embeddings with autotuned grid size (10×10 to 30×30, composite score of QE + TE + ActiveRatio). Training: 128 epochs, Gaussian neighbourhood, K-Means++ init. Clusters characterised by functional group census, enrichment ratios with 95% CIs, Fisher’s exact test with Benjamini–Hochberg FDR correction (α = 0.05). Total: 68,836 enrichment tests with three safeguards.

We split the molecules into five drug-likeness bands using natural dips in the QED distribution as cut-offs (at 0.399, 0.520, 0.694, 0.814). Each band gets its own SOM, a 2D grid where similar molecules sit close together. Grid size auto-tunes between 10×10 and 30×30 to fit each band best. Training: 128 passes through the data, with K-Means++ for a good start. Inside each cluster, we count functional groups, calculate how much more common each one is than expected (with 95% confidence intervals), and run Fisher's exact test — the gold standard for "is this real?". Benjamini–Hochberg FDR correction (α = 0.05) keeps false positives in check across 68,836 tests.

5 QED Strata Autotuned SOM Fisher’s Exact Test 68,836 Tests

Step 5: Counterfactual QED Analysis

QED recomputed with the structural alert component removed (QED_no-alert), using geometric mean of seven non-alert desirability functions. Each molecule re-stratified using same valley-detection algorithm. Decomposes observed gradients into entailed components (attributable to QED’s construction) and empirical components (genuine chemical signals). Primary methodological contribution of the study.

Here's the twist that drives our main finding. QED's recipe includes a penalty for "structural alerts" — chemical groups already known to be trouble. So if nitro looks rare in high-QED molecules, is that real, or just because QED was told to dislike nitro? We rebuild QED without that alert penalty (QED_no-alert), re-sort the molecules, and see what stays. What survives is empirical — real chemical signal. What disappears was entailed — baked in by QED itself. This is the study's main methodological contribution.

Counterfactual Scoring Entailed vs. Empirical Novel Method

Implementation

Full pipeline in Rust (2021 edition) using Burn 0.16 (wgpu/Metal GPU backend), petgraph 0.7, rayon for parallelism. Training on Apple M2 Ultra (76-core GPU) completed in ~45 minutes. Throughput: 5,496 molecules/second. Visualisation via UMAP-rs with plotters SVG output in colorblind-safe palettes. Fixed random seed (42) for primary run; metrics reported as mean ± std over 5 seeds.

The whole pipeline is written in Rust (a fast, memory-safe language) with Burn 0.16 for neural networks (running on Apple's Metal GPU), petgraph for the graph code, and rayon to spread work across CPU cores. On an Apple M2 Ultra, training finishes in about 45 minutes — chewing through 5,496 molecules every second. Plots use UMAP-rs and a colourblind-safe palette. We fix the random seed (42) for the headline run and report mean ± std across 5 seeds.

Rust + Burn 0.16 wgpu/Metal ~45 min Training 5,496 mol/s

Results

The dataset comprises 249,455 molecules with mean QED = 0.728 ± 0.140, mean LogP = 2.457 ± 1.434, and mean SAS = 3.053 ± 0.835. All 22 functional group types were detected. Prevalence is highly skewed: phenyl (83.0%), amide (68.0%), and heterocycle (58.0%) dominate, while nitro (4.3%), carboxyl (3.8%), and phosphate (0.1%) are rare.

The dataset has 249,455 molecules. Average QED is 0.728 ± 0.140 (most are reasonably drug-like). Average LogP is 2.457 ± 1.434 (a measure of oiliness — balanced, not too greasy). Average SAS (Synthetic Accessibility Score — how hard a molecule is to make in a lab, 1 easy to 10 hard) is 3.053 ± 0.835. All 22 functional groups appear, but very unevenly: phenyl rings (83.0%), amides (68.0%), and heterocycles (58.0%) are everywhere, while nitros (4.3%), carboxyls (3.8%), and phosphates (0.1%) are rare.

Functional Group Census

Aromatic Dominance of ZINC15

83.0% of molecules contain ≥1 phenyl ring (mean 2.42 rings/molecule). 96.2% contain at least one ring system. Phenyl accounts for 604,208 total occurrences — followed by heterocycle (338,210), amide (221,905), halide (135,900), and ether (122,335).

83% Phenyl 68% Amide 58% Heterocycle 2.42 Rings/mol

FG–Property Correlations

Key Structure–Property Relationships

Nitro shows strongest negative QED correlation (r = −0.321). Phenyl dominates LogP (r = +0.397) and SAS (r = −0.427). Amide has strongest SAS association (r = −0.290), reflecting amide coupling as the most-used C–N bond-forming reaction. No single binary FG–property correlation exceeds |r| = 0.43.

Nitro: r(QED) = −0.32 Phenyl: r(LogP) = +0.40 Amide: r(SAS) = −0.29

VGAE Encoding

Latent Space Quality

Reconstruction MSE = 0.0505 (43% error reduction over per-atom-type baseline). Dimension 0 carries highest variance (σ² = 0.952) encoding phenyl presence (|r| = 0.543). Dimensions 5 and 7 show strongest QED associations (r = +0.322, +0.315). Negligible overfitting (Δ = 1.4% train/val gap).

MSE = 0.051 43% Error Reduction 16-dim Latent

QED Stratification

Five-Stratum Structure

Automated valley detection yields five strata: S0 (n=6,830, QED <0.40), S1 (n=17,622), S2 (n=60,427), S3 (n=83,673), S4 (n=80,903, QED >0.81). Quantization error drops 33% from S0 (1.10) to S4 (0.74), indicating high-QED molecules occupy progressively more compact latent space.

S0: n = 6,830 S4: n = 80,903 33% QE Reduction

Headline Enrichment Findings

Stratum 4 · High QED

Phenyl-Depleted Drug-Likeness

Cluster 0 (n=1,615, QED̄ = 0.847, 2.0% of S4): 7.2-fold phenyl depletion (11.5% vs. 83.0% overall). Enriched for thioether (1.89×, 95% CI [1.62, 2.20]), tertiary amine (1.81×, CI [1.58, 2.07]), ether (1.41×, CI [1.24, 1.60]). Identifies saturated 3D scaffolds — piperidines, morpholines, tetrahydropyrans — achieving drug-likeness without aromatic dominance.

7.2× Phenyl Depletion Thioether 1.9× Tert. Amine 1.8× n = 1,615

Cross-Stratum Pattern

Sulfonamide Pharmacophore Recurrence

Sulfonyl-enriched clusters (1.5–2.1×, all p_adj < 0.01) consistently co-enrich for nitrile (1.3–1.8×), imine (1.4–2.6×), and heterocyclic nitrogen (1.2–1.5×) across all five strata. Non-random co-occurrence confirmed (p « 0.001, Yates-corrected χ²). Matches known pharmacophore of ATP-competitive kinase inhibitors and sulfonamide antibacterials.

5 Strata Recurrence p « 0.001 Kinase Inhibitor Motif

Stratum 4 · Cluster 899

Carboxylate-Bearing Drug-Like Molecules

Largest high-QED cluster (n=3,347, QED̄ = 0.871). Carboxyl enrichment 3.67× [3.3, 4.1], halide 1.51× [1.4, 1.6], sulfonyl 1.50× [1.3, 1.7]. Challenges the assumption carboxylic acids preclude oral bioavailability — many approved drugs (ibuprofen, valsartan, atorvastatin) use anionic carboxylate for salt-bridge target engagement.

COOH 3.7× Halide 1.5× n = 3,347 NSAID-like

Cross-Stratum Gradient

>22-fold Nitro Depletion

Nitro prevalence drops monotonically: S0 = 29.6%, S1 = 17.5%, S2 = 9.2%, S3 = 4.6%, S4 = 1.3% (>22-fold reduction). Counterfactual analysis decomposes this: ~78% attributable to QED’s structural alert penalty, ~22% reflects genuine physicochemical disfavour (nitro-bearing molecules have +34 Da MW and +0.8 LogP at matched ring count).

29.6% → 1.3% 78% Entailed 22% Empirical +34 Da MW

Stratum 0 · Low QED

Polar Aliphatic Peptidomimetics

Cluster 600 (n=302, LogP = −0.19): primary amine 3.8× (p_adj < 10⁻¹²), hydroxyl 3.7× (p_adj < 10⁻¹⁰), carboxyl 2.5× (p_adj < 10⁻⁶). Identifies amino acid derivatives that fail QED not because pharmacologically inert, but because QED penalises low LogP and high polar surface area — a known limitation for transporter-mediated drugs like gabapentin.

NH&sub2; 3.8× OH 3.7× LogP = −0.19 Peptidomimetics

Baseline Comparison

VGAE vs. Morgan Fingerprints

1024-bit Morgan fingerprints (ECFP4) reduced to 16 dims via PCA produce 18% higher QE. ARI between methods: 0.34 (different but overlapping partitions). 72% of top-10 enrichments shared. Broad patterns robust to embedding method, but Morgan/PCA cannot resolve phenyl-depleted Cluster 0 as distinct (1,615 molecules split across 8 clusters with enrichment <2.0×).

72% Agreement 18% Higher QE ARI = 0.34 Finer Resolution

Discussion

The functional group census reveals a striking aromatic bias: 83.0% of 249,455 ZINC15 molecules contain at least one phenyl ring, with a mean of 2.42 aromatic rings per molecule. This dominance reflects the well-documented bias toward flat, sp²-rich scaffolds in commercial chemical libraries.

The headcount shows a strong tilt toward flat, ring-shaped chemistry: 83.0% of the 249,455 ZINC15 molecules carry at least one benzene-style ring, with an average of 2.42 such rings each. That's not a surprise — commercial chemical libraries lean heavily on these flat scaffolds because they're cheap and easy to make.

The Aromatic Dominance Problem

Excessive aromaticity impairs drug-likeness through three interconnected mechanisms: each additional aromatic ring increases LogP ~0.5 units, reducing aqueous solubility; electron-rich π-systems are preferential substrates for CYP-mediated oxidative metabolism (particularly CYP1A2, CYP3A4), reducing oral bioavailability; and flat aromatic surfaces promote crystal packing and π-stacking, reducing dissolution rates and increasing plasma protein binding.

Too many flat rings hurt a drug in three ways. First, each extra aromatic ring nudges LogP up by about 0.5, making the molecule oilier and harder to dissolve in water. Second, liver enzymes (especially CYP1A2 and CYP3A4) love to chew through aromatic rings, so the drug gets broken down before it can act. Third, flat surfaces stack like pancakes, making the molecule slow to dissolve and quicker to get stuck to blood proteins.

Stratum 4’s Cluster 0 provides granular evidence for the “escape from flatland” hypothesis: 7.2-fold phenyl depletion with enrichment for saturated groups indicates piperidine, morpholine, and tetrahydropyran scaffolds. However, this cluster represents only 2.0% of Stratum 4 (n=1,615 of 80,903), so phenyl-depleted drug-likeness remains uncommon — likely reflecting synthetic accessibility bias, as flat aromatic scaffolds dominate vendor catalogues via well-established cross-coupling reactions.

Cluster 0 in the highest-QED band is direct evidence for the “escape from flatland” idea: 7.2× fewer phenyl rings, more saturated 3D groups, pointing to piperidines, morpholines, and tetrahydropyrans (curvy, chair-shaped rings). But this cluster is only 2.0% of the top band (1,615 of 80,903 molecules), so this route to drug-likeness exists but is rare — mostly because suppliers stock flat aromatic scaffolds, which are easy to assemble with well-known cross-coupling reactions.

Drug-Likeness–Synthetic Accessibility Trade-off

LogP decreases with increasing QED (S0 mean = 3.21 → S4 mean = 1.97; Δ = −1.24, p < 10⁻¹⁰). Conversely, SAS increases with QED (S0 mean = 2.52 → S4 mean = 3.48; Δ = +0.96). Molecule-level correlation is moderate (r = +0.31, R² = 0.096) — substantial within-stratum variance implies “Pareto-efficient” molecules achieving high QED without proportionate SAS penalties.

There's a price tag on being drug-like. As QED climbs, molecules get less oily (LogP drops from 3.21 to 1.97; Δ = −1.24, p < 10⁻¹⁰) but harder to make in the lab (SAS rises from 2.52 to 3.48; Δ = +0.96). The link is moderate (r = +0.31), and there's wide variation within each band — so some molecules manage to be very drug-like without being painfully hard to synthesise. Those are the sweet spots worth chasing.

Topology of Drug-Like Chemical Space

Quantization error decreases monotonically from Stratum 0 (QE = 1.10) to Stratum 4 (QE = 0.74) — a 33% reduction. Topographic error drops 44%, and U-matrix maxima decrease 33%. Together, these metrics indicate drug-like molecules occupy a progressively more compact latent space region with smoother inter-cluster transitions. Stratum 4 has the smoothest internal topology (U-max = 0.074), suggesting the high-QED latent surface can serve as a navigation tool for scaffold hopping via centroid interpolation.

Drug-like molecules sit closer together on our map than non-drug-like ones, and the gaps between groups are gentler. Quantization error (how far molecules sit from their nearest cluster centre) drops 33% from the lowest QED band to the highest (1.10 down to 0.74). Topographic error falls 44%; U-matrix peaks (sharp ridges between clusters) drop 33%. The top band has the smoothest map (U-max = 0.074), so it's a good playground for “scaffold hopping” — gliding between related drug-like ideas by moving across the map.

Ablation Results

Component Validation

Six ablation experiments confirm each component’s contribution. Deterministic AE (β=0) achieves lower reconstruction loss but fragments latent space (+12% QE, 76% FG agreement). GCN encoder degrades both metrics. Fixed 10×10 SOM shows largest degradation (64% FG agreement). However, 15×15 grids reproduce all principal findings within 15% enrichment ratios.

6 Ablations β=0: +12% QE GCN: Degraded Fixed SOM: 64%

Cluster Stability

Bootstrap Validation

1,000 bootstrap iterations (80% subsample) yield mean ARI = 0.68 (range: 0.61 for S0 to 0.74 for S4). Highlighted clusters have per-cluster Jaccard stability 0.72 (Cluster 0) and 0.78 (Cluster 899). 5-fold cross-validation yields 82% mean concordance for top-10 enriched FGs per stratum. Moderate stability — clusters are locally enriched neighbourhood partitions, not sharply defined subpopulations.

ARI = 0.68 1,000 Iterations 82% CV Concordance Jaccard 0.72–0.78

Design Implications

Actionable Insights for Medicinal Chemistry

(1) Aromatic ring reduction viable but uncommon (2% of high-QED space). (2) Sulfonamide pharmacophores are QED-robust and combinatorially predictable across all strata. (3) 3.67× carboxyl enrichment in S4’s largest cluster challenges the assumption carboxylic acids preclude oral bioavailability. (4) Nitro 78%/22% decomposition quantifies known structural alert contribution.

Ring Reduction Sulfonamide QED-Robust Carboxylate Challenge Nitro Decomposition

Limitations

Study Constraints

No external bioactivity validation (most significant limitation). ZINC15 biased toward synthetically tractable scaffolds. QED calibrated on 771 historical oral drugs, doesn’t account for PROTACs or non-oral administration. 22-type FG vocabulary misses boronic acids, azetidines, covalent warheads. Detection is stereo-agnostic. VGAE decoder broadcasts single code to all nodes. Cluster stability moderate (ARI = 0.68).

No Bioactivity Data Commercial Bias Stereo-Agnostic Moderate ARI

Conclusion

This work contributes an integrated analysis pipeline — graph-level variational encoding, stratified SOM clustering, formal enrichment testing — and applies it to 249,455 ZINC15 molecules. The primary methodological contribution is counterfactual QED decomposition, disentangling scoring-function artefacts from genuine chemical patterns: the >22-fold nitro prevalence gradient decomposes into 78% entailed by QED’s alert penalty and 22% empirical.

We built a three-step pipeline — read each molecule as a graph, sort them into drug-likeness bands and group similar ones, then test which chemical pieces cluster together — and ran it on 249,455 ZINC15 molecules. The main new idea is counterfactual QED: a way to tell when a chemical pattern is real versus when QED's own rules created it. The classic example: nitro groups are 22× rarer in drug-like molecules; 78% of that is QED's built-in nitro penalty, and 22% is genuine chemistry pushing back.

Principal Findings

Phenyl-depleted drug-likeness (n=1,615, 2% of high-QED space) and recurring sulfonamide pharmacophore co-occurrence quantitatively confirm established observations with effect sizes, confidence intervals, and FDR-controlled significance. Broad patterns robust to embedding method (72% top-10 agreement with Morgan fingerprints), while VGAE provides finer resolution (18% lower QE). Cluster stability moderate (ARI = 0.68, 1,000 bootstrap iterations).

Two patterns chemists have suspected for years now have hard numbers behind them. You can be drug-like without flat aromatic rings (1,615 molecules show it, though that's only 2% of the high-QED pool). And the sulfonamide–heterocycle–nitrile combo really does recur with proper statistical backing (effect sizes, confidence intervals, FDR-controlled significance). The big patterns hold up no matter which method you use (72% top-10 agreement with classic Morgan fingerprints), and our graph approach draws a sharper map (18% lower QE). Cluster stability is moderate (ARI = 0.68 over 1,000 reshuffles).

The study is descriptive; no bioactivity validation was performed. The critical next step is overlaying ChEMBL bioactivity data to test whether the structural organisation identified here predicts shared pharmacological activity. Additional priorities include extending to the COCONUT natural product database, incorporating 3D conformer generation, and using stratified SOM centroids as conditioning variables for constrained molecular generation.

One important caveat: this study only describes structure. We haven't shown that any of these molecules actually do anything in a cell. The big next step is bringing in ChEMBL data — a public record of which molecules hit which biological targets — to test whether molecules sitting near each other on our map share real drug activity. After that: extend to natural products (COCONUT database), add 3D shape information, and use the cluster centres as guides to generate brand-new drug-like molecules.

References

Drug-Likeness

Quantitative Estimate of Drug-likeness (QED)

Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S., & Hopkins, A. L. (2012). Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2), 90–98.

Nature Chemistry Drug-Likeness Score 771 Oral Drugs

Foundational

Lipinski’s Rule of Five

Lipinski, C. A., Lombardo, F., Dominy, B. W., Feeney, P. J. (1997). Experimental and computational approaches to estimate solubility and permeability in drug discovery. Advanced Drug Delivery Reviews, 23(1–3), 3–25.

Ro5 Oral Bioavailability Seminal Work

Database

ZINC15

Sterling, T., & Irwin, J. J. (2015). ZINC 15 — Ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11), 2324–2337.

Virtual Screening Drug Discovery 750M Compounds

Graph Neural Networks

Graph Attention Networks (GAT)

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph attention networks. ICLR 2018.

Attention Mechanism Message Passing ICLR

Variational Inference

Variational Graph Auto-Encoders

Kipf, T. N., & Welling, M. (2016). Variational graph auto-encoders. NeurIPS Workshop on Bayesian Deep Learning.

VGAE Generative Models Latent Space

Clustering

Self-Organising Maps

Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480.
Kohonen, T. (2001). Self-Organizing Maps. Springer Series in Information Sciences, Vol. 30.

SOM Topology Preservation Unsupervised Learning

Synthetic Accessibility

SAS Score

Ertl, P., & Schuffenhauer, A. (2009). Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1, 8.

Synthetic Feasibility Molecular Complexity Fragment-Based

Chemical Space

“Escape from Flatland”

Lovering, F., Bikker, J., & Humblet, C. (2009). Escape from flatland: Increasing saturation as an approach to improving clinical success. Journal of Medicinal Chemistry, 52(21), 6752–6756.

Fsp³ Saturation Clinical Success

Statistics

FDR Correction

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.

Multiple Testing FDR Foundational

Medicinal Chemistry

Fluorine in Drug Design & Sulfonamides

Meanwell, N. A. (2018). Fluorine and fluorinated motifs in the design and application of bioisosteres. J. Med. Chem., 61(14), 5822–5880.
Supuran, C. T. (2017). Special issue: Sulfonamides. Molecules, 22(10), 1642.

Fluorine Bioisosteres Sulfonamide Drugs Metabolic Stability

Related Research

Back to Portfolio