Functional Group Landscape
What is the functional group landscape of drug-like chemical space, and how does it relate to known structural trends in approved drugs?
Using Graph-Level Variational Graph Encoding and Self-Organising Maps
Graph-level variational encoding, stratified unsupervised clustering, and formal enrichment testing map how functional group composition varies across 249,455 ZINC15 drug-like molecules — with counterfactual QED analysis decomposing scoring artefacts from genuine chemical signals.
Evint Leovonzko, Callixta F. Cahyaningrum, Rachmania Ulwani — Department of Chemistry, Universitas Gadjah Mada
This study combines graph-level variational encoding, stratified unsupervised clustering, and formal enrichment testing to map how functional group composition varies across drug-like chemical space. Analysis of 249,455 ZINC15 molecules encoded via GAT-VGAE, stratified by QED, and clustered with autotuned SOMs.
Counterfactual QED analysis decomposes the >22-fold nitro prevalence gradient into 78% entailed (attributable to QED’s structural alert penalty) and 22% empirical (genuine physicochemical disfavour from elevated MW and LogP).
A high-QED cluster (n=1,615) shows 7.2-fold phenyl depletion with enrichment for saturated groups — thioethers, tertiary amines, ethers — demonstrating drug-likeness without aromatic dominance, though representing only 2% of high-QED space.
Sulfonamide–heterocycle–nitrile pharmacophore signatures recur across all five QED strata with non-random co-occurrence (p«0.001, χ²). Comparison with Morgan fingerprint baselines confirms 72% top-10 enrichment agreement, while VGAE provides 18% lower quantization error.
Functional groups identified via substructure matching (22 types, 96.3% agreement with RDKit), not learned by encoder. Bootstrap resampling (1,000 iterations) yields mean cluster stability of ARI = 0.68. Six ablation experiments confirm each pipeline component contributes to final resolution.
Functional groups — hydroxyl, carbonyl, amine, sulfonyl, and hundreds more — are the primary carriers of chemical reactivity, physicochemical properties, and biological activity. Each imparts characteristic properties: carboxylic acids introduce ionisability at physiological pH, amide bonds provide hydrogen bonding, aromatic rings contribute hydrophobicity, halogens modulate lipophilicity and metabolic resistance.
Drug-likeness — the degree to which a molecule’s properties are consistent with oral bioavailability — serves as a key filtering criterion in early-stage drug discovery. Lipinski’s Rule of Five provides binary pass/fail filtering (MW ≤500, LogP ≤5, HBD ≤5, HBA ≤10), but cannot quantify degree of drug-likeness. The Quantitative Estimate of Drug-likeness (QED) addresses this gap as a continuous 0–1 score integrating eight molecular descriptors, calibrated against 771 FDA-approved oral drugs.
Despite the recognised importance of functional groups, systematic large-scale analyses relating functional group composition to quantitative drug-likeness scores remain sparse. Traditional fingerprint methods flatten molecular topology into fixed-length bit vectors, losing the structural context that makes chemistry meaningful.
This study extends a prior analysis by the present authors applying a feed-forward autoencoder and Deep SOM to the same 249,455 molecules. That study found sp² hybridisation correlated negatively with drug-likeness (r = −0.45) while Fsp³ correlated positively (r = +0.26), but was limited by a flat feature vector, deterministic autoencoder, and aggregate atomic statistics rather than functional group analysis.
The present work advances to full molecular graph representations, graph attention network encoding, variational inference for smooth latent manifolds, and a 22-type functional group vocabulary with formal enrichment testing. The primary methodological contribution is counterfactual QED analysis, which disentangles scoring-function artefacts from genuine chemical signals.
What is the functional group landscape of drug-like chemical space, and how does it relate to known structural trends in approved drugs?
What are the quantitative functional group–property relationships, with effect sizes and confidence intervals, for QED, LogP, and synthetic accessibility?
Does unsupervised clustering resolve interpretable FG-level substructure? What is the relationship between aromatic character and drug-likeness?
To what extent are QED-stratified patterns artefacts of QED’s construction versus genuine chemical signals? Addressed through counterfactual QED decomposition — the study’s primary methodological contribution.
249,455 drug-like molecules from ZINC15 (Lipinski-compliant, commercially available), deduplicated by canonical SMILES. Each molecule represented as undirected graph G = (V, E) with 29-dimensional node features (atom type one-hot, degree, charge, hybridisation, aromaticity, ring membership, atomic mass, chirality) and 9-dimensional edge features (bond type, conjugation, ring membership, stereochemistry). Mean graph size: 23.2 atoms, 24.9 bonds; totalling 5.78M atoms and 6.21M bonds.
Substructure pattern matching identifies 22 functional group types across six categories: oxygen-containing (hydroxyl, carboxyl, ester, ether, ketone, aldehyde, epoxide), nitrogen-containing (primary/secondary/tertiary amine, amide, nitro, nitrile, imine), sulfur-containing (thiol, thioether, sulfonyl, sulfoxide), halogens, ring systems (phenyl, heterocycle), and phosphorus (phosphate). Priority-based two-pass algorithm resolves overlapping substructures. Validated at 96.3% agreement with RDKit (range: 91.8% ketone to 99.7% nitro).
Encoder: input projection (29→64) followed by three GAT layers with edge-aware attention, residual connections, and ReLU. Global attention pooling produces graph-level embeddings. Two parallel heads project to mean μ and log-variance (both ∈ ℝ¹&sup6;). Sampling via reparameterisation trick. Decoder reconstructs node features via three linear layers (16→64→128→29). Loss: MSE + βKL with β = 0.001 (selected via grid search to prevent posterior collapse). Trained 100 epochs with Adam, cosine annealing, batch size 128. Validation MSE = 0.0512 (training: 0.0505, Δ = 1.4%).
Molecules divided into five strata via automated valley detection on QED distribution (breakpoints at 0.399, 0.520, 0.694, 0.814). Per-stratum SOM trained on 16-dim latent embeddings with autotuned grid size (10×10 to 30×30, composite score of QE + TE + ActiveRatio). Training: 128 epochs, Gaussian neighbourhood, K-Means++ init. Clusters characterised by functional group census, enrichment ratios with 95% CIs, Fisher’s exact test with Benjamini–Hochberg FDR correction (α = 0.05). Total: 68,836 enrichment tests with three safeguards.
QED recomputed with the structural alert component removed (QEDno-alert), using geometric mean of seven non-alert desirability functions. Each molecule re-stratified using same valley-detection algorithm. Decomposes observed gradients into entailed components (attributable to QED’s construction) and empirical components (genuine chemical signals). Primary methodological contribution of the study.
Full pipeline in Rust (2021 edition) using Burn 0.16 (wgpu/Metal GPU backend), petgraph 0.7, rayon for parallelism. Training on Apple M2 Ultra (76-core GPU) completed in ~45 minutes. Throughput: 5,496 molecules/second. Visualisation via UMAP-rs with plotters SVG output in colorblind-safe palettes. Fixed random seed (42) for primary run; metrics reported as mean ± std over 5 seeds.
The dataset comprises 249,455 molecules with mean QED = 0.728 ± 0.140, mean LogP = 2.457 ± 1.434, and mean SAS = 3.053 ± 0.835. All 22 functional group types were detected. Prevalence is highly skewed: phenyl (83.0%), amide (68.0%), and heterocycle (58.0%) dominate, while nitro (4.3%), carboxyl (3.8%), and phosphate (0.1%) are rare.
83.0% of molecules contain ≥1 phenyl ring (mean 2.42 rings/molecule). 96.2% contain at least one ring system. Phenyl accounts for 604,208 total occurrences — followed by heterocycle (338,210), amide (221,905), halide (135,900), and ether (122,335).
Nitro shows strongest negative QED correlation (r = −0.321). Phenyl dominates LogP (r = +0.397) and SAS (r = −0.427). Amide has strongest SAS association (r = −0.290), reflecting amide coupling as the most-used C–N bond-forming reaction. No single binary FG–property correlation exceeds |r| = 0.43.
Reconstruction MSE = 0.0505 (43% error reduction over per-atom-type baseline). Dimension 0 carries highest variance (σ² = 0.952) encoding phenyl presence (|r| = 0.543). Dimensions 5 and 7 show strongest QED associations (r = +0.322, +0.315). Negligible overfitting (Δ = 1.4% train/val gap).
Automated valley detection yields five strata: S0 (n=6,830, QED <0.40), S1 (n=17,622), S2 (n=60,427), S3 (n=83,673), S4 (n=80,903, QED >0.81). Quantization error drops 33% from S0 (1.10) to S4 (0.74), indicating high-QED molecules occupy progressively more compact latent space.
Cluster 0 (n=1,615, QED̄ = 0.847, 2.0% of S4): 7.2-fold phenyl depletion (11.5% vs. 83.0% overall). Enriched for thioether (1.89×, 95% CI [1.62, 2.20]), tertiary amine (1.81×, CI [1.58, 2.07]), ether (1.41×, CI [1.24, 1.60]). Identifies saturated 3D scaffolds — piperidines, morpholines, tetrahydropyrans — achieving drug-likeness without aromatic dominance.
Sulfonyl-enriched clusters (1.5–2.1×, all padj < 0.01) consistently co-enrich for nitrile (1.3–1.8×), imine (1.4–2.6×), and heterocyclic nitrogen (1.2–1.5×) across all five strata. Non-random co-occurrence confirmed (p « 0.001, Yates-corrected χ²). Matches known pharmacophore of ATP-competitive kinase inhibitors and sulfonamide antibacterials.
Largest high-QED cluster (n=3,347, QED̄ = 0.871). Carboxyl enrichment 3.67× [3.3, 4.1], halide 1.51× [1.4, 1.6], sulfonyl 1.50× [1.3, 1.7]. Challenges the assumption carboxylic acids preclude oral bioavailability — many approved drugs (ibuprofen, valsartan, atorvastatin) use anionic carboxylate for salt-bridge target engagement.
Nitro prevalence drops monotonically: S0 = 29.6%, S1 = 17.5%, S2 = 9.2%, S3 = 4.6%, S4 = 1.3% (>22-fold reduction). Counterfactual analysis decomposes this: ~78% attributable to QED’s structural alert penalty, ~22% reflects genuine physicochemical disfavour (nitro-bearing molecules have +34 Da MW and +0.8 LogP at matched ring count).
Cluster 600 (n=302, LogP = −0.19): primary amine 3.8× (padj < 10−12), hydroxyl 3.7× (padj < 10−10), carboxyl 2.5× (padj < 10−6). Identifies amino acid derivatives that fail QED not because pharmacologically inert, but because QED penalises low LogP and high polar surface area — a known limitation for transporter-mediated drugs like gabapentin.
1024-bit Morgan fingerprints (ECFP4) reduced to 16 dims via PCA produce 18% higher QE. ARI between methods: 0.34 (different but overlapping partitions). 72% of top-10 enrichments shared. Broad patterns robust to embedding method, but Morgan/PCA cannot resolve phenyl-depleted Cluster 0 as distinct (1,615 molecules split across 8 clusters with enrichment <2.0×).
The functional group census reveals a striking aromatic bias: 83.0% of 249,455 ZINC15 molecules contain at least one phenyl ring, with a mean of 2.42 aromatic rings per molecule. This dominance reflects the well-documented bias toward flat, sp²-rich scaffolds in commercial chemical libraries.
Excessive aromaticity impairs drug-likeness through three interconnected mechanisms: each additional aromatic ring increases LogP ~0.5 units, reducing aqueous solubility; electron-rich π-systems are preferential substrates for CYP-mediated oxidative metabolism (particularly CYP1A2, CYP3A4), reducing oral bioavailability; and flat aromatic surfaces promote crystal packing and π-stacking, reducing dissolution rates and increasing plasma protein binding.
Stratum 4’s Cluster 0 provides granular evidence for the “escape from flatland” hypothesis: 7.2-fold phenyl depletion with enrichment for saturated groups indicates piperidine, morpholine, and tetrahydropyran scaffolds. However, this cluster represents only 2.0% of Stratum 4 (n=1,615 of 80,903), so phenyl-depleted drug-likeness remains uncommon — likely reflecting synthetic accessibility bias, as flat aromatic scaffolds dominate vendor catalogues via well-established cross-coupling reactions.
LogP decreases with increasing QED (S0 mean = 3.21 → S4 mean = 1.97; Δ = −1.24, p < 10−10). Conversely, SAS increases with QED (S0 mean = 2.52 → S4 mean = 3.48; Δ = +0.96). Molecule-level correlation is moderate (r = +0.31, R² = 0.096) — substantial within-stratum variance implies “Pareto-efficient” molecules achieving high QED without proportionate SAS penalties.
Quantization error decreases monotonically from Stratum 0 (QE = 1.10) to Stratum 4 (QE = 0.74) — a 33% reduction. Topographic error drops 44%, and U-matrix maxima decrease 33%. Together, these metrics indicate drug-like molecules occupy a progressively more compact latent space region with smoother inter-cluster transitions. Stratum 4 has the smoothest internal topology (U-max = 0.074), suggesting the high-QED latent surface can serve as a navigation tool for scaffold hopping via centroid interpolation.
Six ablation experiments confirm each component’s contribution. Deterministic AE (β=0) achieves lower reconstruction loss but fragments latent space (+12% QE, 76% FG agreement). GCN encoder degrades both metrics. Fixed 10×10 SOM shows largest degradation (64% FG agreement). However, 15×15 grids reproduce all principal findings within 15% enrichment ratios.
1,000 bootstrap iterations (80% subsample) yield mean ARI = 0.68 (range: 0.61 for S0 to 0.74 for S4). Highlighted clusters have per-cluster Jaccard stability 0.72 (Cluster 0) and 0.78 (Cluster 899). 5-fold cross-validation yields 82% mean concordance for top-10 enriched FGs per stratum. Moderate stability — clusters are locally enriched neighbourhood partitions, not sharply defined subpopulations.
(1) Aromatic ring reduction viable but uncommon (2% of high-QED space). (2) Sulfonamide pharmacophores are QED-robust and combinatorially predictable across all strata. (3) 3.67× carboxyl enrichment in S4’s largest cluster challenges the assumption carboxylic acids preclude oral bioavailability. (4) Nitro 78%/22% decomposition quantifies known structural alert contribution.
No external bioactivity validation (most significant limitation). ZINC15 biased toward synthetically tractable scaffolds. QED calibrated on 771 historical oral drugs, doesn’t account for PROTACs or non-oral administration. 22-type FG vocabulary misses boronic acids, azetidines, covalent warheads. Detection is stereo-agnostic. VGAE decoder broadcasts single code to all nodes. Cluster stability moderate (ARI = 0.68).
This work contributes an integrated analysis pipeline — graph-level variational encoding, stratified SOM clustering, formal enrichment testing — and applies it to 249,455 ZINC15 molecules. The primary methodological contribution is counterfactual QED decomposition, disentangling scoring-function artefacts from genuine chemical patterns: the >22-fold nitro prevalence gradient decomposes into 78% entailed by QED’s alert penalty and 22% empirical.
Phenyl-depleted drug-likeness (n=1,615, 2% of high-QED space) and recurring sulfonamide pharmacophore co-occurrence quantitatively confirm established observations with effect sizes, confidence intervals, and FDR-controlled significance. Broad patterns robust to embedding method (72% top-10 agreement with Morgan fingerprints), while VGAE provides finer resolution (18% lower QE). Cluster stability moderate (ARI = 0.68, 1,000 bootstrap iterations).
The study is descriptive; no bioactivity validation was performed. The critical next step is overlaying ChEMBL bioactivity data to test whether the structural organisation identified here predicts shared pharmacological activity. Additional priorities include extending to the COCONUT natural product database, incorporating 3D conformer generation, and using stratified SOM centroids as conditioning variables for constrained molecular generation.
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S., & Hopkins, A. L. (2012). Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2), 90–98.
Lipinski, C. A., Lombardo, F., Dominy, B. W., Feeney, P. J. (1997). Experimental and computational approaches to estimate solubility and permeability in drug discovery. Advanced Drug Delivery Reviews, 23(1–3), 3–25.
Sterling, T., & Irwin, J. J. (2015). ZINC 15 — Ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11), 2324–2337.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph attention networks. ICLR 2018.
Kipf, T. N., & Welling, M. (2016). Variational graph auto-encoders. NeurIPS Workshop on Bayesian Deep Learning.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480.
Kohonen, T. (2001). Self-Organizing Maps. Springer Series in Information Sciences, Vol. 30.
Ertl, P., & Schuffenhauer, A. (2009). Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1, 8.
Lovering, F., Bikker, J., & Humblet, C. (2009). Escape from flatland: Increasing saturation as an approach to improving clinical success. Journal of Medicinal Chemistry, 52(21), 6752–6756.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
Meanwell, N. A. (2018). Fluorine and fluorinated motifs in the design and application of bioisosteres. J. Med. Chem., 61(14), 5822–5880.
Supuran, C. T. (2017). Special issue: Sulfonamides. Molecules, 22(10), 1642.