Two-Tower Hybrid Embedding Networks for GRN Inference
Gene Regulatory Network Inference from Single-Cell Transcriptomics
Guessing which genes turn other genes on or off, using single-cell readings of what each cell is doing.
A pure-Rust two-tower MLP that learns entity embeddings and cell-type expression profiles to predict transcription factor–gene interactions: 83% ensemble accuracy, CPU-trainable without any deep learning framework.
Concept Overview
Abstract
Introduction
Inferring gene regulatory networks (GRNs) from single-cell RNA-seq data is a central challenge in computational biology. This paper proposes a two-tower multilayer perceptron that jointly learns entity embeddings for transcription factors and target genes alongside cell-type-specific expression profiles to predict regulatory interactions.
Cells are run by genes that switch each other on and off. Mapping those switches from single-cell data is a hard, central puzzle in biology. We try a simple setup: two small neural networks (one reads the controlling gene, one reads the gene being controlled), and they meet in the middle to score whether one likely controls the other. The model also takes in how active each gene is across different cell types.
Methods
Separate encoder towers process 512-dimensional learnable embeddings and 11-dimensional cell-type expression profiles through three fully connected layers with batch normalization and dropout. Interaction scores are produced via temperature-scaled dot-product similarity (τ = 0.05). The system is implemented entirely in Rust without external deep learning dependencies.
Each gene gets a numerical fingerprint, a list of 512 numbers the model learns on its own. Each cell type gets a smaller 11-number profile of gene activity. Both pass through three layers of a small neural network that smooths and stabilizes the signal. To check if one gene likely controls another, we measure how aligned the two fingerprints are, with a tuning knob (τ = 0.05) that sharpens the comparison. Everything runs in Rust, with no PyTorch or TensorFlow.
Results
On a human brain dataset with 47,388 TF–gene pairs, the single model achieves 80.14% accuracy and an AUROC of 0.844. A 5-model ensemble reaches 83.06% accuracy. Hyperparameter optimization alone contributed 77% of the total improvement (+16.1 pp), and the low cross-seed variability (coefficient of variation, CV = 2.06%) indicates that the model is robust to initialization.
We test on 47,388 gene pairings from human brain data. One model gets 80.14% right, with a ranking-quality score (AUROC) of 0.844 out of 1. Averaging 5 models pushes that to 83.06% accuracy. Most of the improvement (77%, or +16.1 percentage points) came from carefully tuning the model's settings, not fancy new tricks. Results barely shift when we re-run with different starting points (a spread of 2.06% relative to the average), so the model is steady.
Discussion
A competing cross-attention model (GCAN) achieved 91% accuracy but an F1-score of only 0.692 because of poor recall, so it underperforms on balanced prediction metrics. This highlights an accuracy–F1 paradox in which high accuracy masks weak detection of true regulatory links under class imbalance.
A rival model, GCAN, looks better at first glance with 91% accuracy. But its F1 score (which measures how well it actually catches real cases) is only 0.692. It misses too many true regulatory links. So a flashy accuracy number can hide a model that fails when the data is uneven.
Conclusion
A carefully optimized standard MLP can achieve competitive GRN inference results (83%) while remaining CPU-trainable, challenging the assumption that complex architectures are necessary for strong performance on this task.
A plain, well-tuned neural network hits 83% on this task and trains on a regular laptop CPU. That pushes back on the idea that you need fancy, complicated models to do this well.
Introduction
Gene regulatory networks encode the interactions between transcription factors (TFs) and target genes that govern cellular identity and function. Reconstructing these networks from single-cell RNA-seq data is challenging due to the high dimensionality, sparsity, and cell-type heterogeneity of expression measurements.
Cells decide what to be (a neuron, a skin cell, a muscle cell) based on which genes are switched on. Some genes are the switches; the rest are the targets. The map of who controls whom is called a gene regulatory network. We try to rebuild that map from single-cell readings of gene activity. It is hard work: the data is enormous, mostly empty, and looks very different from one cell type to the next.
Key Motivation
Prior work on GRN inference often relies on complex graph neural networks or attention mechanisms. This study investigates whether a well-tuned, simple MLP can close the gap, with the added benefit of full CPU trainability and no framework dependencies.
Most prior work on this problem reaches for complex modern AI tools. We wanted to see if a simple, classic neural network, tuned with care, could keep up. As a bonus, our version runs on a normal CPU and uses no off-the-shelf AI frameworks.
This paper presents a two-tower architecture that separately encodes TF and gene identities through learned embeddings, while a second pathway encodes cell-type-specific expression profiles. The towers are combined via temperature-scaled dot-product similarity to produce regulatory link predictions.
The model has two parts that work side by side. One part looks at the switch gene, the other at the target gene. Each part turns the gene into a numerical fingerprint it learns from data. A second pathway adds context about how active each gene is in different cell types. The two fingerprints are then compared, and how closely they line up becomes the model's guess at whether one really controls the other.
Methods
Dataset & Regulatory Priors
Human brain single-cell RNA-seq data was paired with regulatory ground truth from the DoRothEA and TRRUST databases, yielding 47,388 TF–gene pairs split 70/15/15 for training, validation, and testing.
We start with single-cell gene activity readings from the human brain. For known answers, we use two public databases of confirmed gene-controlling-gene links: DoRothEA and TRRUST. That gives us 47,388 known pairings. We split them 70/15/15: most for teaching the model, the rest for checking and testing it.
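The 70/15/15 split can be sketched in a few lines of dependency-free Rust. This is only an illustration: the function names are hypothetical, the rounding rule (floor for training and validation, remainder to test) is an assumption, and the pairs are assumed to be shuffled already, since the source does not say how the split was drawn.

```rust
/// Partition sizes for `n` examples under a 70/15/15 split.
/// Assumption: train and validation sizes are floored and the
/// test set takes whatever remains.
fn split_sizes(n: usize) -> (usize, usize, usize) {
    let train = n * 70 / 100;
    let val = n * 15 / 100;
    let test = n - train - val;
    (train, val, test)
}

/// Cut an already-shuffled slice into three contiguous partitions.
fn split<T>(items: &[T]) -> (&[T], &[T], &[T]) {
    let (n_train, n_val, _) = split_sizes(items.len());
    let (train, rest) = items.split_at(n_train);
    let (val, test) = rest.split_at(n_val);
    (train, val, test)
}
```

Under these assumptions, `split_sizes(47388)` gives 33,171 training, 7,108 validation, and 7,109 test pairs.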
Two-Tower Architecture
Each tower consists of three fully connected layers with batch normalization and dropout. The entity tower processes 512-dimensional learnable embeddings for TFs and genes; the expression tower processes 11-dimensional cell-type mean expression profiles.
Each tower is a small three-layer neural network with two stabilizing tricks built in. One tower turns each gene into a 512-number fingerprint the model learns. The other tower reads an 11-number summary of how active that gene is across different cell types. Side by side, they form the two-tower setup: one tower for the switch gene, one for the target gene.
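A rough sketch of what one tower's forward pass might look like at inference time is shown below. Several details are assumptions, not taken from the source: the ReLU activation, the ordering (linear, then batch normalization, then activation), leaving the last layer without an activation, and all type and field names. Dropout is omitted because it acts as the identity at inference, and layer widths are left to the caller because the source does not list them.

```rust
/// One fully connected layer followed by inference-time batch normalization.
struct Layer {
    weights: Vec<Vec<f32>>, // shape [out_dim][in_dim]
    bias: Vec<f32>,         // shape [out_dim]
    mean: Vec<f32>,         // running mean per output unit
    var: Vec<f32>,          // running variance per output unit
    gamma: Vec<f32>,        // batch-norm scale
    beta: Vec<f32>,         // batch-norm shift
}

impl Layer {
    /// Linear -> batch norm -> optional ReLU (ReLU is an assumption).
    fn forward(&self, x: &[f32], activate: bool) -> Vec<f32> {
        const EPS: f32 = 1e-5;
        self.weights
            .iter()
            .enumerate()
            .map(|(j, row)| {
                let z = row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + self.bias[j];
                let normed =
                    self.gamma[j] * (z - self.mean[j]) / (self.var[j] + EPS).sqrt() + self.beta[j];
                if activate { normed.max(0.0) } else { normed }
            })
            .collect()
    }
}

/// A stack of layers; the source describes three per tower.
struct Tower {
    layers: Vec<Layer>,
}

impl Tower {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        let last = self.layers.len().saturating_sub(1);
        let mut h = input.to_vec();
        for (i, layer) in self.layers.iter().enumerate() {
            // Every layer except the final one applies the activation.
            h = layer.forward(&h, i < last);
        }
        h
    }
}
```

Writing the layers as plain `Vec<f32>` arithmetic mirrors the no-framework constraint, at the cost of having to hand-write the matching backward pass for training.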
Similarity Scoring & Training
Tower outputs are combined via temperature-scaled dot-product similarity (τ = 0.05) with sigmoid activation. Training uses the Adam optimizer with L2 regularization (λ = 0.01), a learning rate of 5×10⁻³, and early stopping with 10-epoch patience.
To score a pair, we measure how aligned the two fingerprints are, with a sharpness knob (τ = 0.05) to tune how decisive the score is. That score is then squeezed into a 0-to-1 probability. Training uses Adam, a well-known method for nudging the model toward better answers, plus a smoothing penalty (λ = 0.01) and a learning rate of 5×10⁻³. We stop training once the model stops improving for 10 rounds.
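The scoring step described above can be written as a small, self-contained sketch. The function names are hypothetical, and the code assumes a raw dot product with no vector normalization, since the source does not say whether tower outputs are normalized first.

```rust
/// Temperature from the paper; dividing by 0.05 multiplies the logit by 20.
const TAU: f32 = 0.05;

/// Plain dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Logistic sigmoid, mapping any real logit into (0, 1).
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Probability that the TF regulates the gene:
/// sigmoid(dot(tf, gene) / tau).
fn interaction_probability(tf: &[f32], gene: &[f32]) -> f32 {
    sigmoid(dot(tf, gene) / TAU)
}
```

With τ = 0.05, a dot product of only 0.1 already becomes a logit of 2 and a probability of about 0.88, while orthogonal vectors score exactly 0.5. That is the "sharpening" effect of a small temperature.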
Ensemble & Evaluation
Five independently trained models are aggregated by averaging predicted probabilities. Cross-seed evaluation across five random initializations measures run-to-run variability (coefficient of variation, CV = 2.06%). Metrics include accuracy, F1, and AUROC, along with a comparison against GCAN (cross-attention baseline).
We train five separate copies of the model and average their guesses, the way a panel of judges is more reliable than one. Five random starting points show how much the results jitter (only 2.06%, so very little). We compare against a more complex rival called GCAN and report accuracy, F1, and AUROC, three different ways of measuring how good the model's predictions are.
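Probability averaging and the coefficient of variation are both simple enough to sketch directly. The function names are hypothetical, and two details are assumptions: the CV uses the sample standard deviation (dividing by n − 1), and it is applied to one metric value per seed. The source does not state either convention.

```rust
/// Average per-example probabilities across models.
/// `predictions[m][i]` is model m's probability for example i.
fn ensemble_average(predictions: &[Vec<f32>]) -> Vec<f32> {
    let n_models = predictions.len() as f32;
    let n_examples = predictions.first().map_or(0, |p| p.len());
    (0..n_examples)
        .map(|i| predictions.iter().map(|p| p[i]).sum::<f32>() / n_models)
        .collect()
}

/// Coefficient of variation in percent: 100 * std / mean.
/// Assumption: sample standard deviation (n - 1 in the denominator).
fn coefficient_of_variation(values: &[f64]) -> f64 {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    100.0 * var.sqrt() / mean
}
```

Averaging probabilities instead of hard votes keeps each model's confidence in play, so one strongly confident model can outweigh two marginal ones.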
Results
Performance is evaluated on 47,388 human brain TF–gene pairs across single and ensemble configurations.
We score the model on 47,388 brain gene pairings, both as a single model and as a five-model panel.
Model Performance Comparison
Single model vs ensemble vs GCAN baseline on human brain scRNA-seq data.
GCAN's 91% accuracy masks a 0.692 F1-score driven by poor recall. The two-tower ensemble is more reliable on balanced prediction despite a lower headline accuracy.
Accuracy Attribution
Hyperparameter tuning alone contributed +16.1 percentage points (77% of total improvement). Ensemble aggregation added +2.9 pp and expression features +1.8 pp.
Most of the gain came from one boring-sounding thing: carefully tweaking the model's settings. That alone added +16.1 percentage points, or 77% of the total improvement. Averaging five models added another +2.9 points, and feeding in cell-type activity added +1.8 more.
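The 77% figure can be checked with a short calculation, assuming the "total improvement" is the sum of the three listed gains (16.1 + 2.9 + 1.8 = 20.8 pp). The helper name below is hypothetical.

```rust
/// Share of each component in the summed gain, in percent.
/// Assumption: the total improvement is the sum of the listed components.
fn attribution_shares(gains_pp: &[f64]) -> Vec<f64> {
    let total: f64 = gains_pp.iter().sum();
    gains_pp.iter().map(|g| 100.0 * g / total).collect()
}
```

Calling `attribution_shares(&[16.1, 2.9, 1.8])` yields roughly 77.4%, 13.9%, and 8.7%, which agrees with the reported 77% for hyperparameter tuning.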
Discussion
The results challenge the common narrative that GRN inference requires increasingly complex architectures. A carefully tuned two-tower MLP, implemented without any deep learning framework, achieves competitive accuracy on a challenging human brain dataset.
There is a common belief in this field that mapping gene controls needs ever more complicated AI. These results suggest otherwise. A simple two-tower model, carefully tuned and built from scratch with no AI library, holds its own on a tough human brain dataset.
The Accuracy–F1 Paradox
GCAN's 91% accuracy is inflated by class imbalance and masks its poor recall. When evaluated on balanced metrics, the two-tower model outperforms it, illustrating why accuracy alone is insufficient for imbalanced biological datasets.
GCAN's 91% accuracy is inflated because it plays it safe and misses many real cases. On fairer measures that punish those misses, our two-tower model beats it. The lesson: in biology, where positive examples are rare, raw accuracy can be misleading.
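To see how accuracy and F1 can diverge, the sketch below computes both from a confusion matrix. The type and method names are hypothetical, and the counts in the usage note are invented for illustration; they are not GCAN's actual results.

```rust
/// Counts from a binary confusion matrix.
struct Confusion {
    tp: u32,  // true positives
    fp: u32,  // false positives
    tn: u32,  // true negatives
    fn_: u32, // false negatives
}

impl Confusion {
    /// Fraction of all predictions that are correct.
    fn accuracy(&self) -> f64 {
        let total = self.tp + self.fp + self.tn + self.fn_;
        f64::from(self.tp + self.tn) / f64::from(total)
    }

    /// Fraction of real positives that the model finds.
    fn recall(&self) -> f64 {
        f64::from(self.tp) / f64::from(self.tp + self.fn_)
    }

    /// Fraction of predicted positives that are real.
    fn precision(&self) -> f64 {
        f64::from(self.tp) / f64::from(self.tp + self.fp)
    }

    /// Harmonic mean of precision and recall.
    fn f1(&self) -> f64 {
        let (p, r) = (self.precision(), self.recall());
        2.0 * p * r / (p + r)
    }
}
```

With a made-up set of 1,000 pairs of which 100 are true links, `Confusion { tp: 30, fp: 5, tn: 895, fn_: 70 }` has 92.5% accuracy but a recall of 0.30 and an F1 of about 0.44. The large pool of easy negatives carries the accuracy while most real links are missed.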
The pure-Rust implementation introduces a useful engineering constraint: no automatic differentiation, no GPU assumed, full determinism. This forced explicit numerical decisions (temperature scaling, L2 regularization) that ultimately contributed to model stability, evidenced by a cross-seed CV of only 2.06%.
Writing the model from scratch in Rust came with constraints: no helper libraries doing the math for us, no fancy graphics card assumed, and every run from the same starting point gives the same result. Those limits forced careful choices about how the model handles numbers. That care paid off: results barely move between different starting points (only 2.06% jitter).
Conclusion
This study demonstrates that a standard, well-tuned MLP implemented in pure Rust can achieve 83% ensemble accuracy on GRN inference from single-cell RNA-seq data, competitive with significantly more complex architectures.
A plain neural network, carefully tuned and written in pure Rust, can map gene controls from single-cell data with 83% accuracy. That holds its own against far more complicated models.
The dominant contribution of hyperparameter optimization over architectural complexity suggests that future work on GRN inference should prioritize thorough tuning before escalating model complexity. The CPU-trainable implementation also makes the approach accessible to researchers without GPU infrastructure.
Tuning the model's settings matters far more than piling on complexity. Future work in this area should tune carefully before reaching for fancier models. And since this runs on a regular CPU, researchers without expensive graphics cards can use it too.
The accuracy–F1 paradox observed in the GCAN comparison is a cautionary note for benchmarking in computational biology: headline accuracy on imbalanced datasets can be misleading, and balanced metrics like F1 and AUROC should be primary evaluation criteria.
The GCAN comparison is a warning to anyone benchmarking biology models. A high accuracy number can hide a model that misses most of the real cases. Fairer measures like F1 and AUROC should be the primary scorecards.