Research Paper

Modular vs Monolithic Architectures for GRN Edge Prediction

Gradient Stability Analysis and a Controlled Cross-Architecture Comparison

Should you build one big AI model, or two smaller ones that work side by side? A close look at where the smaller setup quietly breaks.

A controlled comparison of modular two-tower models against monolithic cross-encoders for GRN inference. We diagnose three critical gradient failures in the two-tower design and demonstrate that the cross-encoder outperforms it, especially under class imbalance.

0.904 Cross-encoder AUROC
83.03% Cross-encoder accuracy
−6.87 pp Two-tower imbalance drop

Concept Overview

Introduction

This paper investigates the design trade-offs between modular two-tower models (using separate pathways and similarity scoring) and monolithic cross-encoders (processing pairs jointly through MLPs) for predicting transcription factor–gene interactions in gene regulatory networks.

Cells decide what they do based on which genes are switched on. A transcription factor is a gene whose job is to flip other genes on or off. The map of who controls whom is called a gene regulatory network, or GRN. This paper tests two ways for an AI to predict these links. The two-tower approach looks at each gene by itself, then asks: do these two feel similar? The cross-encoder looks at both genes together from the start.

Methods

Three critical gradient failures in the original two-tower implementation are identified and corrected: double-sigmoid in the backward pass, unstable gradient estimators, and missing gradient clipping. Both architectures are then evaluated on balanced and 5:1 imbalanced training configurations using accuracy, F1, and AUROC over five random seeds with bootstrap confidence intervals.

Before comparing fairly, we had to fix the two-tower's learning signal. AI models learn by nudging themselves in tiny steps. We found three bugs that quietly broke those nudges: the model squashed its own signal twice, used a shaky math shortcut, and let some steps grow too large. After the fixes, we tested both models on two settings: a fair 1:1 mix of real and fake pairs, and a 5:1 mix that looks more like real biology. We ran each test five times and measured accuracy, F1, and AUROC with confidence ranges.
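The two training mixes can be sketched as negative sampling at a fixed ratio. This is a minimal illustration, not the study's pipeline: the helper name `build_training_pairs`, the toy TF and gene names, and the rule that a negative is any pair absent from the positive set are all assumptions, since the source does not say how negatives were drawn.

```python
import random

def build_training_pairs(positives, tfs, genes, neg_per_pos, seed=0):
    """Return shuffled (tf, gene, label) triples with `neg_per_pos` negatives per positive.

    Assumption: any (tf, gene) pair that is not a known positive counts as a negative.
    Enumerating all pairs is only practical for small toy sets like the one below.
    """
    rng = random.Random(seed)
    known = set(positives)
    candidates = [(t, g) for t in tfs for g in genes if (t, g) not in known]
    n_neg = neg_per_pos * len(positives)
    if n_neg > len(candidates):
        raise ValueError("not enough unlabeled pairs to sample negatives from")
    negatives = rng.sample(candidates, n_neg)
    data = [(t, g, 1) for t, g in positives] + [(t, g, 0) for t, g in negatives]
    rng.shuffle(data)
    return data

# Hypothetical toy identifiers, not data from the study.
tfs = [f"TF{i}" for i in range(4)]
genes = [f"G{i}" for i in range(10)]
positives = [("TF0", "G1"), ("TF1", "G3"), ("TF2", "G5"), ("TF3", "G7")]

balanced = build_training_pairs(positives, tfs, genes, neg_per_pos=1)    # 1:1 mix
imbalanced = build_training_pairs(positives, tfs, genes, neg_per_pos=5)  # 5 fakes per real
```

Fixing the seed per run keeps the sampled negatives reproducible, which matters when the same configuration is repeated over several seeds.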

Results

After gradient correction, the cross-encoder achieves 83.03% accuracy vs the two-tower's 80.90%, with AUROC 0.904 vs 0.810 under balanced training. Under 5:1 imbalance, the two-tower's accuracy drops by 6.87 pp while the cross-encoder's drops by only 1.56 pp. Each gradient failure alone collapses training accuracy to ~50%.

Even after the fixes, the bigger model wins. The cross-encoder gets 83.03% accuracy, the two-tower gets 80.90%. On AUROC (a score for how well the model ranks real links above fake ones), the cross-encoder scores 0.904 versus 0.810. When real pairs become rare (5 fakes for every 1 real), the two-tower loses 6.87 pp of accuracy. The cross-encoder loses just 1.56 pp. And each of the three bugs alone is enough to crash training down to about 50%, a coin flip.
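The ranking reading of AUROC used above can be written out directly: it is the fraction of (real, fake) pairs in which the real link gets the higher score, with ties counted as half. The sketch below is a small reference implementation under that definition. The scores and labels are made up for illustration and are not results from the study.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical model scores: one real link (0.35) is ranked below one fake (0.6).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))  # 8 of the 9 (real, fake) pairs are ordered correctly
```

The double loop is quadratic in the number of examples, so it suits small checks. Library implementations compute the same quantity from sorted ranks.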

Discussion

The gradient failures explain a substantial portion of the two-tower's underperformance relative to the cross-encoder. Even after correction, the cross-encoder's joint processing of TF–gene pairs provides a fundamentally better inductive bias for interaction prediction than independent encoding followed by similarity scoring.

A lot of the two-tower's poor showing came from those three bugs, not from the design itself. But fixing them does not close the gap. Looking at both genes together from the start is simply a better fit for this problem than judging each one alone and then asking how similar they feel.

Conclusion

For GRN link prediction, monolithic cross-encoders outperform modular two-tower models on both accuracy and imbalance robustness, particularly in realistic biological settings where negative examples far outnumber positives.

For predicting gene control links, the single bigger model beats the split-into-two-pieces model. It is more accurate, and it holds up better when real links are rare, which is what real biology looks like, since most random gene pairs do not actually control each other.

Two-tower (dual-encoder) models are attractive for link prediction tasks because they can independently encode entities and support efficient nearest-neighbor retrieval at inference time. However, their modular design limits the model's ability to capture joint interactions between input pairs, which may be critical for biological link prediction.

Two-tower models are popular because each tower learns to describe its own input on its own. That separation makes search fast. You can prepare all the descriptions in advance and just look up matches later. The trade-off is that the two towers never truly meet. They cannot pick up on the back-and-forth between a pair of things, and in biology, that back-and-forth is where the answer lives.
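The retrieval advantage can be sketched in a few lines: once one tower's embeddings are computed and stored, scoring a query against every candidate is just a dot product per candidate. The embeddings, gene names, and the helper `top_k_targets` below are hypothetical placeholders. In a trained model the vectors would come from the towers themselves.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k_targets(query_vec, index, k):
    """Rank precomputed embeddings by dot-product similarity to one query embedding."""
    scored = [(dot(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical embeddings, computed once in advance and reused for every query.
gene_index = {
    "GENE_A": [0.9, 0.1],
    "GENE_B": [-0.2, 0.8],
    "GENE_C": [0.5, 0.5],
}
tf_vec = [1.0, 0.0]
print(top_k_targets(tf_vec, gene_index, k=2))  # ['GENE_A', 'GENE_C']
```

A cross-encoder has no equivalent shortcut, because its score depends on both inputs jointly and therefore needs a full forward pass for every pair.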

Key Question

Does the modular separation of a two-tower model fundamentally limit its expressivity for GRN inference, or can the gap be closed by correcting implementation-level gradient failures?

Is the two-tower model held back by its split design? Or has it just been let down by buggy code, bugs that, once fixed, would let it catch up?

This study diagnoses three previously unreported gradient failures in a state-of-the-art two-tower GRN model, corrects them, and then conducts a controlled comparison against a monolithic cross-encoder to isolate the effect of architecture from implementation bugs.

This study finds three bugs in a leading two-tower model that nobody had spotted before, fixes them, and then sets up a head-to-head test against the bigger cross-encoder. The goal is to tell apart what the design itself contributes from what was just broken code.

01

Gradient Failure Diagnosis

Three critical failures were identified in the original two-tower backward pass: (1) double-sigmoid producing vanishing gradients, (2) numerically unstable gradient estimator, and (3) missing gradient clipping causing explosive updates. Each failure individually collapses training to ~50% accuracy.

Three things were going wrong with how the two-tower model learned. First, it squashed its own learning signal twice, so almost nothing was left to learn from. Second, it used a shaky math shortcut that gave noisy nudges. Third, it never put a cap on the size of those nudges, so some grew huge and blew everything up. Any one of these alone drags the model down to about 50% accuracy, a coin flip.

Gradient Analysis, Numerical Stability, Debugging
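The first failure can be made concrete with a hand-written backward pass. For binary cross-entropy on a sigmoid output, the gradient with respect to the logit z simplifies to sigmoid(z) − y. The sketch below contrasts that with one plausible form of the double-sigmoid bug, in which the sigmoid derivative is multiplied in a second time. The exact form of the bug in the original code is an assumption here, and Python is used only for brevity. The loss and sigmoid are written in numerically stable form.

```python
import math

def sigmoid(z):
    # Stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce_loss(z, y):
    # Binary cross-entropy on a logit, written as softplus(z) - y*z to avoid log(0).
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

def bce_grad_wrt_logit(z, y):
    """Correct gradient of bce_loss with respect to z."""
    return sigmoid(z) - y

def double_sigmoid_grad(z, y):
    """Hypothetical buggy version: applies the sigmoid derivative p*(1-p) a second time."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

# A confidently wrong prediction: the true label is 1 but the logit is very negative.
z, y = -8.0, 1.0
good = bce_grad_wrt_logit(z, y)   # close to -1: a strong corrective signal
bad = double_sigmoid_grad(z, y)   # close to 0: the signal has vanished
print(good, bad)
```

The example in which the model is most wrong is exactly where the extra factor shrinks the update the most, which is consistent with training stalling near chance. Comparing a hand-written gradient against a finite-difference estimate of the loss is a cheap way to catch this class of bug.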
02

Architecture Definitions

The corrected two-tower model uses separate entity and expression encoders combined via dot-product similarity. The cross-encoder concatenates TF and gene features before processing through a shared MLP. Both use per-batch Adam with gradient clipping (threshold: 5.0) and stable numerics.

The fixed two-tower has one small network for the gene's identity and another for how active it is, then asks how well those two descriptions line up. The cross-encoder glues both genes into one input and runs them through a single network. Both models use the same learning recipe: Adam, a step-size cap of 5.0, and stable math.

Two-Tower, Cross-Encoder, Adam
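The structural difference between the two designs fits in two short forward passes. This is a toy sketch with random weights, not the study's models: the layer sizes, the tanh and ReLU activations, and the names `x_a`, `x_b`, `encode`, and `random_layer` are assumptions chosen for illustration, and the two inputs are generic feature vectors standing in for whatever each encoder actually receives.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def encode(x, tower):
    """One tower: sees only its own input."""
    weights, bias = tower
    return [math.tanh(h) for h in linear(x, weights, bias)]

def two_tower_score(x_a, x_b, tower_a, tower_b):
    # The inputs meet only here, through a single dot product.
    u, v = encode(x_a, tower_a), encode(x_b, tower_b)
    return sigmoid(sum(ui * vi for ui, vi in zip(u, v)))

def cross_encoder_score(x_a, x_b, hidden, out):
    # Concatenate first, so every hidden unit sees both inputs jointly.
    h = [max(0.0, a) for a in linear(x_a + x_b, *hidden)]
    return sigmoid(linear(h, *out)[0])

def random_layer(rng, n_out, n_in):
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
x_a, x_b = [0.2, -1.0, 0.5], [1.5, 0.3, -0.7]          # hypothetical features
tower_a, tower_b = random_layer(rng, 2, 3), random_layer(rng, 2, 3)
hidden, out = random_layer(rng, 4, 6), random_layer(rng, 1, 4)

p_two_tower = two_tower_score(x_a, x_b, tower_a, tower_b)
p_cross = cross_encoder_score(x_a, x_b, hidden, out)
```

Because `encode` never sees the other input, its outputs can be cached per entity. The cross-encoder gives up that caching in exchange for hidden units that can respond to combinations of features from both inputs.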
03

Controlled Evaluation

Both architectures are evaluated on human brain scRNA-seq data (47,388 TF–gene pairs) under two training regimes: balanced (1:1 positive:negative) and imbalanced (5:1 negative:positive). Metrics include accuracy, F1, and AUROC over five random seeds with bootstrap confidence intervals.

Both models are tested on real data from single cells in the human brain (47,388 TF–gene pairs). We train them two ways: a fair 1:1 mix of real and fake pairs, and a harder 5:1 mix where most pairs are fake. We score them on accuracy, F1, and AUROC, run each setup five times, and report confidence ranges.

Bootstrap CI, Imbalance Testing, 5 Seeds
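A bootstrap confidence interval of the kind mentioned above can be sketched with the percentile method: resample the results with replacement many times, recompute the metric each time, and read off the outer percentiles. What exactly was resampled in the study (test examples, seeds, or both) is not stated, so resampling per-example correctness is an assumption, and the data below is synthetic.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Each resample draws n items with replacement and records their mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-example correctness (1 = correct), not data from the study.
correct = [1] * 83 + [0] * 17
mean_acc = sum(correct) / len(correct)
lo, hi = bootstrap_ci(correct)
print(f"accuracy {mean_acc:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The same function works for any per-example metric. For AUROC, the resampling step would draw (score, label) pairs and recompute the ranking statistic on each resample instead of taking a mean.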

After gradient correction, the cross-encoder consistently outperforms the corrected two-tower across all metrics and both training regimes.

Once the bugs were fixed, the cross-encoder still came out ahead. It wins on every score, and in both the easy and the hard training setups.

Architecture Comparison (Balanced Training)

Cross-encoder vs two-tower on human brain scRNA-seq, both with corrected gradients.

| Model | Accuracy | AUROC | Imbalance Drop (5:1) |
|---|---|---|---|
| Cross-Encoder (Monolithic MLP) | 83.03% | 0.904 | −1.56 pp |
| Two-Tower (corrected, Modular MLP) | 80.90% | 0.810 | −6.87 pp |

The cross-encoder's joint input processing provides a stronger inductive bias for GRN link prediction: it leads on accuracy and AUROC under balanced training and loses only 1.56 pp of accuracy under the 5:1 class imbalance tested here.

Gradient Failure Impact

Each of the three gradient failures individually collapsed training to approximately 50% accuracy, no better than random guessing. Diagnosing and correcting all three was a prerequisite for a fair architecture comparison.

Each of the three bugs, by itself, drags the model down to about 50% accuracy, the same as flipping a coin. All three had to be found and fixed before the two designs could be compared honestly.

The gradient stability analysis reveals that a substantial portion of the two-tower's apparent underperformance in prior work was attributable to implementation bugs rather than architectural limitations. After correction, the performance gap narrows but persists.

The deeper look shows that a lot of the two-tower's bad reputation came from broken code, not a broken idea. After the fixes, the gap shrinks, but it does not go away.

Imbalance as a Practical Concern

Real GRN datasets are highly imbalanced. Regulatory interactions are sparse against the vast space of non-interacting pairs. The two-tower's 6.87 pp accuracy drop at a mild 5:1 ratio suggests it is poorly suited for production biological applications without significant modifications.

Real gene networks are lopsided. Most random pairs of genes do not actually control each other, so true links are rare. The two-tower's 6.87 pp drop at a mild 5-to-1 ratio is a red flag. Without serious changes, it is not ready for real-world biology, where the lopsidedness is much worse.

The cross-encoder's monolithic joint processing naturally captures TF–gene co-expression patterns that the two-tower's independence assumption prevents. This architectural difference, not implementation quality, ultimately drives the AUROC gap of 0.094.

When the cross-encoder looks at both genes at once, it can spot when they switch on and off together, a pattern the two-tower is structurally blind to, because it studies each gene alone. That blind spot, not bad code, is what causes the 0.094 gap in AUROC scores.

This study provides two contributions: a gradient failure diagnosis applicable to any two-tower model implemented without automatic differentiation, and a controlled architectural comparison showing that monolithic cross-encoders outperform modular two-tower models for GRN link prediction.

This study offers two things. First, a checklist of learning-signal bugs that anyone hand-coding a two-tower model can use to debug their own work. Second, a fair head-to-head test that shows the bigger single-model design is the better fit for predicting gene control links.

For practitioners, the gradient failure checklist (double-sigmoid, unstable estimator, missing clipping) is directly applicable to other Rust or custom-framework implementations. For researchers designing GRN inference architectures, the results favor joint encoding over independent encoding followed by similarity scoring.

If you build models from scratch (in Rust or any custom setup), the three-bug checklist (double squashing, shaky math, no step-size cap) will save you headaches. If you design models to map gene networks, the lesson is clear: let the model look at both genes together from the start, rather than studying each one alone and comparing notes later.
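The third checklist item, the step-size cap, is the easiest to add to a hand-written training loop. The sketch below rescales a gradient vector whenever its L2 norm exceeds a threshold. The source gives the threshold (5.0) but not whether clipping was by norm or by value, so clipping by global norm is an assumption, as is the helper name `clip_by_global_norm`.

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale `grads` so its L2 norm is at most `max_norm`; return (clipped, original_norm)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm       # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads], norm

# An exploding gradient with norm 50 is scaled down to norm 5, keeping its direction.
clipped, norm_before = clip_by_global_norm([30.0, -40.0])
print(norm_before, clipped)
```

Clipping by norm preserves the direction of the update, which per-element clipping does not. Logging `norm_before` during training also makes explosive updates visible before they derail a run.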

Future work should examine whether the two-tower's efficiency advantages (precomputed entity representations) can be retained while closing the AUROC gap through architectural modifications such as late interaction or cross-attention scoring.

The two-tower is fast because it can do most of its work in advance. The next question is whether we can keep that speed but still close the accuracy gap, maybe by letting the two towers compare notes at the very end, instead of meeting only through a simple similarity score.
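One known form of late interaction keeps several vectors per input instead of one, then scores a pair by matching each vector from one side to its best partner on the other and summing those matches. The sketch below shows only that scoring rule. It was not evaluated in this study, and the vectors and the name `late_interaction_score` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(vecs_a, vecs_b):
    """Sum, over each vector from side A, of its best dot-product match on side B."""
    return sum(max(dot(u, v) for v in vecs_b) for u in vecs_a)

# Hypothetical multi-vector outputs from two towers.
vecs_a = [[1.0, 0.0], [0.0, 1.0]]
vecs_b = [[0.5, 0.5], [0.0, 2.0], [-1.0, 0.0]]
print(late_interaction_score(vecs_a, vecs_b))  # 0.5 + 2.0 = 2.5
```

Both sets of vectors can still be computed in advance, so most of the two-tower's speed is kept, while the scoring step is richer than a single dot product.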

Related Research