Modular vs Monolithic Architectures for GRN Edge Prediction
Gradient Stability Analysis and a Controlled Cross-Architecture Comparison
A controlled comparison of modular two-tower models against monolithic cross-encoders for GRN inference — diagnosing three critical gradient failures in the two-tower design and demonstrating that the cross-encoder outperforms it, especially under class imbalance.
Abstract
Introduction
This paper investigates the design trade-offs between modular two-tower models (using separate pathways and similarity scoring) and monolithic cross-encoders (processing pairs jointly through MLPs) for predicting transcription factor–gene interactions in gene regulatory networks.
Methods
Three critical gradient failures in the original two-tower implementation are identified and corrected: a double sigmoid in the backward pass, a numerically unstable gradient estimator, and missing gradient clipping. Both architectures are then evaluated on balanced and 5:1 imbalanced training configurations using accuracy, F1, and AUROC over five random seeds with bootstrap confidence intervals.
Results
After gradient correction, the cross-encoder achieves 83.03% accuracy versus the two-tower's 80.90%, with AUROC 0.904 versus 0.810 under balanced training. Under 5:1 imbalance, the two-tower's accuracy degrades by 6.87 percentage points while the cross-encoder's drops by only 1.56. Each gradient failure alone collapses training accuracy to ~50%.
Discussion
The gradient failures explain a substantial portion of the two-tower's underperformance relative to the cross-encoder. Even after correction, the cross-encoder's joint processing of TF–gene pairs provides a fundamentally better inductive bias for interaction prediction than independent encoding followed by similarity scoring.
Conclusion
For GRN link prediction, monolithic cross-encoders outperform modular two-tower models on both accuracy and imbalance robustness, particularly in realistic biological settings where negative examples far outnumber positives.
Introduction
Two-tower (dual-encoder) models are attractive for link prediction tasks because they can independently encode entities and support efficient nearest-neighbor retrieval at inference time. However, their modular design limits the model's ability to capture joint interactions between input pairs, which may be critical for biological link prediction.
Key Question
Does the modular separation of a two-tower model fundamentally limit its expressivity for GRN inference, or can the gap be closed by correcting implementation-level gradient failures?
This study diagnoses three previously unreported gradient failures in a state-of-the-art two-tower GRN model, corrects them, and then conducts a controlled comparison against a monolithic cross-encoder to isolate the effect of architecture from implementation bugs.
Methods
Gradient Failure Diagnosis
Three critical failures were identified in the original two-tower backward pass: (1) a double sigmoid producing vanishing gradients, (2) a numerically unstable gradient estimator, and (3) missing gradient clipping causing exploding updates. Each failure individually collapses training to ~50% accuracy.
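The double-sigmoid failure is easy to reproduce in a hand-written backward pass. The sketch below is a hypothetical reconstruction (the original code is not reproduced here), assuming a binary cross-entropy loss on logits: the forward pass already produces a probability p = σ(z), the buggy backward applies σ a second time, and the σ′(z) factor that BCE-with-logits normally cancels survives, shrinking the gradient exactly where the model is most wrong.

```rust
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Correct gradient of BCE-with-logits with respect to the logit z:
/// dL/dz = sigmoid(z) - y. The sigmoid derivative cancels analytically.
fn bce_grad_correct(z: f64, y: f64) -> f64 {
    sigmoid(z) - y
}

/// Buggy variant (illustrative): the forward pass already produced the
/// probability p = sigmoid(z), and the backward pass treats p as a logit,
/// applying sigmoid again. The cancellation fails, so the gradient keeps
/// an extra p * (1 - p) factor that vanishes whenever z saturates.
fn bce_grad_double_sigmoid(z: f64, y: f64) -> f64 {
    let p = sigmoid(z); // probability from the forward pass
    let q = sigmoid(p); // the bug: sigmoid applied a second time
    (q - y) * p * (1.0 - p) // leftover sigmoid-derivative factor
}

fn main() {
    // A confidently *wrong* positive example (z = -6, y = 1), where a
    // large corrective gradient is needed most:
    let (z, y) = (-6.0, 1.0);
    println!("correct grad: {:.4}", bce_grad_correct(z, y)); // ≈ -0.9975
    println!("buggy grad:   {:.4}", bce_grad_double_sigmoid(z, y)); // ≈ -0.0012
}
```

With the bug, the corrective signal on badly misclassified pairs is several hundred times smaller than it should be, which is consistent with training stalling at chance-level accuracy.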
Architecture Definitions
The corrected two-tower model uses separate entity and expression encoders combined via dot-product similarity. The cross-encoder concatenates TF and gene features before processing through a shared MLP. Both use per-batch Adam with gradient clipping (threshold: 5.0) and stable numerics.
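The two scoring heads can be sketched as follows. Dimensions, weights, and function names are illustrative, and the clipping variant shown (global-norm) is one common choice; the text states only a threshold of 5.0.

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Two-tower head: TF and gene are encoded independently; the pair's
/// interaction is reduced to a single dot-product similarity.
fn two_tower_score(tf_emb: &[f64], gene_emb: &[f64]) -> f64 {
    dot(tf_emb, gene_emb)
}

/// Cross-encoder head: raw TF and gene features are concatenated first,
/// so the hidden layer sees cross-terms between the pair. One ReLU
/// hidden layer stands in for the shared MLP.
fn cross_encoder_score(tf: &[f64], gene: &[f64], w1: &[Vec<f64>], w2: &[f64]) -> f64 {
    let joint: Vec<f64> = tf.iter().chain(gene).copied().collect();
    let hidden: Vec<f64> = w1.iter().map(|row| dot(row, &joint).max(0.0)).collect();
    dot(w2, &hidden)
}

/// Global-norm gradient clipping at the stated threshold of 5.0:
/// rescale the whole gradient vector when its L2 norm exceeds max_norm.
fn clip_by_global_norm(grads: &mut [f64], max_norm: f64) {
    let norm = grads.iter().map(|g| g * g).sum::<f64>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

fn main() {
    println!("two-tower: {}", two_tower_score(&[1.0, 2.0], &[3.0, 4.0]));
    let mut g = vec![30.0, 40.0]; // norm 50 -> rescaled to norm 5
    clip_by_global_norm(&mut g, 5.0);
    println!("clipped: {:?}", g);
}
```

The structural difference is visible in the signatures: the two-tower head can only combine finished embeddings, while the cross-encoder's hidden layer mixes raw TF and gene features before any reduction.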
Controlled Evaluation
Both architectures are evaluated on human brain scRNA-seq data (47,388 TF–gene pairs) under two training regimes: balanced (1:1 positive:negative) and imbalanced (5:1). Metrics include accuracy, F1, and AUROC over five random seeds with bootstrap confidence intervals.
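A minimal sketch of the interval step, assuming a percentile bootstrap over the per-seed metric values (the text does not specify whether resampling is over seeds or over test pairs). A tiny linear congruential generator keeps the example free of external crates.

```rust
// Minimal xorshift-free LCG; good enough for an illustrative resample.
fn lcg_next(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

/// Percentile bootstrap CI for the mean of `values` (e.g. five per-seed
/// accuracies): resample with replacement, record each resample's mean,
/// and read off the alpha/2 and 1 - alpha/2 quantiles.
fn bootstrap_ci(values: &[f64], n_resamples: usize, alpha: f64, seed: u64) -> (f64, f64) {
    let mut state = seed;
    let mut means: Vec<f64> = (0..n_resamples)
        .map(|_| {
            let sum: f64 = (0..values.len())
                .map(|_| values[(lcg_next(&mut state) % values.len() as u64) as usize])
                .sum();
            sum / values.len() as f64
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = ((alpha / 2.0) * n_resamples as f64) as usize;
    let hi = (((1.0 - alpha / 2.0) * n_resamples as f64) as usize).min(n_resamples - 1);
    (means[lo], means[hi])
}

fn main() {
    // Hypothetical per-seed accuracies, not the paper's actual values.
    let accs = [0.80, 0.82, 0.83, 0.84, 0.86];
    let (lo, hi) = bootstrap_ci(&accs, 1000, 0.05, 7);
    println!("95% bootstrap CI: [{:.3}, {:.3}]", lo, hi);
}
```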
Results
After gradient correction, the cross-encoder consistently outperforms the corrected two-tower across all metrics and both training regimes.
Architecture Comparison (Balanced Training)
Cross-encoder vs two-tower on human brain scRNA-seq, both with corrected gradients:

Model           Accuracy    AUROC
Cross-encoder   83.03%      0.904
Two-tower       80.90%      0.810
The cross-encoder's joint input processing provides a fundamentally stronger inductive bias for GRN link prediction, remaining robust under the class imbalance ratios typical in real biological datasets.
Gradient Failure Impact
Each of the three gradient failures individually collapsed training to approximately 50% accuracy — no better than random guessing. Diagnosing and correcting all three was a prerequisite for a fair architecture comparison.
Discussion
The gradient stability analysis reveals that a substantial portion of the two-tower's apparent underperformance in prior work was attributable to implementation bugs rather than architectural limitations. After correction, the performance gap narrows but persists.
Imbalance as a Practical Concern
Real GRN datasets are highly imbalanced: regulatory interactions are sparse against the vast space of non-interacting pairs. The two-tower's 6.87-percentage-point accuracy drop at even a mild 5:1 ratio suggests it is poorly suited to production biological applications without significant modification.
The cross-encoder's monolithic joint processing naturally captures TF–gene co-expression patterns that the two-tower's independence assumption prevents. This architectural difference, not implementation quality, ultimately drives the AUROC gap of 0.094.
Conclusion
This study provides two contributions: a gradient failure diagnosis applicable to any two-tower model implemented without automatic differentiation, and a controlled architectural comparison showing that monolithic cross-encoders outperform modular two-tower models for GRN link prediction.
For practitioners, the gradient failure checklist (double-sigmoid, unstable estimator, missing clipping) is directly applicable to other Rust or custom-framework implementations. For researchers designing GRN inference architectures, the results favor joint encoding over independent encoding followed by similarity scoring.
Future work should examine whether the two-tower's efficiency advantages (precomputed entity representations) can be retained while closing the AUROC gap through architectural modifications such as late interaction or cross-attention scoring.
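One way such a hybrid could retain precomputed entity representations is ColBERT-style late interaction ("MaxSim"): each tower emits several vectors rather than one, and the pair interacts only at scoring time. The sketch below is purely illustrative of that future direction, not a design evaluated in this study.

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim late interaction: for every TF-side vector, take its best
/// match among the gene-side vectors, then sum the maxima. Richer than
/// a single dot product, yet both sides remain precomputable offline.
fn maxsim_score(tf_vecs: &[Vec<f64>], gene_vecs: &[Vec<f64>]) -> f64 {
    tf_vecs
        .iter()
        .map(|t| {
            gene_vecs
                .iter()
                .map(|g| dot(t, g))
                .fold(f64::NEG_INFINITY, f64::max)
        })
        .sum()
}

fn main() {
    // Toy 2-vector representations per entity (hypothetical values).
    let tf = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let gene = vec![vec![2.0, 0.0], vec![0.0, 3.0]];
    println!("MaxSim score: {}", maxsim_score(&tf, &gene)); // 2 + 3 = 5
}
```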