Towards causal and interpretable virtual cells

Mikel Hernaez*

CIMA University of Navarra

mhernaez [at] unav.es

Abstract

Recent advances in single-cell and perturbational genomics have created new opportunities to model cellular behavior under genetic and disease-associated perturbations. However, most representation learning approaches remain difficult to interpret biologically and often capture statistical associations rather than mechanistic (causal) structure. Here, we develop interpretable machine learning frameworks for modeling transcriptional dysregulation through biologically grounded and causal representation learning of cell states.

We first introduce NetActivity, an interpretable autoencoder that incorporates prior biological knowledge into its architecture to infer robust gene-set activity scores from transcriptomic data. By constraining latent representations around biological processes, NetActivity enables pathway-level interpretation of disease-associated transcriptional programs. We then extend this framework to single-cell perturbation data through SENA-discrepancy-VAE, a biologically informed causal representation-learning model that integrates Perturb-seq interventions with pathway-aware latent factors. This approach aims to recover causal pathway archetypes that explain how genetic perturbations propagate through transcriptional programs and reshape cellular phenotypes.
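As a rough illustration of what incorporating prior biological knowledge into the architecture can mean, the sketch below masks an encoder's weights with a gene-set membership matrix, so that each latent score depends only on the genes annotated to that set. This is not the NetActivity implementation: the function name, the toy membership matrix, and the weights are all hypothetical, and only the encoder step is shown (no decoder and no training).

```python
def gene_set_scores(expr, weights, mask):
    """Compute gene-set scores for each sample.

    expr:    samples x genes expression values
    weights: genes x gene_sets encoder weights (hypothetical values here)
    mask:    genes x gene_sets membership, 1 if the gene belongs to the set
    A gene outside a set is multiplied by 0 and cannot affect that set's score.
    """
    n_sets = len(mask[0])
    return [
        [
            sum(x[g] * weights[g][s] * mask[g][s] for g in range(len(x)))
            for s in range(n_sets)
        ]
        for x in expr
    ]


# Toy membership (assumed): genes 0-1 belong to set 0, genes 2-3 to set 1.
mask = [[1, 0], [1, 0], [0, 1], [0, 1]]
weights = [[0.5, 0.3], [-1.0, 0.7], [0.2, 2.0], [0.9, -0.5]]
expr = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]

scores = gene_set_scores(expr, weights, mask)

# Perturb gene 2, which belongs only to set 1.
perturbed = [row[:] for row in expr]
for row in perturbed:
    row[2] += 10.0
scores_perturbed = gene_set_scores(perturbed, weights, mask)
```

In this toy setup, changing a gene annotated only to the second set shifts that set's score and leaves the first set's score untouched, which is the property that makes each latent unit readable as a gene-set activity.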

Together, these methods provide a step toward interpretable virtual cells: predictive and perturbable computational models that connect gene expression, biological processes, and causal mechanisms of disease.

Keywords: virtual cells, causality, Perturb-seq, single-cell