Predicting variant pathogenicity by combining protein language models and biological features

Alexandre G. de Brevern*

INSERM, Université Paris Cité, Université de la Réunion, EFS

alexandre.debrevern [at] u-paris.fr

Abstract

Predicting the pathogenic impact of missense variants is a cornerstone of genetic disease interpretation and diagnosis. In practice, this is addressed using variant effect predictors (VEPs), ranging from classical tools such as SIFT to recent deep-learning systems such as AlphaMissense. We recently benchmarked 65 VEPs and identified substantial performance variability and systematic biases, with results strongly shaped by dataset composition, curation, and circularity effects. These findings underscore the importance of rigorous evaluation design (Radjasandirane et al., 2025).

VEP methodology has evolved rapidly, with the most recent approaches relying heavily on deep learning. However, relatively few predictors fully exploit protein language models (PLMs), despite their strong performance across a broad range of protein-centric tasks. To address this gap, we developed PATHOS (Radjasandirane et al., 2026), which integrates embeddings from an optimized pair of PLMs (ESM C 600M and Ankh2 Large) together with complementary biological features, including phylogenetic probabilities, allele frequency, and protein annotations. These inputs are fused through a fully connected architecture to yield pathogenicity predictions. Across clinical benchmarks, PATHOS outperforms leading tools, reaching a Matthews correlation coefficient (MCC) of 0.591 on a carefully curated clinical dataset and 0.826 on ClinVar. Case studies on the progesterone receptor and the KCNQ1 ion channel further show that PATHOS can pinpoint functionally critical regions and known pathogenic variants that are missed by other state-of-the-art predictors, including AlphaMissense.
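For readers less familiar with the MCC values quoted above, the following minimal sketch shows how the coefficient is computed from binary pathogenic/benign calls. It is a generic illustration, not the PATHOS evaluation code: the function names and the toy labels are hypothetical, and the convention of returning 0.0 when the denominator vanishes is an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = pathogenic, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column of the
        # confusion matrix is empty and the ratio is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, purely illustrative (not taken from any benchmark).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))
print(f"MCC = {mcc:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than accuracy.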

More recently, we developed HECATE (Radjasandirane et al., in preparation), a multi-task deep-learning framework that jointly predicts missense-variant pathogenicity, stability change (ΔΔG), and functional impact, aiming to provide more mechanistically informative interpretation. HECATE takes paired wild-type and mutant embeddings from ESM C 600M and Ankh2 Large and processes them through a siamese (weight-shared) encoder followed by three task-specific heads. For pathogenicity, the model can additionally incorporate external biological features (e.g., allele frequency and protein/network annotations). We report a large-scale application of HECATE, with predictions spanning ~140 million possible amino-acid substitutions across ~17,690 human proteins.
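The siamese, multi-head layout described above can be sketched in a few lines of dependency-free Python. This is only a toy forward pass under stated assumptions, not the HECATE implementation: the class name, the layer sizes, the random weights, the single-layer encoder, and the choice to combine the two encodings by concatenating them with their difference are all hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError("input size does not match layer size")
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(rng, n_in, n_out):
    """Random weights for illustration only; a real model would be trained."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class SiameseMultiTaskSketch:
    """Hypothetical toy model: shared encoder plus three task heads."""

    def __init__(self, embed_dim, hidden_dim, extra_dim=0, seed=0):
        rng = random.Random(seed)
        self.extra_dim = extra_dim
        # One encoder, applied to both wild-type and mutant embeddings.
        self.encoder = init_layer(rng, embed_dim, hidden_dim)
        pair_dim = 3 * hidden_dim  # [h_wt, h_mut, h_mut - h_wt] (assumed)
        # Only the pathogenicity head receives the extra biological features.
        self.patho_head = init_layer(rng, pair_dim + extra_dim, 1)
        self.ddg_head = init_layer(rng, pair_dim, 1)
        self.func_head = init_layer(rng, pair_dim, 1)

    def encode(self, embedding):
        return relu(linear(embedding, *self.encoder))

    def forward(self, wt_emb, mut_emb, extra=None):
        h_wt = self.encode(wt_emb)
        h_mut = self.encode(mut_emb)
        pair = h_wt + h_mut + [m - w for m, w in zip(h_mut, h_wt)]
        extra = list(extra) if extra is not None else [0.0] * self.extra_dim
        return {
            "pathogenicity": sigmoid(linear(pair + extra, *self.patho_head)[0]),
            "ddg": linear(pair, *self.ddg_head)[0],  # regression output
            "functional_impact": sigmoid(linear(pair, *self.func_head)[0]),
        }


# Tiny made-up inputs; real PLM embeddings have far more dimensions.
model = SiameseMultiTaskSketch(embed_dim=8, hidden_dim=4, extra_dim=2)
wt = [0.1] * 8
mut = [0.2] * 8
out = model.forward(wt, mut, extra=[0.01, 1.0])
print(sorted(out))
```

Sharing one encoder between the wild-type and mutant inputs means both sequences are mapped into the same representation space, so the heads can reason about their difference rather than about two unrelated encodings.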

Keywords: Diseases, Variants, Pathogenicity, ClinVar, AlphaMissense

Acknowledgement: This work was supported by the France 2030 program through the Idex Université Paris Cité (ANR-18-IDEX-0001_GREx).