Exploiting the linearity of joint embedding spaces to mine cancer precision medicine knowledge

Noël Malod-Dognin* and Nataša Pržulj

Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates

noel.malod [at] mbzuai.ac.ae

Abstract

Cancer is a leading cause of death worldwide. It is a multifactorial disease that results from multiple alterations, which are only partially captured by any biological data type studied in isolation (e.g., somatic mutation profiles or dysregulated genes). Hence, novel multi-omics data-fusion and analytics methods are needed to holistically mine the wealth of available clinical and molecular data and enable new discoveries in cancer precision medicine.

Network embedding is a cornerstone of the modelling and analysis of such complex biological datasets. In natural language processing (NLP), word embeddings were observed to capture semantic relationships linearly, which allows them to be mined efficiently with simple linear vector operations rather than with computationally intensive, black-box downstream analysis methods. The same observation has since been made in other fields, yielding the so-called linear representation hypothesis in vision language models. The question is whether this observation holds, and can be exploited, in the context of multi-omics data-fusion for cancer precision medicine.
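As an illustration of what "simple linear vector operations" means for word embeddings, the following minimal sketch answers an analogy query by vector arithmetic and cosine similarity. The three-dimensional vectors are hand-made toy values chosen for this example, not learned embeddings, and the function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c] ("a is to b as c is to ?")."""
    query = emb[b] - emb[a] + emb[c]
    # Exclude the query words themselves, as is usual for analogy tasks.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))

# Toy, hand-made vectors (assumption): dimensions are (royalty, male, female).
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

result = analogy(toy, "man", "king", "woman")
print(result)
```

In a linearly organized space, the same kind of arithmetic could in principle be applied to other entity types, which is the property the question above asks about.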

To assess this, we defined a framework based on Non-negative Matrix Tri-Factorization (NMTF) that jointly embeds the multi-omics datasets of all cancer patients from TCGA (somatic mutation profiles and gene expression profiles of 9,134 cancer patients from 33 TCGA projects), the protein interactions from BioGRID and the drug data from DrugBank into the same embedding spaces. By comparing the performance of linear and non-linear methods on downstream classification and clustering tasks, we demonstrate that our joint embedding spaces are indeed linearly organized. We then define linear vector operations that enable the prioritization of new pan-cancer or cancer-specific genes and drug repurposing candidates, paving the way to new cancer therapies.
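For readers unfamiliar with Non-negative Matrix Tri-Factorization, the sketch below decomposes a single non-negative matrix X into three non-negative factors, X ≈ G1 S G2^T, using standard multiplicative updates for the Frobenius loss. It is only an illustration under stated assumptions: the framework described above fuses several datasets into shared embedding spaces, whereas this example factorizes one random matrix, and the function name, ranks and iteration count are hypothetical choices.

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Factorize non-negative X (n x m) as G1 (n x k1) @ S (k1 x k2) @ G2.T (k2 x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random strictly positive initialization (assumption: any positive start works).
    G1 = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G2 = rng.random((m, k2)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        G1 *= (X @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (X.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ X @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
        # Track the Frobenius reconstruction error after each sweep.
        errors.append(float(np.linalg.norm(X - G1 @ S @ G2.T)))
    return G1, S, G2, errors

# Toy data (assumption): a random 30 x 20 non-negative matrix.
X = np.random.default_rng(1).random((30, 20))
G1, S, G2, errors = nmtf(X, k1=4, k2=3)
print(errors[0], errors[-1])
```

In this single-matrix setting, the rows of G1 and G2 act as low-dimensional embeddings of the row and column entities, and S relates the two sets of latent dimensions.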

Keywords: AI/ML, Multi-omics Data-Fusion, Precision Medicine