Deeply phenotyped and multi-omics data integration with the Human Phenotype Project

Tarmo Nurmi*, Noël Malod-Dognin, Luiz Maniero, Eduardo da Veiga Beltrame and Nataša Pržulj

School of Digital Public Health, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates

tarmo.nurmi [at] mbzuai.ac.ae

Abstract

Large-scale, deeply phenotyped and multi-omics datasets are becoming increasingly accessible through initiatives such as the UK Biobank, the Human Phenotype Project, Our Future Health, the Danish Precision Health Initiative DELPHI, All of Us, and the Emirati Genome Program. These rich datasets hold great potential for biomedical research. However, individual data types are often studied in isolation, or statistical associations are mined between only two data types at a time. In contrast, multifaceted datasets enable biomedical insights that can be reached only by considering multiple data types jointly. Yet integrating these large-scale, highly heterogeneous data into a holistic view for effective data mining remains a challenge. An effective analysis framework should flexibly accommodate diverse data types, from molecular and genetic data to phenotypic features and electronic health records, as well as the linkages between them. In addition, multiple data sources should be combined to take into account existing knowledge about human biology, molecular interactions, and drug properties. Recently, flexible data integration and embedding frameworks of this kind, based on joint non-negative matrix tri-factorization (NMTF), have achieved success in identifying novel gene-disease associations and candidate molecules for drug repurposing from a wide variety of heterogeneous data. These frameworks represent each data type as a matrix and combine multiple matrices into one model.
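To make the idea of combining several matrices into one NMTF model concrete, the following is a minimal sketch under stated assumptions, not the framework described in this abstract. It assumes only two non-negative input matrices, a gene-by-participant matrix X1 and a participant-by-drug matrix X2, which share the participant factor Gp, and it uses plain multiplicative updates for the unconstrained squared-error objective ||X1 - Gg S1 Gp^T||^2 + ||X2 - Gp S2 Gd^T||^2. The function name joint_nmtf, the factor names, the ranks, and the random toy data are all hypothetical.

```python
import numpy as np

def joint_nmtf(X1, X2, k_gene, k_part, k_drug, n_iter=200, eps=1e-9, seed=0):
    """Toy joint NMTF: X1 ~ Gg S1 Gp^T and X2 ~ Gp S2 Gd^T with a shared Gp.

    Illustrative only: plain multiplicative updates, no orthogonality
    constraints and no network regularisation.
    """
    rng = np.random.default_rng(seed)
    n_gene, n_part = X1.shape
    n_drug = X2.shape[1]
    # Non-negative random initialisation of all factors.
    Gg = rng.random((n_gene, k_gene))
    Gp = rng.random((n_part, k_part))
    Gd = rng.random((n_drug, k_drug))
    S1 = rng.random((k_gene, k_part))
    S2 = rng.random((k_part, k_drug))

    def loss():
        # Sum of squared Frobenius reconstruction errors of both matrices.
        return (np.linalg.norm(X1 - Gg @ S1 @ Gp.T) ** 2
                + np.linalg.norm(X2 - Gp @ S2 @ Gd.T) ** 2)

    history = [loss()]
    for _ in range(n_iter):
        # Gene factor: only X1 contributes.
        Gg *= (X1 @ Gp @ S1.T) / (Gg @ S1 @ (Gp.T @ Gp) @ S1.T + eps)
        # Shared participant factor: contributions from X1 and X2 are summed.
        num = X1.T @ Gg @ S1 + X2 @ Gd @ S2.T
        den = (Gp @ S1.T @ (Gg.T @ Gg) @ S1
               + Gp @ S2 @ (Gd.T @ Gd) @ S2.T + eps)
        Gp *= num / den
        # Drug factor: only X2 contributes.
        Gd *= (X2.T @ Gp @ S2) / (Gd @ S2.T @ (Gp.T @ Gp) @ S2 + eps)
        # Middle (compressed) matrices linking the latent spaces.
        S1 *= (Gg.T @ X1 @ Gp) / ((Gg.T @ Gg) @ S1 @ (Gp.T @ Gp) + eps)
        S2 *= (Gp.T @ X2 @ Gd) / ((Gp.T @ Gp) @ S2 @ (Gd.T @ Gd) + eps)
        history.append(loss())

    factors = {"Gg": Gg, "Gp": Gp, "Gd": Gd, "S1": S1, "S2": S2}
    return factors, history

# Toy data standing in for expression (genes x participants) and
# medication use (participants x drugs); values are random, not real.
toy_rng = np.random.default_rng(1)
X1_toy = toy_rng.random((30, 20))
X2_toy = (toy_rng.random((20, 10)) < 0.3).astype(float)
factors, history = joint_nmtf(X1_toy, X2_toy, k_gene=4, k_part=3, k_drug=2, n_iter=100)
print(f"loss before: {history[0]:.2f}, after: {history[-1]:.2f}")
```

Because Gp appears in both reconstruction terms, the participant embedding is shaped by gene expression and medication use at the same time, which is the sense in which the factorization is joint. Further data types would enter as additional matrices or as constraints on the factors.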

In this work, we constructed an NMTF-based multi-omics integration framework on the deeply phenotyped multi-omics dataset of the Human Phenotype Project, combined with external data sources. The framework integrates gene expression data (RNA-Seq), medication use, protein-protein interaction data, drug-target interaction data, and drug chemical similarity data, and it can be extended with additional data types such as microbiome data. The gene-participant-drug embeddings produced by the framework, together with participant electronic health records, have implications for uncovering novel gene-disease associations, patient stratification, drug repurposing, and the exploration of semantic relationships between different biomedical entities using linear vector operations.
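As an illustration of what exploring semantic relationships with linear vector operations can look like, the sketch below ranks entities by cosine similarity to a query vector built by adding and subtracting embeddings. It assumes that all entities have already been placed in one shared latent space; the entity names (gene_A, drug_X, and so on), the three-dimensional vectors, and the helper functions cosine and analogy are hypothetical and are not taken from the framework's output.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(emb, a, b, c):
    """Rank the remaining entities by similarity to emb[a] - emb[b] + emb[c]."""
    query = emb[a] - emb[b] + emb[c]
    candidates = [name for name in emb if name not in (a, b, c)]
    scored = [(name, cosine(query, emb[name])) for name in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical embeddings in a shared 3-dimensional space (toy values).
emb = {
    "gene_A": np.array([1.0, 0.0, 1.0]),
    "gene_B": np.array([1.0, 0.0, 0.0]),
    "drug_X": np.array([0.0, 1.0, 0.0]),
    "drug_Y": np.array([0.0, 1.0, 1.0]),
    "drug_Z": np.array([1.0, 1.0, 0.0]),
}

# "gene_A is to gene_B as which entity is to drug_X?"
ranking = analogy(emb, "gene_A", "gene_B", "drug_X")
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy setting the query vector coincides with drug_Y, so it ranks first. With real embeddings the ranking would only suggest candidate relationships for further validation.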

Keywords: Multi-omics, deep phenotyping, data integration