INMTD: integrative clustering with 2D genotypes and 3D facial images in the presence of confounders

Zuqi Li1,2*, Sam F. L. Windels, Seth M. Weinberg, Mary L. Marazita, Susan Walsh, Mark D. Shriver, Noël Malod-Dognin, David W. Fardo, Peter Claes, Nataša Pržulj, Kristel Van Steen1,3

1 Department of Human Genetics, KU Leuven, Leuven, Belgium

2 Medical Imaging Research Center, UZ Leuven, Leuven, Belgium

3 GIGA-R Medical Genomics, University of Liège, Liège, Belgium

zuqi.li [at] kuleuven.be

Abstract

Integrating genomic data with facial images enables a more comprehensive, multi-view clustering of individuals. Among the approaches to multi-view clustering, integrative nonnegative matrix tri-factorization (NMTF) is advantageous because it learns low-rank embeddings of both samples and features and keeps these representations interpretable. Incorporating 3D images is challenging for matrix-based methods, and this is where nonnegative Tucker decomposition (NTD), which factorizes higher-order tensors, comes in. In this work, we introduce INMTD, a novel multi-view clustering method based on both NMTF and NTD that integrates genotypes and 3D facial images to generate unconfounded subgroups of individuals. Such unconfounding is needed because clusterings can otherwise be driven by unwanted factors (i.e. confounders). We applied our method to real-life multi-view data on 4680 individuals from a US cohort, for which several potential confounders, such as age, sex and height, were also available. Once these factors are removed, one would expect population structure to be the prevailing driver of heterogeneity. INMTD generates three embedding matrices, for 1) samples, 2) SNPs and 3) facial landmarks, and we investigated the biological relevance of each in several ways. For 1), most sample embedding vectors were statistically significantly associated with ancestry axes or confounders. By removing the confounded vectors from the sample embedding, we derived an unconfounded clustering with better internal quality and a stronger association with population structure; the genetic and facial annotations of each derived subgroup highlighted different physiological or morphological characteristics. For 2), clusters of embedded SNPs showed enrichment for genes and Gene Ontology terms. For 3), the segmentation of the facial embedding achieved a higher cophenetic correlation than earlier reports on the same data. Projecting SNPs and facial landmarks into the sample embedding space revealed known and novel SNP-face biological relationships.
In conclusion, INMTD can effectively integrate omics data and 3D images for unconfounded clustering with biologically meaningful interpretation.
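
The confounder-removal step described above can be illustrated with a minimal sketch. The abstract does not specify which association test or cutoff was used, so the Pearson correlation filter, the threshold of 0.3, the function name `drop_confounded_columns` and the toy data below are all assumptions made for illustration, not the INMTD implementation.

```python
import numpy as np

def drop_confounded_columns(embedding, confounders, threshold=0.3):
    """Drop embedding columns that correlate strongly with any confounder.

    Hypothetical helper: a plain Pearson correlation cutoff stands in for
    whatever statistical test the authors actually applied.
    """
    n = embedding.shape[0]
    # Standardize columns (population std) so that E.T @ C / n is Pearson's r.
    E = (embedding - embedding.mean(axis=0)) / embedding.std(axis=0)
    C = (confounders - confounders.mean(axis=0)) / confounders.std(axis=0)
    corr = E.T @ C / n  # shape: (embedding columns, confounders)
    keep = np.flatnonzero(np.abs(corr).max(axis=1) < threshold)
    return embedding[:, keep], keep

# Toy data (assumed): one confounder (age) and one ancestry axis.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
ancestry = rng.normal(size=n)
raw = np.column_stack([
    ancestry + 0.1 * rng.normal(size=n),  # tracks ancestry
    age + 0.1 * rng.normal(size=n),       # tracks the confounder
    rng.normal(size=n),                   # unrelated to both
])
embedding = raw - raw.min(axis=0)  # shift to nonnegative, as in NMTF embeddings

filtered, keep = drop_confounded_columns(embedding, age[:, None])
print("kept columns:", keep.tolist())
```

The retained columns could then be passed to any clustering routine; only the age-tracking column is expected to be filtered out in this toy setting.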

Keywords: multi-view clustering, matrix factorization, confounder, population genetics, facial images

Acknowledgement: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreements No 813533 (MLFPM) and No 860895 (TranSYS).