Spectral Clustering for Transcriptomics Data: an Approximate Column Sampling Approach on the GPU

Marko Mišić1*, Lazar Smiljković and Predrag Obradović1

1 University of Belgrade, School of Electrical Engineering, Belgrade, Serbia

marko.misic [at] etf.bg.ac.rs

Abstract

Clustering is one of the central methods in bioinformatics processing pipelines. It is used to capture hidden patterns in high-dimensional data, especially at the genomic level where large quantities of gene expression data have been produced with the advent of high-throughput sequencing technologies. Clustering reveals natural structure in the data, helping in understanding gene function, regulations, and cellular processes, cell assignment and subtyping, and other downstream analyses.

Spectral clustering has been successfully used to discover structure and patterns in data for bioinformatics research in the past with short execution times and high accuracy. However, it can be computationally expensive for large-scale datasets that are present in contemporary single-cell RNA (sc-RNA) and spatial transcriptomics (ST) applications. Numerous approximate spectral clustering algorithms have been proposed in the open literature to improve efficiency and scalability, while maintaining clustering quality.

Spectral clustering uses linear algebra operations which can be efficiently implemented on modern central processing units (CPUs) and graphics processing units (GPUs). In our work, we implemented an approximate, parallel spectral clustering method based on column sampling and the Nystrom method on the GPU. Our implementation significantly reduces both the computational and memory requirements of the previous methods, enabling the method to handle datasets of more than 106 samples and thousands of features.

We evaluated our approach using general datasets containing images of handwritten digits, as well as several datasets from single-cell and spatial transcriptomics sequencing of mouse brain. Datasets from sc-RNA and ST domains were characterized by a modestly large number of points and a relatively higher number of clusters compared to the general datasets. We compared our solution to the widely used Leiden algorithm in terms of speedup, ARI and NMI score. Our results showed significant time advantage and scalability over Leiden algorithm of up to a hundred times, albeit with to some extent lower ARI and NMI scores.

Keywords: bioinformatics, approximate spectral clustering, GPU programming, transcriptomics

Acknowledgement: This work was financially supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia under contract numbers: 451-03-65/2024-03/200103 and 451-03-66/2024-03/200103, and Complete Genomics, part of MGI Tech, under contract number: 1847/2022-12. The authors gratefully acknowledge the financial support.