Advancing Supervised Machine Learning for scRNA-seq Data Analysis

Xin Lin1*, Minjie Lyu1, Tian-yi Qiu2, Guanglan Zhang3, Sen Lin1, Lou Chitkushev3 and Vladimir Brusić1,3

1 Smart Medicine Laboratory, University of Nottingham, Ningbo, China

2 Institute of Clinical Science, Zhongshan Hospital; Fudan University, Shanghai, China

3 Metropolitan College, Boston University, Boston, USA

xin.lin [at] nottingham.edu.cn and vladimir.brusic [at] nottingham.edu.cn

Abstract

The exponential growth of single-cell transcriptomic (scRNA-seq) data presents a significant challenge for their analysis. Current best practices rely on unsupervised clustering, but applications of supervised machine learning (ML) to scRNA-seq data have increased in recent years. Compared to unsupervised clustering, the main advantages of supervised ML are higher classification accuracy and more reproducible and reliable results. However, single-cell transcriptomic technologies are evolving rapidly, and changes in biological sample processing and technical differences between successive experimental measurements limit the reproducibility of results. The lack of high-quality standardized reference datasets increases the risk of model overfitting, reduces model generalization, and makes benchmarking of supervised ML algorithms challenging.

Advancing scRNA-seq applications requires high-quality, annotated, standardized datasets. To support the deployment of supervised ML in this field, we developed a database of single-cell transcriptomic reference datasets for healthy human peripheral blood mononuclear cells (PBMC). We collected data on over two million single cells from multiple public data sources and applied advanced cell annotation methods to create multi-annotation labels in the healthy PBMC reference dataset. Each cell has labels designating cell type and subtypes, cell cycle, and cell state, along with an assigned degree of belief. The annotations are based on a multi-dimensional cell ontology that we designed. The scRNA-seq data in our database were converted into a standardized format using a defined protocol that enables their direct use in supervised ML tasks. The data standardization pipeline and cell annotation tools are deployed within the database, which is publicly accessible as a web server for the study of PBMC at the single-cell level.
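
As an illustration of how multi-annotation labels with a degree of belief might feed a supervised ML task, the following is a minimal sketch and not the database's actual schema. The class names, field names, the [0, 1] belief scale, the per-dimension belief values, the 0.9 threshold, and the two toy cells are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One label on one annotation dimension (hypothetical structure)."""
    label: str
    belief: float  # assumed degree of belief on a [0, 1] scale


@dataclass(frozen=True)
class CellRecord:
    """One cell with a label on each annotation dimension named in the abstract."""
    cell_id: str
    cell_type: Annotation
    cell_subtype: Annotation
    cell_cycle: Annotation
    cell_state: Annotation


def training_labels(cells, dimension, min_belief=0.9):
    """Return {cell_id: label} for cells whose annotation on `dimension`
    has a degree of belief of at least `min_belief`."""
    labels = {}
    for cell in cells:
        annotation = getattr(cell, dimension)
        if annotation.belief >= min_belief:
            labels[cell.cell_id] = annotation.label
    return labels


# Toy records with invented values, for illustration only.
cells = [
    CellRecord("c1", Annotation("T cell", 0.98), Annotation("CD4+ T cell", 0.91),
               Annotation("G1", 0.80), Annotation("resting", 0.75)),
    CellRecord("c2", Annotation("B cell", 0.95), Annotation("naive B cell", 0.60),
               Annotation("G1", 0.85), Annotation("resting", 0.70)),
]

type_labels = training_labels(cells, "cell_type")
subtype_labels = training_labels(cells, "cell_subtype")
```

Filtering by belief per dimension lets a classifier for a coarse task (cell type) train on more cells than one for a finer task (cell subtype), where fewer annotations may be held with high confidence.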

Keywords: supervised machine learning, single-cell transcriptome, transcriptome database, cell annotation.

Acknowledgements: This work was supported by the University of Nottingham Ningbo China High-Flyer Scholarship, code: 2106HFB.