Versatile Multi-Sample Single Cell RNA-Seq Pipeline with Extensive Customization Options

Aleksandar Daničić*, Nevena Vukojičić, Aleksandar Baburski and Ana Mijalković Lazić

Velsera, Belgrade, Serbia

aleksandar.danicic [at] velsera.com

Abstract

Single-cell RNA sequencing (scRNA-seq) has become the state-of-the-art approach for classifying cell subpopulations and characterizing cell heterogeneity. It makes it possible to address medical questions such as the role of rare cell populations in disease progression and therapeutic resistance.

Presented here is “Multi-Sample Clustering and Gene Marker Identification with Seurat 4.1.0”, a highly customizable workflow for single-cell data analysis implemented in the Common Workflow Language (CWL). The workflow consists of three steps: 1. Loading scRNA-seq Expression Datasets, 2. Quality Control and Preprocessing, and 3. Clustering and Identification of Gene Markers. It supports gene-cell count matrices generated by several commonly used quantifiers (Cell Ranger counts, STARsolo, Salmon Alevin, Kallisto BUStools). The matrices may come from one or more single-cell datasets, from different batches, or from one or more samples combined in a single SingleCellExperiment object.

Each workflow step offers several options, allowing a high level of customization. Quality control can be performed manually or automatically, and several methods are available for normalization (LogNormalize, Deconvolution, SCnorm and Linnorm) and for batch effect correction (Seurat and Harmony). The workflow uses Seurat’s graph-based approach for clustering and allows multiple clustering resolutions to be selected. Gene markers are identified at the cluster level by a differential expression analysis step, using one of several tests (wilcox, bimod, roc, and DESeq2).
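As an illustration of the simplest of the normalization options named above, the following is a minimal Python sketch of the arithmetic behind LogNormalize: each cell's counts are divided by that cell's total, multiplied by a scale factor (10,000 is Seurat's default), and transformed with log1p. This is not the workflow's own code, which runs Seurat in R. The function name `log_normalize`, the cells-as-rows layout, and the toy counts are assumptions made for the example.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize raw counts, one row per cell (hypothetical helper).

    counts: list of cells, each a list of raw counts per gene.
    Returns a list of the same shape with log1p(count / cell_total * scale_factor).
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Assumption for this sketch: an empty cell stays all zeros.
            normalized.append([0.0] * len(cell))
            continue
        normalized.append(
            [math.log1p(c / total * scale_factor) for c in cell]
        )
    return normalized


# Toy data: the second cell was sequenced ten times deeper than the first,
# but both have the same relative expression profile.
counts = [
    [1, 0, 3],
    [10, 0, 30],
    [0, 0, 0],
]
norm = log_normalize(counts)
```

In this toy example the first two cells receive the same normalized values despite a tenfold difference in sequencing depth, which is the purpose of scaling by each cell's total before the log transform.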

To illustrate the use of this workflow in a standard single-cell analysis, two open-access datasets of human pancreatic cells were processed. Four clustering resolutions were employed to achieve different degrees of granularity, after which cluster-specific marker genes were identified.

The workflow is available on the Cancer Genomics Cloud (CGC), powered by Seven Bridges and funded by the NCI. CGC is a flexible cloud platform that ensures fast execution, scalability and reproducibility of results, and offers over 1000 bioinformatics workflows. To allow researchers to use this analysis as a guideline, it has been made available as a public project.

Keywords: single-cell transcriptomics, cloud computing