Development and validation of a reproducible RNA-seq pipeline for transcriptomic analysis

Nikola Jocic*, Anita Skakic, Marina Parezanovic, Marina Andjelkovic, Nina Stevanovic, Sara Stankovic, Kristel Klaassen, Vesna Spasovski, Milena Ugrin, Jovana Komazec, Kristina Grujic and Maja Stojiljkovic

Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Serbia

nikola.jocic [at] imgge.bg.ac.rs

Abstract

RNA sequencing (RNA-seq) has become a standard approach for transcriptomic analysis, yet building reproducible, well-documented pipelines remains a practical challenge for many research groups. Publicly available automated pipelines such as nf-core/rnaseq offer robust solutions, but provide limited opportunity to develop a deep understanding of each tool used in the workflow. Here we describe the development and validation of a custom, modular RNA-seq pipeline built from widely used bioinformatics tools, designed to be reproducible and adaptable to different experimental contexts.

The pipeline was developed and tested using a publicly available RNA-seq dataset (SRP139131) from human T cells stimulated with three anti-CD3 antibodies (OKT3, FvFcM, FvFcR), and an untreated control (Illumina technology, 150 bp paired-end reads). The pipeline consisted of two parts. The first part, implemented in Bash, covered quality control (FastQC, MultiQC), adapter trimming (fastp), splice-aware alignment to the human reference genome GRCh38 (HISAT2), post-alignment quality control (RSeQC), and gene-level quantification (featureCounts). The second part, implemented in R, performed differential expression analysis (DESeq2), followed by functional enrichment analysis using Gene Ontology and KEGG pathway databases (clusterProfiler). We validated pipeline performance by comparing the generated count matrix against that produced by the nf-core/rnaseq pipeline (RSEM quantification) using per-sample Spearman correlation and principal component analysis (PCA).

Per-sample Spearman correlation between count matrices ranged from 0.884 to 0.896 across all detected genes, and improved to 0.950–0.952 when restricted to expressed genes with a count above 5. PCA revealed that the first principal component (32.9% variance explained) separated the two pipelines, reflecting a systematic quantification difference attributable to algorithmic differences between featureCounts and RSEM, while the second principal component (16.3%) captured biological variation consistently across both pipelines, with clear separation of conditions (untreated, FvFcM, FvFcR, OKT3).

The custom pipeline produced biologically consistent results comparable to a well-established reference pipeline, with observed count differences consistent with known tool-specific quantification behaviour. The modular design facilitates adaptation to different experimental designs and provides a framework suitable for researchers seeking to understand each analytical step.

Keywords: RNAseq, pipeline, bioinformatics, transcriptomics

Acknowledgement: This research was supported by the Horizon Europe Project BRIDGING-RD, HORIZON-WIDERA-2023-ACCESS-02, Grant Agreement N°101160079.