Fine-tuning RNA-seq alignment parameters for the Danio rerio genome

Jelena Kušić-Tišma*, Mila Ljujić, Bojan Ilić, Aleksandra Divac Rankov

Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Belgrade, Serbia

jkusic [at] imgge.bg.ac.rs

Abstract

In RNA-seq analysis, the choice of splice-aware aligner for mapping raw reads to the reference genome is largely a matter of personal preference. However, the distribution of reads among the different mapping metrics strongly depends on appropriate parameter selection.

In this study, we conducted a thorough analysis of the impact of parameters on alignment metrics using the STAR aligner on 12 samples of PE150 RNA-seq data from zebrafish. When setting parameters, we considered factors such as the insert size distribution of the library data pre-processed with fastp and the unique features of the reference genome (e.g., gene density and intron size distribution).
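The abstract does not list the specific STAR options that were changed. As a rough illustration of how a reference genome feature such as intron size distribution could inform them, the sketch below derives intron lengths from GTF exon records and proposes an upper bound on intron size. The helper names, the 99.9th percentile cut-off, and the mapping of that value onto STAR's `--alignIntronMax` and `--alignMatesGapMax` options are assumptions made for illustration, not the authors' settings.

```python
import math
from collections import defaultdict


def intron_lengths(gtf_lines):
    """Collect intron lengths from GTF 'exon' records, grouped by transcript_id."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        transcript_id = None
        for part in fields[8].split(";"):
            key, _, value = part.strip().partition(" ")
            if key == "transcript_id":
                transcript_id = value.strip('"')
                break
        if transcript_id is None:
            continue
        exons[transcript_id].append((int(fields[3]), int(fields[4])))

    lengths = []
    for coords in exons.values():
        coords.sort()
        # An intron is the gap between two consecutive exons of one transcript.
        for (_, end_prev), (start_next, _) in zip(coords, coords[1:]):
            gap = start_next - end_prev - 1
            if gap > 0:
                lengths.append(gap)
    return lengths


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_star_args(lengths, q=99.9):
    """Hypothetical helper: cap intron and mate gap sizes at the q-th percentile."""
    limit = percentile(lengths, q)
    return ["--alignIntronMax", str(limit), "--alignMatesGapMax", str(limit)]
```

A call such as `suggest_star_args(intron_lengths(open("annotation.gtf")))` would return arguments that could be appended to a STAR command line; the file name here is a placeholder. The insert size distribution reported by fastp could be summarised with the same `percentile` helper.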

The average number of input reads per sample was 25,097,912. Adjusting parameters led to significant improvements in mapping metrics, notably an increase in the percentage of uniquely mapped reads from 86.08% to 94.19%. A minor rise was observed in the percentage of reads mapped to multiple loci, with figures of 4.72% and 3.62% respectively. Quantification of reads by genomic origin revealed an increase in uniquely mapped reads allocated to exonic regions, on average by 2,034,563 reads per sample.

In addition, we compared the metrics obtained with default and adjusted parameters in multi-sample 2-pass mapping, which is important when analysing differential transcript usage. In the 2-pass mode, we employed 1st-pass junction files that were purged of probable false positives, such as junctions within the mitochondrial genome or those crossed by multi-mapping reads. Qualimap-rna analysis of 2-pass mapping revealed an increase in the usage of novel splice sites, as expected. However, default parameters showed higher rates of multimapping, noFeature, or ambiguous reads compared to adjusted parameters.
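The junction purging step can be sketched as a filter over STAR's `SJ.out.tab` output, where column 1 is the chromosome, column 7 the number of uniquely mapping reads crossing the junction, and column 8 the number of multi-mapping reads. The function name, the mitochondrial contig names (`MT`, `chrM`), and the reading of "crossed by multi-mapping reads" as "supported by no uniquely mapping read" are assumptions; the exact criteria used in the study are not given in the abstract.

```python
def filter_junctions(lines, excluded_chroms=("MT", "chrM"), min_unique=1):
    """Keep SJ.out.tab rows that are off the excluded contigs and have
    at least `min_unique` uniquely mapping reads crossing the junction."""
    kept = []
    for line in lines:
        row = line.rstrip("\n")
        fields = row.split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        chrom = fields[0]
        unique_reads = int(fields[6])  # column 7: uniquely mapping reads
        if chrom in excluded_chroms:
            continue  # assumed mitochondrial contig names
        if unique_reads < min_unique:
            continue  # junction supported only by multi-mapping reads
        kept.append(row)
    return kept
```

The filtered rows from all samples could then be written to a single file and supplied to the second pass through STAR's `--sjdbFileChrStartEnd` option.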

Running alignment with default options generally performs well initially. However, proper parameter selection tailored to the characteristics of the raw library data and reference genome significantly improves alignment metrics, especially for model organisms like zebrafish.

Keywords: RNA-seq analysis, STAR aligner, mapping parameters, zebrafish

Acknowledgement: This work was funded by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia (Contract No. 451-03-66/2024-03/200042).