Benchmarking of preprocessing steps in the Sarek pipeline on an MGI-sequenced standard human genome

Marina Jelovac¹*, Djordje Pavlovic¹, Nevena Vucinic², Novak Martinovic², Nikola Kotur¹, Aleksandra Vitkovac¹, Saša Todorović¹, Iva Sabolic³, Aleksandar Mihajlovic², Branka Zukic¹ and Biljana Stankovic¹

¹Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Serbia

²Persida doo

³Labena doo

marina.jelovac [at] imgge.bg.ac.rs

Abstract

The expanding availability of next-generation sequencing (NGS) technologies has enabled the generation of massive genomic datasets. In addition to challenges related to data storage and processing, ensuring reproducibility of genome analyses remains critical. Workflow management systems such as Nextflow have been developed to address these issues. Sarek, a freely available nf-core pipeline for detecting genetic variants in whole-genome sequencing data, integrates widely accepted bioinformatics tools considered the gold standard in life sciences. However, NGS protocols differ in library preparation methods, which may influence the necessity of certain analytical steps. Notably, MGI library preparation does not involve PCR amplification, potentially reducing the need for duplicate marking and base quality score recalibration (BQSR). Eliminating these computationally intensive steps could significantly reduce analysis time and resource usage.

The aim of this study was to assess the necessity of alignment preprocessing steps and evaluate the impact of variant caller choice when applying the Sarek pipeline to MGI sequencing data.

The Sarek pipeline was adapted to run multiple analysis strategies on the NA12878/HG001 human genome sample, sequenced using MGI technology and provided by the Genome in a Bottle consortium. Reads were aligned to the GRCh38 reference genome using bwa-mem. Preprocessing steps included adapter trimming (fastp), duplicate marking (GATK MarkDuplicatesSpark), and base quality score recalibration (GATK BaseRecalibrator and ApplyBQSR), applied individually, in combination, or omitted. Variant calling was performed using GATK HaplotypeCaller and DeepVariant. The resulting variant call files were benchmarked against a known truth set using hap.py, and performance was evaluated using F1 scores for SNPs and INDELs.
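The strategy design described above can be pictured as a grid of on/off switches for each preprocessing step, crossed with the two variant callers. The following is only a minimal sketch of that idea: the step labels and the `enumerate_strategies` helper are hypothetical names chosen for illustration, and the full grid is an assumption, since the abstract does not state that every combination was run.

```python
from itertools import product

# Illustrative labels for the optional preprocessing steps named in the abstract.
# These are hypothetical identifiers, not Sarek parameter names.
STEPS = ("trimming", "duplicate_marking", "bqsr")
CALLERS = ("HaplotypeCaller", "DeepVariant")


def enumerate_strategies(steps=STEPS, callers=CALLERS):
    """Return every on/off combination of steps, paired with each caller."""
    strategies = []
    for flags in product((False, True), repeat=len(steps)):
        # Keep only the steps that are switched on in this combination.
        enabled = tuple(step for step, on in zip(steps, flags) if on)
        for caller in callers:
            strategies.append({"steps": enabled, "caller": caller})
    return strategies


if __name__ == "__main__":
    for strategy in enumerate_strategies():
        label = "+".join(strategy["steps"]) or "none"
        print(f"{label:40s} {strategy['caller']}")
```

With three optional steps and two callers, the full grid contains 16 candidate strategies, including the one that omits all preprocessing.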

No significant differences in variant calling performance were observed between analyses that included duplicate marking and BQSR and those that omitted them. DeepVariant consistently outperformed HaplotypeCaller, although the differences were modest. For INDEL detection, DeepVariant achieved an F1 score of 99.5% compared to 99.1% for HaplotypeCaller, while for SNP detection it reached 99.7% versus 99.2%.
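For readers less familiar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard formula computed from true-positive, false-positive and false-negative counts. It is a simplification of what a benchmarking tool reports, and the counts used in the example are hypothetical, not values from this study.

```python
def f1_score(tp, fp, fn):
    """Compute F1 from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # fraction of called variants that are correct
    recall = tp / (tp + fn)     # fraction of truth-set variants that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(f"F1 = {f1_score(tp=995, fp=5, fn=5):.3f}")
```

Because F1 penalises both missed variants and false calls, a difference of a few tenths of a percentage point still corresponds to a noticeable number of variants at whole-genome scale.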

In conclusion, duplicate marking and BQSR have limited impact on variant calling accuracy in MGI data and can be omitted to reduce computational cost. DeepVariant showed a consistent advantage and may be the preferred variant caller for this data type.

Keywords: bioinformatics, sequencing, sarek, pipeline, benchmarking

Acknowledgement: This work has been funded by EC project HORIZON-WIDERA-2023-ACCESS-07-01: InnoThyroGen (GA No. 101187880)

2026, Belgrade
