A pipeline for the identification of disease-specific genetic biomarkers using NGS sequencing data of cfDNAs in human plasma

Alessandra Vittorini Orgeas*, Christoph W. Sensen

HCEMM, Szeged, Hungary

alessandra.vittorini [at] hcemm.eu

Abstract

Circulating cell-free DNAs (cfDNAs) are DNA fragments released into the blood by different tissues. They can be isolated from a routine blood draw, which is a cost-effective, fast and non-invasive procedure. While they are detectable under healthy conditions, there is solid evidence of their association with various clinical conditions. This explains the great interest in cell-free DNAs in clinical settings, owing to their potential role as biomarkers. Furthermore, the availability of larger numbers of samples and of high-throughput sequencing technologies has enabled the production of massive amounts of complex genomic data. As a consequence, the potential of cell-free DNAs can be exploited only if supported by a computational platform that streamlines the execution of the analysis and makes it reproducible and shareable across different platforms. The proposed computational pipeline addresses the complexity of this analysis. The pipeline starts with the pre-processing of the raw data, which removes residual adapters and filters out low-quality reads. This is followed by a composition analysis, in which the average sample composition is calculated and expressed in terms of specific target regions of human DNA. Next, a random forest classifier searches for the subset of target regions that performs best at predicting the health outcome. The statistical significance of the output is validated with a MANOVA test. Finally, the pipeline runs through the original set of sequencing data to retrieve the sequences that best match the composition represented by the pool of target regions selected in the previous step. The final result is a list of nucleotide sequences identified as the best-performing indicators of a clinical condition.
Thanks to containerization technology and workflow managers, this pipeline can be shared and executed with the same functionality across different platforms, and its installation process is automated.
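
As an illustration only, the composition step described above could be sketched in Python as follows. This is not the pipeline's own code: the function names, region labels and read counts are hypothetical, and the sketch assumes that reads have already been assigned to target regions and counted per sample. Each sample is normalized to fractions of its total reads, and the fractions are then averaged over the samples of a group.

```python
from collections import defaultdict


def sample_composition(region_counts):
    """Convert raw read counts per target region into fractions that sum to 1."""
    total = sum(region_counts.values())
    if total == 0:
        raise ValueError("sample has no reads assigned to any target region")
    return {region: count / total for region, count in region_counts.items()}


def average_composition(samples):
    """Average the per-sample compositions over a group of samples.

    A region that is absent from a sample contributes a fraction of zero.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    sums = defaultdict(float)
    for counts in samples:
        for region, fraction in sample_composition(counts).items():
            sums[region] += fraction
    return {region: total / len(samples) for region, total in sums.items()}


# Hypothetical read counts per target region for two plasma samples.
samples = [
    {"region_A": 50, "region_B": 30, "region_C": 20},
    {"region_A": 10, "region_B": 30, "region_C": 60},
]
mean_comp = average_composition(samples)
```

Normalizing each sample before averaging keeps a deeply sequenced sample from dominating the group average; whether the pipeline weights samples this way is an assumption of the sketch.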

Keywords: bioinformatics, computer science, DNA, sequencing, biomarkers

Acknowledgements: We thank Dr. Stefan Grabuschnig (Innophore GmbH, Austria) for the crucial help, guidance and insights in developing the computational framework of the pipeline. This work was funded in part by EU Horizon 2020 Grant No. 739593 and in part by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, with project no. TKP-2021-EGA-05 under the Thematic Excellence Programme and project no. 2022-2.1.1-NL-2022-00005 under the National Laboratory Programme.