Milana Djonovic1*, Andrew Norgan1 and Alexej Abyzov2
1Mayo Clinic
2Mayo Clinic
milana.djonovic [at] gmail.com
Abstract
Available viral detection tools provide broad taxonomic coverage but often overreport low-abundance viral targets and fail to address contamination-derived signals. We developed and evaluated a contamination-aware pan-viral metagenomic workflow focused on contamination removal and interpretable taxonomic reporting. A clinical control cohort of 62 samples with enriched viral DNA/RNA across multiple tissue types was analyzed using both a clinical benchmark workflow and our in-house workflow, which removes contamination-associated signals by mapping reads against artificial vector sequences. Initial benchmarking demonstrated strong concordance between the workflows. However, read counts frequently differed because results were reported at different taxonomic levels (e.g., strain, species, or genus), highlighting the need for taxonomic collapsing. Clinical interpretation is often better served by genus- or group-level identification than by strain-level resolution, so taxonomic collapsing can both improve interpretability and amplify the detection signal. We therefore identified the taxonomic levels at which the workflows converged and implemented targeted realignment using RefSeq references for the corresponding categories. This enabled aggregation of read counts within groups and quantitative comparison of read support across workflows. Together, these findings support a two-step framework: rapid screening against a compact reference database, followed by targeted realignment for signal refinement and improved taxonomic resolution.
Keywords: clinical metagenomics, contamination, databases, viruses
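
Illustrative sketch: the taxonomic collapsing described in the abstract can be pictured as summing per-taxon read counts at a shared ancestor rank, so that two workflows reporting at different levels become directly comparable. The Python below is a minimal sketch under stated assumptions, not the workflow's actual implementation: the toy taxonomy, the taxon names, the read counts, and the helper names collapse_to_rank and collapse_counts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: taxon -> (rank, parent). Illustrative only;
# a real workflow would derive this from a taxonomy resource.
TAXONOMY = {
    "strain_A1": ("strain", "species_A"),
    "strain_A2": ("strain", "species_A"),
    "species_A": ("species", "genus_X"),
    "species_B": ("species", "genus_X"),
    "species_C": ("species", "genus_Y"),
    "genus_X": ("genus", None),
    "genus_Y": ("genus", None),
}


def collapse_to_rank(taxon, target_rank, taxonomy=TAXONOMY):
    """Walk up the lineage and return the ancestor at target_rank, or None."""
    node = taxon
    while node is not None and node in taxonomy:
        rank, parent = taxonomy[node]
        if rank == target_rank:
            return node
        node = parent
    return None


def collapse_counts(read_counts, target_rank, taxonomy=TAXONOMY):
    """Sum read counts per ancestor at target_rank.

    Taxa with no ancestor at that rank keep their own label.
    """
    collapsed = defaultdict(int)
    for taxon, count in read_counts.items():
        group = collapse_to_rank(taxon, target_rank, taxonomy)
        collapsed[group if group is not None else taxon] += count
    return dict(collapsed)


# Hypothetical outputs of two workflows reporting at different levels.
workflow_a = {"strain_A1": 12, "strain_A2": 3, "species_C": 7}
workflow_b = {"species_A": 10, "species_B": 4, "genus_Y": 9}

genus_a = collapse_counts(workflow_a, "genus")
genus_b = collapse_counts(workflow_b, "genus")
```

After collapsing, both dictionaries are keyed by genus, so read support can be compared group by group even though the original reports shared no taxon labels.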

