Kseniia Poliakova1*, Maxim Cheprasov2, Gavril Novgorodov2, Alexandr Rakitko1,3, Fedor Sharko4 and Artem Nedoluzhko2,4,5
1Genotek Center: AI in Personalized Medicine, ITMO University, Saint Petersburg, Russia
2Mammoth Museum, North-Eastern Federal University, Yakutsk, Russia
3Genotek Ltd., Moscow, Russia
4European University at Saint Petersburg, Saint Petersburg, Russia
5National Research University Higher School of Economics, Moscow, 117418, Russia
poliakovakseniaa1 [at] gmail.com
Abstract
Reconstructing diet from ancient metagenomic data is complicated by short DNA fragments, contamination of reference genome sequences, and ambiguous taxonomic classification. In this study, we examined plant DNA preserved in the rectal contents of a woolly mammoth (Mammuthus primigenius) calf to characterize its diet using multiple computational approaches.
Paired-end sequencing reads were trimmed to remove adapters and aligned to the human genome to remove reads of human origin. The remaining reads were analyzed with three main strategies: k-mer-based taxonomic classification, alignment to a reference database of plastid genomes, and lowest common ancestor (LCA) assignment based on multiple alignments. To further improve specificity, reads mapping to bacterial genomes were filtered out.
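The LCA step can be illustrated with a minimal Python sketch. It assumes each alignment hit of a read has already been converted to a root-to-tip lineage; the function name, the lineage format, and the placeholder species labels are hypothetical and do not reproduce the tool used in the study.

```python
def lineage_lca(lineages):
    """Return the deepest taxonomic path shared by all lineages.

    Each lineage is a root-to-tip list of taxon names. This input
    format is an assumption made for illustration only.
    """
    if not lineages:
        return []
    shared = []
    # Walk the ranks in parallel; zip stops at the shortest lineage.
    for ranks in zip(*lineages):
        if all(name == ranks[0] for name in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Toy example: one read aligns to two related references.
# "Carex sp. A" and "Carex sp. B" are placeholder labels, not real hits.
hits = [
    ["Viridiplantae", "Poales", "Cyperaceae", "Carex", "Carex sp. A"],
    ["Viridiplantae", "Poales", "Cyperaceae", "Carex", "Carex sp. B"],
]
read_assignment = lineage_lca(hits)
print(read_assignment[-1])  # Carex
```

When a read aligns equally well to several related species, this kind of assignment stops at the deepest rank they share (here the genus) instead of committing to a single species.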
Consistent signals across all methods were observed for taxa in the family Cyperaceae, particularly the genus Carex, and for Comarum palustre. These taxa emerged as the most reliable components of the dataset. In contrast, other taxa varied with the analytical method used. For instance, reads initially attributed to Arisaema ringens were substantially reduced after bacterial filtering, indicating that these signals likely arose from reference database contamination rather than from actual dietary components. Signals related to Poaceae were also detected but were distributed across several taxa, likely because relevant species are poorly represented in reference databases, which hindered confident identification at lower taxonomic levels. A similar pattern was observed for Carex, where reads mapped to multiple related species, reflecting the absence of an exact reference genome.
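The cross-method comparison described above can be sketched as follows. This is a toy illustration, not the study's pipeline: the read IDs, the per-method assignments, and both helper functions are invented for the example and carry no real counts.

```python
def drop_bacterial(assignments, bacterial_read_ids):
    """Keep read -> taxon assignments whose read did not map to bacteria."""
    return {
        read_id: taxon
        for read_id, taxon in assignments.items()
        if read_id not in bacterial_read_ids
    }


def consensus_taxa(calls_by_method):
    """Return the taxa reported by every method."""
    taxon_sets = [set(calls.values()) for calls in calls_by_method.values()]
    return set.intersection(*taxon_sets) if taxon_sets else set()


# Hypothetical toy data: read IDs and assignments are made up.
calls_by_method = {
    "kmer": {"r1": "Carex", "r2": "Comarum palustre", "r3": "Arisaema ringens"},
    "plastid": {"r1": "Carex", "r2": "Comarum palustre", "r4": "Poaceae"},
    "lca": {"r1": "Carex", "r2": "Comarum palustre", "r3": "Arisaema ringens"},
}
bacterial_read_ids = {"r3"}  # assumed output of a separate bacterial alignment

filtered = {
    method: drop_bacterial(calls, bacterial_read_ids)
    for method, calls in calls_by_method.items()
}
stable = consensus_taxa(filtered)
print(sorted(stable))  # ['Carex', 'Comarum palustre']
```

Requiring agreement across all methods is deliberately conservative: a taxon supported by only one approach, or only by reads that also match bacterial genomes, drops out of the consensus set.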
The results indicate that sedges were a major part of the mammoth’s diet. This study highlights the importance of employing multiple analytical approaches and carefully considering contamination and limitations of reference databases to minimize taxonomic misclassification in ancient metagenomic studies.
Keywords: ancient DNA, metagenomics, Mammuthus primigenius

