Meta-analysis of oral shotgun metagenomes for prediction of caries and periodontitis

Serafim Dobrovolskii1, Aleksandra Denisova2*, Polina Kuznetsova1, Layal Shaheen3, Dmitrii Kharitonov4, Anna Ilinskaya5, Michael Agami6, Valery Ilinsky5 and Alexander Rakitko7

1Genotek Center: AI in Personalized Medicine, ITMO University, Saint-Petersburg, Russia

2Genotek Ltd., Moscow, Russia; National Research University Higher School of Economics, Russian Federation

3Genotek Ltd., Moscow, Russia; Moscow Center for Advanced Studies, Moscow, Russia

4Genotek Ltd., Moscow, Russia; Genotek Center: AI in Personalized Medicine, ITMO University, Saint-Petersburg, Russia

5Eligens SIA, Riga, Latvia

6Michael Agami's Family Dentistry

7Genotek Ltd., Moscow, Russia; National Research University Higher School of Economics, Russian Federation; Genotek Center: AI in Personalized Medicine, ITMO University, Saint-Petersburg, Russia

alexandraa.denisova [at] gmail.com

Abstract

The transition from microarrays to whole-genome sequencing (WGS) enables simultaneous analysis of the host genome and the oral microbiome. Oral dysbiosis has been linked to systemic diseases through immune and metabolic mechanisms, which makes microbiome features promising candidates for disease risk assessment.

In this work, publicly available shotgun metagenomic saliva datasets were collected for caries (ICD-10 K02) and periodontitis (ICD-10 K05); other dental diseases lacked sufficient data for integrative analysis. After retrieval from SRA/ENA, 14 independent salivary WGS studies comprising 545 samples in total were selected and subjected to quality filtering, host read removal, and taxonomic and functional profiling using MetaPhlAn and HUMAnN.

At the taxonomic level, samples were more strongly separated by bioproject than by clinical group. In Bray–Curtis space, PERMANOVA showed that bioproject explained 28.6% of the variation (R² = 0.286, p < 0.001), whereas the clinical group accounted for 4.1% (R² = 0.041, p < 0.001). Despite the strong batch effect, disease-associated differences remained significant across all cohorts.
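The variance partitioning reported above can be illustrated with a minimal sketch of a one-factor PERMANOVA on Bray–Curtis dissimilarities. This is not the authors' pipeline: the toy abundance table, the study labels, the number of permutations, and all function names are hypothetical, and in practice an established implementation (for example `permanova` in scikit-bio or `adonis2` in the R package vegan) would normally be used.

```python
import random

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den > 0 else 0.0

def distance_matrix(samples):
    """Full symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]

def permanova_r2(dist, labels):
    """R2 = SS_between / SS_total for a single grouping factor."""
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for group in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        pairs = sum(dist[i][j] ** 2 for a, i in enumerate(idx) for j in idx[a + 1:])
        ss_within += pairs / len(idx)
    return (ss_total - ss_within) / ss_total

def permanova_test(dist, labels, n_perm=999, seed=0):
    """Observed R2 and a permutation p-value obtained by shuffling the labels."""
    rng = random.Random(seed)
    observed = permanova_r2(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if permanova_r2(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical relative abundances: rows are samples, columns are taxa.
toy = [
    [10, 0, 5], [9, 1, 6],    # made-up study "A"
    [1, 10, 4], [0, 12, 5],   # made-up study "B"
]
study = ["A", "A", "B", "B"]
dist = distance_matrix(toy)
r2, p_value = permanova_test(dist, study, n_perm=199)
print(f"R2 = {r2:.3f}, p = {p_value:.3f}")
```

Running the same function once with bioproject labels and once with clinical-group labels gives the two R² values that are being compared. For a single factor, permuting on R² is equivalent to permuting on the pseudo-F statistic, because one is a monotonic function of the other when the number of groups and samples is fixed.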

Characteristic microbial patterns were reproduced for periodontitis: enrichment of Porphyromonas gingivalis, Treponema denticola, Tannerella forsythia, and representatives of Saccharibacteria, as well as enrichment of polyamine and L-ornithine biosynthesis pathways. Caries was characterized by enrichment of Streptococcus mutans and Veillonella parvula and of pathways involved in carbohydrate metabolism and biofilm formation.

Predictive models were developed using taxonomic, functional, and combined feature sets with CLR, ILR, and rank-based transformations. The best performance for periodontitis was achieved by the ridge logistic regression model using CLR-transformed taxonomic data (AUC = 0.903 ± 0.011), comparable to previously reported plaque-based shotgun metagenomic models (AUC ~0.93–0.97). In contrast, the highest predictive performance for caries reached AUC = 0.802 ± 0.045. Leave-one-study-out (LOSO) validation for periodontitis demonstrated good generalizability, with AUC values reaching 0.876 and 0.934, while performance decreased moderately for clinically heterogeneous cohorts (AUC = 0.778 and 0.735).
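The modelling strategy described above (CLR transformation, ridge-penalized logistic regression, and leave-one-study-out evaluation by AUC) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the pseudocount used for zeros, the penalty strength, the gradient-descent settings, the toy counts, and all function names are assumptions. A real analysis would more likely rely on library code such as `LogisticRegression` and `LeaveOneGroupOut` from scikit-learn.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log over all parts.
    The pseudocount for zeros is an assumption, not taken from the paper."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_ridge_logreg(X, y, l2=1.0, lr=0.1, n_iter=500):
    """L2-penalized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # The ridge penalty shrinks the weights; the intercept is not penalized.
        w = [wj - lr * (gj + l2 * wj) / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def roc_auc(y_true, scores):
    """AUC as the share of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def loso_auc(X, y, studies, l2=1.0):
    """Train on all studies but one, score the held-out study, repeat for each study."""
    results = {}
    for held_out in sorted(set(studies)):
        train = [i for i, s in enumerate(studies) if s != held_out]
        test = [i for i, s in enumerate(studies) if s == held_out]
        w, b = fit_ridge_logreg([X[i] for i in train], [y[i] for i in train], l2=l2)
        scores = [b + sum(wj * xj for wj, xj in zip(w, X[i])) for i in test]
        results[held_out] = roc_auc([y[i] for i in test], scores)
    return results

# Hypothetical taxon counts for three made-up studies (1 = case, 0 = control).
counts = [
    [30, 5, 10], [25, 8, 12], [3, 20, 15], [5, 18, 14],
    [60, 10, 30], [40, 12, 20], [4, 30, 25], [8, 40, 30],
    [15, 2, 6], [20, 4, 5], [2, 9, 8], [1, 12, 9],
]
y = [1, 1, 0, 0] * 3
studies = ["S1"] * 4 + ["S2"] * 4 + ["S3"] * 4
X = [clr(row) for row in counts]
aucs = loso_auc(X, y, studies)
print(aucs)
```

Because each held-out study contributes no samples to training, the per-study AUC reflects transfer across cohorts rather than within-cohort fit, which is why LOSO estimates are typically lower than pooled cross-validation estimates when batch effects are strong.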

Overall, the resulting models demonstrated robust and stable performance in LOSO validation, supporting their potential application for predicting oral diseases in a large unlabeled cohort. To our knowledge, this work is one of the largest integrative shotgun metagenomic analyses of saliva microbiomes for oral disease prediction.

Keywords: metagenomics, saliva microbiome, caries, periodontitis