Jovana Kovacevic*
Associate professor
jovana.kovacevic [at] matf.bg.ac.rs
Abstract
Comprehensive analysis of gene–disease associations is fundamental for biomedical and translational research; however, relevant data remain dispersed across multiple databases, often using heterogeneous formats and inconsistent disease terminology. To address these challenges, we developed ProDiGenIDB (Protein Disease Gene Integrated Database), an integrated resource designed to harmonize gene–disease relationships from diverse, publicly available databases while providing consistent gene and disease annotations.
ProDiGenIDB integrates more than 400,000 gene–disease associations collected from established sources including DisGeNET, COSMIC, HumsaVar, Orphanet, ClinVar, HPO, and DISEASES. For each association, the database provides standardized gene identifiers (Gene Symbol, Entrez ID, UniProt ID, Ensembl ID), curated disease descriptors, Disease Ontology identifiers (DOIDs), and references to the original data sources, enabling traceability and interoperability.
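To make the record structure concrete, the following is a minimal Python sketch of how one integrated association might be represented and how duplicates reported by several sources could be collapsed while keeping provenance. The class name `Association`, the function `merge_associations`, the field names, and the choice of (Entrez ID, DOID) as the merge key are all assumptions made for illustration; the abstract does not describe the actual schema or merging procedure. The example values are illustrative and are not taken from the database.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Association:
    """One gene-disease association (hypothetical schema, not the real one)."""
    gene_symbol: str
    entrez_id: str
    uniprot_id: str
    ensembl_id: str
    disease_name: str
    doid: str
    sources: set = field(default_factory=set)


def merge_associations(records):
    """Collapse records that share (entrez_id, doid), unioning their sources.

    The merge key is an assumption chosen for this sketch.
    """
    merged = {}
    for rec in records:
        key = (rec.entrez_id, rec.doid)
        if key in merged:
            merged[key].sources |= rec.sources
        else:
            # Copy the record so the caller's objects are not mutated.
            merged[key] = replace(rec, sources=set(rec.sources))
    return list(merged.values())


# Illustrative input: the same association reported by two sources.
a = Association("BRCA1", "672", "P38398", "ENSG00000012048",
                "breast cancer", "DOID:1612", {"ClinVar"})
b = Association("BRCA1", "672", "P38398", "ENSG00000012048",
                "breast cancer", "DOID:1612", {"Orphanet"})

merged = merge_associations([a, b])
print(len(merged), sorted(merged[0].sources))
```

Keeping the source names as a set on each merged record is one simple way to preserve the traceability the abstract mentions: every association can still be traced back to each database that reported it.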
A key component of the database construction is the normalization and harmonization of disease terminology. To achieve robust mapping of disease names to standardized ontology terms, we employed Natural Language Processing (NLP) techniques based on advanced text representation models. This approach improves consistency across heterogeneous datasets and enhances the reliability of downstream analyses.
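The general idea of mapping a free-text disease name to an ontology term can be sketched as a nearest-neighbor search over text representations. The abstract does not specify which representation models were used, so the sketch below substitutes a much simpler technique, cosine similarity over character trigram counts, purely as a stand-in. The function names (`char_ngrams`, `cosine`, `map_to_ontology`), the similarity threshold of 0.5, and the three-entry ontology dictionary are assumptions for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def map_to_ontology(name, ontology, threshold=0.5):
    """Return (best DOID or None, score) for a free-text disease name.

    The threshold is an arbitrary value chosen for this sketch.
    """
    query = char_ngrams(name)
    best_id, best_score = None, 0.0
    for doid, label in ontology.items():
        score = cosine(query, char_ngrams(label))
        if score > best_score:
            best_id, best_score = doid, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Tiny illustrative subset of ontology labels.
ontology = {
    "DOID:1612": "breast cancer",
    "DOID:9352": "type 2 diabetes mellitus",
    "DOID:10652": "Alzheimer's disease",
}

for raw in ["Breast Cancer", "Alzheimer disease", "xyz"]:
    print(raw, "->", map_to_ontology(raw, ontology))
```

A learned embedding model would replace `char_ngrams` with a dense vector encoder, while the nearest-neighbor search and thresholding would stay structurally the same. Names that fall below the threshold are left unmapped here rather than forced onto the closest term.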
By unifying gene–disease associations and applying ontology-based standardization supported by NLP methods, ProDiGenIDB provides a scalable and extensible platform for integrative studies. The database facilitates systematic exploration of disease mechanisms and supports data-driven research in genomics, disease classification, and precision medicine.
Keywords: gene–disease associations, data harmonization

