DigestedProteinDB: A compact key–value database for in silico peptide digestion and mass-based search

Krešimir Križanović1, Janko Diminic2*, Antonio Starcevic2, Toni Čvrljak3 and Jurica Zucko2

1University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb

2University of Zagreb Faculty of Food Technology and Biotechnology, Zagreb

3University of Zagreb Faculty of Mining, Geology and Petroleum Engineering, Zagreb, Croatia

jdiminic [at] pbf.hr

Abstract

Efficient comparison of experimental peptide masses with theoretical values is a central step in mass spectrometry (MS)–based proteomics, microbial biotyping, and MS imaging. As protein sequence databases such as UniProtKB continue to expand, the need for fast, scalable access to large collections of in silico–digested peptides has grown substantially. Existing approaches either rely on computationally demanding on-the-fly digestion or require high-performance server infrastructure for centralized peptide databases, which limits their accessibility for routine laboratory use.

We developed DigestedProteinDB, a compact and high-performance key–value database of peptides generated by enzymatic in silico digestion of UniProtKB/Swiss-Prot and TrEMBL sequences. We implemented the system using RocksDB and incorporated multiple optimization strategies to minimize disk footprint and accelerate mass-range queries. We discretized peptide masses to four decimal places and stored them as 32-bit integer keys, enabling efficient range-based retrieval aligned with the LSM-tree architecture of RocksDB. We encoded peptide sequences using a compact scheme of 5 bits per amino acid residue, yielding a reduction of approximately 37.5% in raw sequence storage compared to standard byte-based encoding. Furthermore, we serialized UniProt accession numbers using Base36 encoding and represented them as 64-bit integers, while maintaining internal integer identifiers via an in-memory array for constant-time resolution. We applied additional compression using Snappy at the RocksDB block level.
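The three encodings described above can be illustrated with a minimal Python sketch. It is not the DigestedProteinDB implementation: the residue alphabet and its ordering, the big-endian key layout, and all function names (`mass_to_key`, `pack_peptide`, `unpack_peptide`, `accession_to_int`, `int_to_accession`) are assumptions made for illustration only.

```python
import struct

# Hypothetical 5-bit alphabet (20 standard residues plus extended codes).
# The actual symbol table used by DigestedProteinDB is not stated in the text.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYBJOUXZ"
CODE = {aa: i for i, aa in enumerate(ALPHABET)}
BASE36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def mass_to_key(mass_da: float) -> bytes:
    """Discretize a mass to four decimals and pack it as an unsigned 32-bit key.

    Big-endian layout is an assumption: it makes byte-wise key order match
    numeric order, which is what a byte-ordered store needs for range scans.
    """
    return struct.pack(">I", round(mass_da * 10_000))


def pack_peptide(seq: str) -> bytes:
    """Pack a peptide at 5 bits per residue, zero-padded to a whole byte."""
    bits = 0
    for aa in seq:
        bits = (bits << 5) | CODE[aa]
    nbits = 5 * len(seq)
    pad = (-nbits) % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_peptide(data: bytes, length: int) -> str:
    """Inverse of pack_peptide; the residue count must be known separately."""
    bits = int.from_bytes(data, "big") >> ((-5 * length) % 8)
    return "".join(
        ALPHABET[(bits >> (5 * (length - 1 - i))) & 31] for i in range(length)
    )


def accession_to_int(acc: str) -> int:
    """Read an alphanumeric accession as a Base36 number."""
    return int(acc, 36)


def int_to_accession(n: int) -> str:
    """Inverse of accession_to_int (accessions start with a letter,
    so no leading zeros are lost)."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = BASE36[r] + out
    return out or "0"
```

Under these assumptions, an 8-residue peptide occupies 5 bytes instead of 8, which is the 37.5% reduction quoted above, and the largest 10-character accession (`ZZZZZZZZZZ`) maps to 36^10 - 1, which fits in a signed 64-bit integer.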

We benchmarked the system using 252 million UniProtKB protein sequences digested with trypsin, allowing up to two missed cleavages and peptide lengths of 6–50 amino acids, yielding approximately 5.9 billion peptide entries. The resulting database occupied approximately 250 GB of disk space and required less than 16 GB of RAM during both construction and querying. Database construction took approximately two days, and mass-range queries averaged 200–300 ms. We performed all operations on a standard workstation without high-performance server infrastructure.

DigestedProteinDB is publicly accessible via a web interface and REST API. The modular design allows rapid generation of experiment-specific databases, making it suitable for integration into diverse MS-based analytical workflows.

Keywords: peptide, database, RocksDB, UniProtKB, mass

Acknowledgement: This research was funded by the European Union (NextGenerationEU) project “The glycome and microbiome as markers of dietary impact on the health of women of reproductive age” within the program “Targeted Scientific Research” (NPOO.C3.2.R3-I1.04.0073).