Classification of coronavirus types based on repeats derived from amino acid and nucleotide sequences

Anđela Damnjanović^*

Faculty of Mathematics, Belgrade, Serbia

andjela.damnjanovic13 [at] gmail.com

Abstract

The spike protein is one of the four major structural proteins of coronaviruses, alongside the membrane, nucleocapsid, and envelope proteins. These proteins play important roles in different stages of the infectious cycle. Despite its relatively small size compared to the entire genome, the spike protein is arguably the most important of the four, as it enables the virus to bind to the host cell receptor and enter the cell, which provides the environment needed to produce new viral copies. Since different types of coronaviruses infect different hosts, their spike proteins, which bind to the receptors of those hosts, are expected to differ as well.

The aim of this study was therefore to develop and evaluate several classification models for predicting coronavirus types based on features extracted from amino acid and nucleotide sequences. Repeated subsequences (repeats) were used as features. In nucleotide sequences, four types of repeats can be identified: direct non-complementary, direct complementary, inverse non-complementary, and inverse complementary. Because amino acids have no notion of complementarity, only direct non-complementary and inverse non-complementary repeats can be found in amino acid sequences.
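The four repeat types can be illustrated with a minimal Python sketch. It is not the extraction procedure used in the study. It assumes the common convention that a repeat is a pair of non-overlapping fragments of fixed length k in which the right fragment equals the left one as is (direct non-complementary), its complement (direct complementary), its reverse (inverse non-complementary), or its reverse complement (inverse complementary). The names `find_repeats`, `repeat_transforms`, and `is_nucleotide` are hypothetical.

```python
from collections import defaultdict

# Watson-Crick complement table for DNA (assumption: sequences use only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_transforms(is_nucleotide):
    """Map each repeat type to the transformation applied to the left fragment."""
    transforms = {
        "direct_noncomplementary": lambda s: s,
        "inverse_noncomplementary": lambda s: s[::-1],
    }
    if is_nucleotide:
        # Complementarity is defined only for nucleotides, not for amino acids.
        transforms["direct_complementary"] = lambda s: s.translate(COMPLEMENT)
        transforms["inverse_complementary"] = lambda s: s.translate(COMPLEMENT)[::-1]
    return transforms


def find_repeats(seq, k, is_nucleotide=True):
    """Return {repeat_type: [(left_start, right_start), ...]} for fragments of length k."""
    # Index every k-mer by its start positions.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    found = {}
    for name, transform in repeat_transforms(is_nucleotide).items():
        pairs = []
        for i in range(len(seq) - k + 1):
            target = transform(seq[i:i + k])
            for j in positions.get(target, ()):
                if j >= i + k:  # keep only non-overlapping pairs, left before right
                    pairs.append((i, j))
        found[name] = pairs
    return found


if __name__ == "__main__":
    for name, pairs in find_repeats("ACGTTTACGT", k=4).items():
        print(name, pairs)
    for name, pairs in find_repeats("MKVLAALVKM", k=3, is_nucleotide=False).items():
        print(name, pairs)
```

Counts of such pairs per repeat type and length could then serve as numeric features for a classifier; which repeat lengths and encodings the study actually used is not stated here.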

The dataset used in this study consisted of spike proteins from 20 different types of coronaviruses, resulting in over 26,000 instances. The implemented classification models included Decision Trees, k-Nearest Neighbors (kNN), Neural Networks, and the Naive Bayes classifier, as well as ensemble methods such as Random Forest and XGBoost. In addition, a Voting classifier was tested, combining individual classifiers with complementary strengths. Model performance was evaluated using standard metrics: precision, recall, and F1-score.
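As an illustration of how a voting classifier combines individual models, the sketch below implements hard (majority) voting over label predictions in plain Python. Whether the study used hard or soft voting is not stated, so majority voting is an assumption here, as are the function name `hard_vote`, the tie-breaking rule, and the toy labels.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine label predictions from several models by majority vote.

    predictions_per_model: list of equal-length lists, one list of labels per model.
    Ties are broken in favour of the model listed first (an arbitrary choice).
    """
    voted = []
    for sample_predictions in zip(*predictions_per_model):
        counts = Counter(sample_predictions)
        top = max(counts.values())
        # First label, in model order, that reaches the highest vote count.
        voted.append(next(p for p in sample_predictions if counts[p] == top))
    return voted


if __name__ == "__main__":
    # Toy predictions from three hypothetical models on three samples.
    tree = ["alpha", "beta", "gamma"]
    knn = ["alpha", "gamma", "gamma"]
    bayes = ["beta", "beta", "alpha"]
    print(hard_vote([tree, knn, bayes]))
```

A sample misclassified by one model is still labelled correctly as long as the majority of the other models agree on the right class, which is the sense in which combining classifiers can offset individual weaknesses.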

The models achieved very high performance, reaching up to 99% accuracy, with precision and recall exceeding 90% on most datasets and macro-averaged F1-scores above 0.9 in the majority of cases. These results indicate that repeat-based features carry meaningful information for distinguishing between coronavirus types. Ensemble methods, particularly Random Forest and XGBoost, achieved the best overall performance, while combined classifiers further improved prediction accuracy by compensating for the weaknesses of individual models.
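For reference, the macro-averaged F1-score is the unweighted mean of the per-class F1-scores, so rare and frequent classes count equally. The sketch below computes it from scratch using the standard definitions of precision, recall, and F1; the function names and the toy labels are illustrative and do not come from the study.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy example with two made-up class labels.
    y_true = ["type_a", "type_a", "type_b", "type_b"]
    y_pred = ["type_a", "type_b", "type_b", "type_b"]
    print(per_class_scores(y_true, y_pred))
    print(round(macro_f1(y_true, y_pred), 4))
```

With 20 virus types of potentially very different sample sizes, a macro average is more sensitive to poor performance on small classes than overall accuracy is.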

In conclusion, text-inspired feature extraction from biological sequences effectively supports virus classification, while combining models and engineered repeat features improves robustness and accuracy.

Keywords: coronavirus, classification, classification methods, repeats

Acknowledgement: The author is partially supported by the Ministry of Science, Technological Development and Innovation, Republic of Serbia, through the project 451-03-33/2026-03/200104.

2026, Belgrade
