Biljana Stojanović1*, Saša Malkov2, Miloš Beljanski3 and Nenad Mitić2
1Mathematical Institute of the Serbian Academy of Sciences and Arts, Belgrade, Serbia
2Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
3Institute for General and Physical Chemistry, Belgrade, Serbia
bstojanovic [at] mi.sanu.ac.rs
Abstract
Codon usage bias (CUB) reflects the non-random usage of synonymous codons and represents an important feature associated with viral evolution, host adaptation, and translational efficiency. We analyzed CUB in approximately 5 million SARS-CoV-2 protein-coding sequences collected between 15 December 2019 and 22 May 2023. The analysis was performed on the complete dataset of protein amino acid sequences and corresponding coding sequences, as well as on 20 temporal subsets obtained by partitioning isolate collection dates into discrete time intervals. Four proteins (surface glycoprotein, nucleocapsid phosphoprotein, ORF1a polyprotein, and ORF1ab polyprotein) were selected for detailed analysis, each exhibiting more than 3% unique coding sequences relative to the total number of sequences.
For each protein type, we analyzed (i) codon usage patterns across all corresponding coding sequences using global codon frequencies and Relative Synonymous Codon Usage (RSCU) values as codon-specific CUB measures, (ii) data derived from alignments of individual amino acid sequences against the corresponding amino acid sequence of the reference isolate, and (iii) coding sequences corresponding to aligned amino acid sequences.
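As an illustration of the codon-specific measure named above, the following minimal sketch computes RSCU using its standard definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is not the authors' pipeline. The function names `count_codons` and `rscu` and the toy sequence are hypothetical, and the sketch assumes in-frame coding sequences translated with the standard genetic code, with stop codons excluded.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1) in the compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def count_codons(sequences):
    """Count codons over a collection of in-frame coding sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        # Ignore a trailing partial codon, if any.
        for pos in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[pos:pos + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases such as N
                counts[codon] += 1
    return counts


def rscu(counts):
    """Return RSCU per codon: observed count / mean count of its synonymous family."""
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":  # assumption: stop codons are left out
            families[amino_acid].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the data; RSCU is undefined
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total
    return values


# Toy example (not taken from the dataset): four alanine codons.
example = rscu(count_codons(["GCTGCTGCCGCA"]))
print(example["GCT"], example["GCC"], example["GCG"])  # 2.0 1.0 0.0
```

An RSCU value of 1 indicates no bias for that codon, values above 1 indicate over-representation, and values below 1 indicate under-representation within its synonymous family.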
We identified low-frequency codons across the analyzed datasets, as well as specific positions within each reference protein at which codon frequencies changed significantly over time. The analysis also examined codon usage patterns in relation to the WHO-defined lineage associated with each isolate. In addition, CUB was analyzed for each protein type at the level of complete coding sequences using the gene-specific measures Effective Number of Codons (ENC) and Relative Codon Bias Score (RCBS). These measures provided an objective characterization of CUB for each analyzed protein type and complemented the codon-specific analysis based on global codon frequencies and RSCU values.
The results indicate that the most pronounced changes in codon usage over time occurred in surface glycoprotein and nucleocapsid phosphoprotein. Some synonymous codons show relatively stable frequencies over time, whereas others decrease in abundance at different stages (early, intermediate, or late) and with varying magnitudes.
Keywords: SARS-CoV-2, codons, dynamics of change

