The sequence complexity estimates: algorithms and applications

Yuriy L. Orlov1*, Nina G. Orlova2

1 Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, Russia

2 Financial University under the Government of the Russian Federation, Moscow, Russia

orlov [at] d-health.institute

Abstract

We discuss current methods and tools for algorithmic complexity estimates of genetic texts (information and entropy measures). The search for DNA regions with extreme statistical characteristics is important for biophysical models of chromosome function and gene transcription regulation at the genome scale. Complexity profiling has been applied to the segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and the analysis of next-generation sequencing reads. We review the complexity methods and new fields of application: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools estimate sequence complexity using compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools available online provide services for whole-genome analysis. Novel machine learning applications for the classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets. Further, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. Low complexity regions in protein sequences are conserved in evolution and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.

Keywords: algorithms, text complexity, entropy, Lempel-Ziv compression, genetic code, low complexity regions, sequencing artefacts, genomic rearrangement, alignment-free, genome comparison, online tools
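
The two families of measures named in the abstract, entropy and Lempel-Ziv compression, can be illustrated with a minimal Python sketch. It is not the implementation used by any of the reviewed tools. The function names (shannon_entropy, lz76_phrases, complexity_profile), the window and step parameters, and the choice of an LZ76-style phrase count with overlapping copies allowed are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def lz76_phrases(seq):
    """Number of phrases in an LZ76-style parsing of seq (illustrative variant).

    Each phrase is the shortest substring starting at the current position
    that has not occurred earlier in the text; a copy may overlap the phrase
    being built. Fewer phrases means a more repetitive sequence.
    """
    n = len(seq)
    i = 0
    phrases = 0
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from the text
        # preceding its last symbol.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def complexity_profile(seq, window, step, measure):
    """Apply measure to sliding windows; return (start, value) pairs."""
    return [
        (start, measure(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    dna = "AAAAAAAA" + "ACGTTGCA"  # toy sequence: a homopolymer run, then mixed bases
    for start, value in complexity_profile(dna, window=8, step=8, measure=lz76_phrases):
        print(start, value)
```

In this toy sequence the homopolymer window is parsed into fewer phrases than the mixed window, which is the kind of contrast that complexity profiling uses to flag low complexity regions. The same functions accept amino acid strings unchanged, since they make no assumption about the alphabet.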