Sequence-based Hierarchical Classification of Tandem Repeats using Neural Network Models

Nevena Ćirić1*, Jovana Kovačević1

1 Faculty of Mathematics, University of Belgrade, Belgrade, Serbia

nevena.ciric [at] matf.bg.ac.rs

Abstract

Repeat proteins are a widespread class of mostly non-globular proteins containing repetitive subsequences, so-called repeat units, which often occur in tandem arrangements in the 3D structure of the protein. These tandem repeats are highly diverse, ranging from repetitions of a single amino acid to domains of 100 or more residues.

Improved methods for identifying protein tandem repeats, and the resulting growth in the number of known proteins containing repetitive elements, make it necessary to classify these repeats in order to better understand their sequence-structure-function relationships. Kajava's classification scheme [1] is based on the length of the repeat unit, the general structural arrangement, and the mode of interaction between repeat units. It divides tandem repeats into five main classes, which are further divided into subclasses that reflect repeat unit topology and differ in secondary structure arrangement and/or overall structure within the repeat.

The classical approach to assigning a (sub)class to a newly identified tandem repeat is to transfer it from the “master” repeat unit, that is, the repeat unit in a database of previously annotated tandem repeats that is most similar to the newly identified repeat. Finding the master repeat unit usually requires a structural search algorithm, which in turn requires a known tertiary structure for the protein containing the new repeat. To handle proteins with unknown 3D structure, to classify tandem repeats from sequence information alone, and to explore the relationship between repeat unit sequences and the structural characteristics of their (sub)class, we propose a neural-network-based model that classifies a tandem repeat from the multiple sequence alignment of its repeat unit sequences. The model can also be used as a component of an end-to-end pipeline that identifies and classifies tandem repeats from sequence information alone.
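As an illustration of how a multiple sequence alignment of repeat units might be turned into a fixed-size numeric input for a neural network, the following minimal sketch builds a per-column residue frequency profile. The abstract does not specify the input encoding, so the profile representation, the function name `msa_to_profile`, the `max_len` padding parameter, and the example units are all assumptions made for illustration only.

```python
import numpy as np

# 20 standard amino acids plus the gap symbol (an assumed alphabet for this sketch).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def msa_to_profile(units, max_len=None):
    """Convert aligned repeat units (equal-length strings) into a
    (positions x alphabet) matrix of per-column residue frequencies."""
    if not units:
        raise ValueError("need at least one aligned repeat unit")
    length = len(units[0])
    if any(len(u) != length for u in units):
        raise ValueError("aligned units must all have the same length")

    counts = np.zeros((length, len(ALPHABET)))
    for unit in units:
        for pos, ch in enumerate(unit.upper()):
            # Symbols outside the alphabet (e.g. X) are counted as gaps here.
            counts[pos, INDEX.get(ch, INDEX["-"])] += 1
    profile = counts / len(units)

    if max_len is not None:
        # Zero-pad or truncate so that every repeat yields the same input shape.
        padded = np.zeros((max_len, len(ALPHABET)))
        padded[:min(length, max_len)] = profile[:max_len]
        profile = padded
    return profile


# Made-up aligned repeat units, not taken from any real protein.
units = ["GLTPLH", "GNTALH", "G-TPLH"]
profile = msa_to_profile(units, max_len=8)
print(profile.shape)
```

A matrix of this kind could then be passed to a classifier with one output for the class and one for the subclass, but that architecture is likewise a hypothetical choice and not a description of the proposed model.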

Keywords: tandem repeats, classification, neural network, sequence-based