Dario Radojcic1* and Jovana Kovacevic2
1University of Belgrade, Faculty of Physics
2Associate professor
darioradojcic105 [at] gmail.com
Abstract
Protein function prediction using the Gene Ontology (GO) is a highly imbalanced multi-label learning problem in which labels are also semantically structured. In this setting, protein function is represented by a set of GO terms, while the ontology itself is organized as a directed acyclic graph (DAG). The model’s objective is therefore to predict the subset of ontology nodes corresponding to a protein’s functional annotations. Standard binary cross-entropy (BCE) treats GO terms as independent targets and assigns the same penalty to all errors, despite the fact that some incorrect predictions are semantically closer to the correct annotation than others. This creates two related limitations: BCE ignores semantic distances between GO terms, while the strong imbalance of GO annotations allows frequent terms and abundant negative labels to dominate the learning signal.
We propose a composite loss function designed for GO-based protein function prediction. The first component is inspired by asymmetric loss and is intended to reduce the effect of label imbalance by modulating the contribution of common negative examples and frequent GO terms. The second component introduces GO-term embeddings that capture learned semantic relatedness between ontology nodes, including relationships not directly represented by parent-child edges in the GO DAG. Prediction errors are then weighted according to distances in this embedding space, allowing the loss to distinguish between semantically close and biologically distant mistakes.
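As a rough illustration of how the two components described above could fit together, the following NumPy sketch combines an asymmetric-loss-style term with an embedding-distance weight. It is not the authors' formulation: the function names, the hyperparameters (`gamma_pos`, `gamma_neg`, `margin`, `lam`), the use of cosine distance, and the choice to weight each negative label by its distance to the nearest true GO term are all assumptions made for this example.

```python
import numpy as np

def asymmetric_term(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label asymmetric-loss-style term (hyperparameter values are assumed).

    Negatives get a larger focusing exponent and a probability margin, so
    easy, abundant negatives contribute little to the total loss.
    """
    p_neg = np.clip(p - margin, 0.0, 1.0)  # shifted probability for negatives
    loss_pos = y * (1.0 - p) ** gamma_pos * np.log(np.clip(p, eps, 1.0))
    loss_neg = (1.0 - y) * p_neg ** gamma_neg * np.log(np.clip(1.0 - p_neg, eps, 1.0))
    return -(loss_pos + loss_neg)

def semantic_weights(y, emb, lam=1.0):
    """Hypothetical weighting: 1 + lam * cosine distance to the nearest true term.

    y   : (n_proteins, n_terms) binary annotation matrix
    emb : (n_terms, dim) GO-term embedding matrix (assumed to be given)
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = np.clip(1.0 - unit @ unit.T, 0.0, 2.0)  # (n_terms, n_terms)
    weights = np.ones(y.shape, dtype=float)
    for i in range(y.shape[0]):
        pos = np.flatnonzero(y[i])
        if pos.size:  # proteins without annotations keep weight 1
            nearest = dist[:, pos].min(axis=1)
            weights[i] = np.where(y[i] == 1, 1.0, 1.0 + lam * nearest)
    return weights

def composite_loss(p, y, emb, lam=1.0):
    """Mean of the asymmetric term scaled by the semantic weight."""
    return float(np.mean(semantic_weights(y, emb, lam) * asymmetric_term(p, y)))

# Toy label space: term 1 lies close to term 0 in embedding space, term 2 far away.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
y = np.array([[1.0, 0.0, 0.0]])
near_miss = np.array([[0.9, 0.8, 0.1]])  # false positive on the close term
far_miss = np.array([[0.9, 0.1, 0.8]])   # false positive on the distant term
print(composite_loss(near_miss, y, emb), composite_loss(far_miss, y, emb))
```

In this toy setup the false positive on the distant term receives the larger loss, which is the qualitative behaviour the embedding-based weighting is meant to produce.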
We compare the proposed loss with commonly used losses in protein function prediction, including BCE and imbalance-aware multi-label losses. Instead of modifying the underlying model architecture, we investigate whether the training objective itself can be made more consistent with the semantic structure of the GO label space. By combining asymmetric imbalance correction with embedding-based semantic error weighting, the proposed approach provides a resource-efficient framework for studying whether semantically informed loss functions can improve CAFA-style protein function prediction.
Keywords: Protein function prediction, loss function

