Uxia Veleiro1*, David Mendez2, Noël Malod-Dognin3, Natasa Przulj4 and Mikel Hernaez5
1Mohamed bin Zayed University of Artificial Intelligence
2University of Granada
3School of Digital Public Health, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
4Mohamed bin Zayed University of Artificial Intelligence
5CIMA University of Navarra
uxia.veleiro [at] mbzuai.ac.ae
Abstract
Drug combination therapies are a promising therapeutic strategy for complex diseases, with particular relevance in oncology. By targeting multiple pathways simultaneously, such combinations can overcome drug resistance and improve treatment outcomes. However, given the combinatorial size of the search space, experimentally screening all possible drug combinations remains prohibitively expensive. In this context, machine learning models have emerged as a powerful approach for prioritizing potentially synergistic drug pairs. Yet, their performance is highly sensitive to the evaluation protocol used. While recent work has highlighted that evaluation with random data splits yields overly optimistic performance estimates, most methods either continue to rely on random splits or lack comparisons under stricter evaluation schemes.
In this work, we first introduce a reproducible evaluation framework for drug synergy prediction that goes beyond random splits by defining increasingly difficult generalization scenarios. The framework is easily adaptable to new models and enables systematic comparison of model performance across prediction settings. We then evaluate both off-the-shelf machine learning baselines and deep learning architectures, with a particular focus on whether different drug and cell-line featurizations lead to meaningful differences in generalization performance. More specifically, we investigate whether external biological information, including pathway-level knowledge, can provide useful inductive biases.
Our results show that conclusions drawn from random evaluation settings do not necessarily hold under stricter generalization scenarios, and that some widely used feature construction methods rely on transductive dependencies that complicate their correct assessment. Moreover, predicting synergy for unseen drugs is a major bottleneck, and model rankings become less stable under stricter generalization settings. Finally, we show that incorporating biological information can reduce the need for highly parameterized models, pointing toward more efficient and biologically grounded approaches.
Overall, our framework provides a reproducible way to benchmark drug synergy models in controlled generalization scenarios, allowing systematic comparison across models. Within this setting, we reveal differences that remain hidden under simpler splits, highlighting that robust prediction depends as much on evaluation design and biological knowledge as on model complexity.
Keywords: drug synergy, benchmarking, generalization evaluation
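As an illustration of the contrast between random splits and stricter generalization scenarios described in the abstract, the following minimal sketch compares a random split with a split that holds out entire drugs. It is not the framework's implementation: the function names (random_split, unseen_drug_split), the record layout (drug_a, drug_b, cell_line), the toy drug and cell-line identifiers, and the distinction between "one unseen" and "both unseen" test pairs are all assumptions made for this example.

```python
import itertools
import random


def random_split(records, test_frac=0.2, seed=0):
    """Random split: drugs in the test set may also appear in training."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def unseen_drug_split(records, holdout_frac=0.25, seed=0):
    """Hold out whole drugs (hypothetical scheme for illustration).

    Training keeps only pairs in which neither drug is held out. Test pairs
    are grouped by how many of their two drugs are unseen (one or both).
    """
    rng = random.Random(seed)
    drugs = sorted({d for a, b, _ in records for d in (a, b)})
    n_holdout = max(1, int(len(drugs) * holdout_frac))
    held_out = set(rng.sample(drugs, n_holdout))

    train, one_unseen, both_unseen = [], [], []
    for rec in records:
        # Count how many drugs of this pair are held out: 0, 1, or 2.
        n_unseen = (rec[0] in held_out) + (rec[1] in held_out)
        (train, one_unseen, both_unseen)[n_unseen].append(rec)
    return train, one_unseen, both_unseen, held_out


# Toy data (made up): every pair of 8 drugs measured in 2 cell lines.
drugs = [f"drug_{i}" for i in range(8)]
cell_lines = ["cell_A", "cell_B"]
records = [
    (a, b, c)
    for a, b in itertools.combinations(drugs, 2)
    for c in cell_lines
]

rand_train, rand_test = random_split(records)
train, one_unseen, both_unseen, held_out = unseen_drug_split(records)

# No held-out drug may leak into the training pairs.
assert all(a not in held_out and b not in held_out for a, b, _ in train)
print(len(rand_train), len(rand_test))
print(len(train), len(one_unseen), len(both_unseen))
```

Under the random split, a model can see every test drug during training, which is one reason such estimates tend to be optimistic. Under the held-out split, the training set shrinks and the test pairs contain drugs the model has never encountered, which matches the "unseen drugs" setting the abstract identifies as a bottleneck.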

