Palmacci Vincenzo, Nahal Yasmine, Welsch Matthias, Engkvist Ola, Kaski Samuel, Kirchmair Johannes
Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, 1090, Vienna, Austria.
Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, 1090, Vienna, Austria.
J Cheminform. 2025 Apr 29;17(1):64. doi: 10.1186/s13321-025-01014-3.
Assay interference caused by small organic compounds continues to pose formidable challenges to early drug discovery. Various computational methods have been developed to identify compounds likely to cause assay interference. However, because data for model development are scarce, the predictive accuracy and applicability of these approaches are limited. In this work, we present E-GuARD, a novel framework that seeks to address data scarcity and imbalance by integrating self-distillation, active learning, and expert-guided molecular generation. E-GuARD iteratively enriches the training data with interference-relevant molecules, resulting in quantitative structure-interference relationship (QSIR) models with superior performance. We demonstrate the utility of E-GuARD on four high-quality data sets covering thiol reactivity, redox reactivity, nanoluciferase inhibition, and firefly luciferase inhibition. Our models reached Matthews correlation coefficient (MCC) values of up to 0.47 on these data sets, with two-fold or higher improvements in enrichment factors compared to models trained without E-GuARD data augmentation. These results highlight the potential of E-GuARD as a scalable solution for mitigating assay interference in early drug discovery.

SCIENTIFIC CONTRIBUTION: We present E-GuARD, an innovative framework that combines iterative self-distillation with guided molecular augmentation to enhance the predictive performance of QSAR models. By allowing models to learn from newly generated, informative compounds over successive iterations, E-GuARD helps them capture underrepresented structural patterns and improves performance on unseen data. When applied across different interference mechanisms, E-GuARD consistently outperformed standard approaches. E-GuARD establishes a foundation for further research into dynamic data enrichment and more robust molecular modeling.
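The two metrics reported above, MCC and enrichment factor, can be illustrated with a minimal sketch. It is not taken from the E-GuARD code base. It assumes binary labels (1 = interfering compound), and it assumes the common definition of the enrichment factor as the hit rate among the top-ranked fraction of compounds divided by the overall hit rate; the paper may use a different fraction or definition. The function names, the `fraction` parameter, and the toy data are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def enrichment_factor(y_true, scores, fraction=0.1):
    """Hit rate in the top-scoring fraction divided by the overall hit rate.

    Assumed definition; `fraction` is an illustrative parameter.
    """
    n = len(y_true)
    total_hits = sum(y_true)
    if n == 0 or total_hits == 0:
        return float("nan")
    n_top = max(1, math.ceil(fraction * n))
    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_hits / n)


# Toy data (made up for illustration): 2 interfering compounds out of 10.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4, 0.15, 0.25]
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(round(mcc(labels, predictions), 3))
print(enrichment_factor(labels, scores, fraction=0.2))
```

MCC is a natural choice for imbalanced data sets such as these, because it accounts for all four cells of the confusion matrix, whereas the enrichment factor measures how strongly a ranking concentrates interfering compounds at the top of the list.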