Saha Udit Surya, Vendruscolo Michele, Carpenter Anne E, Singh Shantanu, Bender Andreas, Seal Srijit
Department of Chemistry, University of Cambridge, UK.
Broad Institute of MIT and Harvard, Cambridge, MA, US.
bioRxiv. 2024 Jul 4:2024.07.02.601740. doi: 10.1101/2024.07.02.601740.
Recent advances in machine learning methods for materials science have significantly improved the accuracy of property predictions for novel materials. Here, we explore whether these advances can be adapted to drug discovery by addressing the problem of prospective validation, that is, the assessment of a method's performance on out-of-distribution data. First, we tested whether k-fold n-step forward cross-validation could improve the accuracy of out-of-distribution small molecule bioactivity predictions. We found that it is more helpful than conventional random split cross-validation in describing the accuracy of a model in real-world drug discovery settings. We also analyzed discovery yield and novelty error, finding that these two metrics provide an understanding of the applicability domain of models and an assessment of their ability to predict molecules with desirable bioactivity compared to other small molecules. Based on these results, we recommend incorporating k-fold n-step forward cross-validation and these metrics when building state-of-the-art models for bioactivity prediction in drug discovery.
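The abstract names k-fold n-step forward cross-validation without spelling out the procedure. One plausible reading is sketched below. It assumes the molecules are ranked by some ordering key, cut into k contiguous bins, and evaluated by training on the first bins and testing on a bin n steps ahead. The function name `step_forward_splits`, the `sort_values` key, and the binning rule are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterator, List, Sequence, Tuple


def step_forward_splits(
    sort_values: Sequence[float], k: int = 5, n_step: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for a sorted step-forward scheme.

    Assumption: samples are ranked by `sort_values` (a hypothetical ordering
    key, e.g. a date or a physicochemical property), cut into k contiguous
    bins, and each round trains on bins 0..i and tests on bin i + n_step.
    """
    n = len(sort_values)
    if k < 2 or not 1 <= n_step < k:
        raise ValueError("need k >= 2 and 1 <= n_step < k")
    if n < k:
        raise ValueError("need at least k samples")

    # Sample indices ordered from the lowest to the highest sort value.
    order = sorted(range(n), key=lambda i: sort_values[i])

    # Contiguous bins of near-equal size over the sorted order.
    bounds = [(j * n) // k for j in range(k + 1)]
    bins = [order[bounds[j]:bounds[j + 1]] for j in range(k)]

    for i in range(k - n_step):
        # Train on everything seen so far; test strictly ahead of it.
        train = [idx for b in bins[: i + 1] for idx in b]
        test = list(bins[i + n_step])
        yield train, test
```

Unlike a random split, every test bin here lies beyond the range of sort values seen in training, which is what makes the evaluation out-of-distribution with respect to the chosen key. With `k=5` and `n_step=1` the generator yields four train/test pairs, and the training set grows by one bin each round.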