Applied Biotechnology Research Center, Baqiyatallah University of Medical Sciences, Tehran, Iran.
Mol Inform. 2024 Apr;43(4):e202300292. doi: 10.1002/minf.202300292. Epub 2024 Feb 15.
When designing a machine learning-based scoring function, we have access to only a limited number of protein-ligand complexes with experimentally determined binding affinities, which represent only a fraction of all possible protein-ligand complexes. Consequently, it is crucial to report a measure of confidence and to quantify the uncertainty of the model's predictions at test time. Here, we adopt the conformal prediction technique to evaluate the confidence of the prediction for each member of the core set of the CASF 2016 benchmark. The conformal prediction technique requires a diverse ensemble of predictors for uncertainty estimation. To this end, we introduce ENS-Score, an ensemble predictor comprising 30 models with different protein-ligand representation approaches, which achieves a Pearson's correlation of 0.842 on the core set of the CASF 2016 benchmark. We also comprehensively investigate the residual error of each data point to assess the normality of the residual error distribution and its correlation with structural features of the ligands, such as hydrophobic interactions and halogen bonding. Finally, we provide a locally hosted web application to facilitate the use of ENS-Score. All code needed to reproduce the results is available at https://github.com/miladrayka/ENS_Score.
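To make the idea of ensemble-based conformal prediction concrete, the following is a minimal sketch of one common variant: split conformal regression in which the absolute error is normalized by the spread of the ensemble. It is an illustration under stated assumptions, not the ENS-Score implementation. The abstract does not specify the nonconformity score, so the normalized score, the function names `simulate` and `conformal_intervals`, the miscoverage level `alpha`, and the synthetic affinities are all hypothetical choices made for this example.

```python
import numpy as np


def simulate(n_samples, n_models, rng):
    """Synthetic stand-in for ensemble predictions (NOT real CASF 2016 data)."""
    y = rng.uniform(2.0, 12.0, size=n_samples)            # made-up affinities
    difficulty = rng.uniform(0.3, 1.5, size=n_samples)    # per-complex noise scale
    shared_error = difficulty * rng.normal(size=n_samples)
    model_error = difficulty * rng.normal(size=(n_models, n_samples))
    preds = y + shared_error + model_error                # shape (n_models, n_samples)
    return preds, y


def conformal_intervals(cal_preds, cal_y, test_preds, alpha=0.1, eps=1e-8):
    """Split conformal intervals normalized by ensemble standard deviation.

    cal_preds and test_preds have shape (n_models, n_samples).
    """
    cal_mean = cal_preds.mean(axis=0)
    cal_std = cal_preds.std(axis=0)
    # Nonconformity score: absolute error scaled by ensemble disagreement.
    scores = np.abs(cal_y - cal_mean) / (cal_std + eps)

    n = scores.size
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    q = np.sort(scores)[k - 1]

    test_mean = test_preds.mean(axis=0)
    test_std = test_preds.std(axis=0)
    half_width = q * (test_std + eps)
    return test_mean - half_width, test_mean + half_width


rng = np.random.default_rng(0)
cal_preds, cal_y = simulate(500, 30, rng)     # held-out calibration set
test_preds, test_y = simulate(500, 30, rng)   # plays the role of a test set

lower, upper = conformal_intervals(cal_preds, cal_y, test_preds, alpha=0.1)
coverage = np.mean((test_y >= lower) & (test_y <= upper))
print(f"empirical coverage at alpha=0.1: {coverage:.3f}")
```

When calibration and test data are exchangeable, intervals built this way cover the true value with probability at least 1 - alpha on average, and complexes on which the ensemble members disagree more receive wider intervals. Dividing by the ensemble standard deviation is one design choice among several; an unnormalized absolute-error score would give intervals of constant width.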