Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets.

Authors

Kanakala Ganesh Chandan, Aggarwal Rishal, Nayar Divya, Priyakumar U Deva

Affiliations

International Institute of Information Technology, Hyderabad 500 032, India.

Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.

Publication information

ACS Omega. 2023 Jan 5;8(2):2389-2397. doi: 10.1021/acsomega.2c06781. eCollection 2023 Jan 17.

Abstract

Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein-ligand interaction scoring functions that evaluate the binding ability of various molecules or ligands with a given protein receptor binding pocket reasonably accurately. With the publicly available protein-ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models is optimistic, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein-ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide to the community several pointers for the design of binding affinity predictors and data sets for reliable applicability.
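To illustrate what a similarity-aware data split can look like, here is a minimal sketch. It is not the authors' procedure: the `similarity` function uses Python's `difflib.SequenceMatcher` as a crude stand-in for sequence identity, and the 0.8 threshold, the function names, and the toy sequences are all assumptions made for illustration. The idea shown is only that similar proteins are clustered first and whole clusters are then assigned to either the training or the test set, so that no test protein has a close relative in training.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (an assumption, not the paper's metric)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_by_similarity(seqs, threshold=0.8):
    """Single-linkage clustering via union-find: link every pair at or above threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def clustered_split(seqs, test_fraction=0.3, threshold=0.8):
    """Return (train_indices, test_indices) without ever splitting a cluster."""
    labels = cluster_by_similarity(seqs, threshold)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    # Fill the test set with the smallest clusters first, up to the target size.
    test, target = [], test_fraction * len(seqs)
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)

    test_set = set(test)
    train = [i for i in range(len(seqs)) if i not in test_set]
    return train, test


# Toy sequences (hypothetical): two near-duplicate pairs and one unrelated sequence.
toy_seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSGGGGS", "GGGGSGGGGT", "PLVWERTYHN"]
train_idx, test_idx = clustered_split(toy_seqs, test_fraction=0.4, threshold=0.8)
```

A random split of the same toy data could place one member of a near-duplicate pair in training and the other in the test set, which is the kind of leakage the abstract warns about. In practice the similarity measure would come from a dedicated sequence, interaction, or pocket-structure comparison tool rather than from string matching.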
