Assessment of the Generalization Abilities of Machine-Learning Scoring Functions for Structure-Based Virtual Screening.

Author affiliations

Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China.

National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China.

Publication information

J Chem Inf Model. 2022 Nov 28;62(22):5485-5502. doi: 10.1021/acs.jcim.2c01149. Epub 2022 Oct 21.

DOI:10.1021/acs.jcim.2c01149

PMID:36268980

Abstract

In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Subsequently, 12 typical MLSFs were evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV) methods. Surprisingly, the performance of every tested model decreased from Random-CV to Seq-CV to Pfam-CV, and none showed satisfactory generalization capacity. Our interpretability analysis suggested that MLSF predictions on novel targets depended on buried solvent-accessible surface area (SASA)-related features of the complex structures, with higher predicted binding affinities for complexes with larger protein-ligand interfaces. By combining buried SASA-related features with target-specific patterns that were shared only among structurally similar compounds in the same cluster, the random forest (RF)-Score attained good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious with the features learned by MLSFs.
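The difference between a random split and a cluster-based split can be illustrated with a minimal sketch. This is not the authors' code: the function names `random_folds` and `cluster_folds`, the greedy fold-balancing rule, and the placeholder cluster labels are all assumptions made for illustration, and how a complex is assigned to a pocket Pfam cluster is not covered here. The only point shown is that in a cluster-based split every complex from a given cluster lands in the same fold, so no test cluster is ever seen during training.

```python
import random
from collections import defaultdict


def random_folds(n_samples, n_folds, seed=0):
    """Random-CV style split: shuffle sample indices and deal them into folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::n_folds]) for i in range(n_folds)]


def cluster_folds(cluster_ids, n_folds):
    """Cluster-based split: all complexes sharing a cluster label go to one fold."""
    members = defaultdict(list)
    for i, label in enumerate(cluster_ids):
        members[label].append(i)
    folds = [[] for _ in range(n_folds)]
    # Assumed balancing rule: place the largest clusters first,
    # each into whichever fold is currently smallest.
    for label in sorted(members, key=lambda c: (-len(members[c]), c)):
        min(folds, key=len).extend(members[label])
    return [sorted(f) for f in folds]


# Hypothetical toy data: one placeholder cluster label per complex.
clusters = ["A", "A", "A", "B", "B", "C", "C", "D"]

for test_idx in cluster_folds(clusters, n_folds=3):
    test_labels = {clusters[i] for i in test_idx}
    train_labels = {clusters[i] for i in range(len(clusters)) if i not in test_idx}
    # No cluster appears on both sides of the split.
    assert not (test_labels & train_labels)
```

With `random_folds`, complexes from the same cluster can fall on both sides of a split, which is the situation the abstract associates with the optimistic Random-CV results.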
