TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions.

Author affiliations

Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China.

State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China.

Publication information

J Med Chem. 2022 Jun 9;65(11):7918-7932. doi: 10.1021/acs.jmedchem.2c00460. Epub 2022 Jun 1.

Abstract

Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets used to develop MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Here, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to produce unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.
