TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions.

Affiliations

Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China.

State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China.

Publication information

J Med Chem. 2022 Jun 9;65(11):7918-7932. doi: 10.1021/acs.jmedchem.2c00460. Epub 2022 Jun 1.

DOI:10.1021/acs.jmedchem.2c00460

PMID:35642777

Abstract

Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.

Similar articles

Topology-Based and Conformation-Based Decoys Database: An Unbiased Online Database for Training and Benchmarking Machine-Learning Scoring Functions.

J Med Chem. 2023 Jul 13;66(13):9174-9183. doi: 10.1021/acs.jmedchem.3c00801. Epub 2023 Jun 14.

Accuracy or novelty: what can we gain from target-specific machine-learning-based scoring functions in virtual screening?

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbaa410.

LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening.

J Chem Inf Model. 2020 Sep 28;60(9):4263-4273. doi: 10.1021/acs.jcim.0c00155. Epub 2020 Apr 23.

ML-PLIC: a web platform for characterizing protein-ligand interactions and developing machine learning-based scoring functions.

Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad295.

Beware of the generic machine learning-based scoring functions in structure-based virtual screening.

Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa070.

MILCDock: Machine Learning Enhanced Consensus Docking for Virtual Screening in Drug Discovery.

J Chem Inf Model. 2022 Nov 28;62(22):5342-5350. doi: 10.1021/acs.jcim.2c00705. Epub 2022 Nov 7.

TB-IECS: an accurate machine learning-based scoring function for virtual screening.

J Cheminform. 2023 Jul 4;15(1):63. doi: 10.1186/s13321-023-00731-x.

Data-augmented machine learning scoring functions for virtual screening of YTHDF1 m6A reader protein.

Comput Biol Med. 2024 Dec;183:109268. doi: 10.1016/j.compbiomed.2024.109268. Epub 2024 Oct 12.

Improving structure-based virtual screening performance via learning from scoring function components.

Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa094.

Cited by

SurfDock is a surface-informed diffusion generative model for reliable and accurate protein-ligand complex prediction.

Nat Methods. 2025 Feb;22(2):310-322. doi: 10.1038/s41592-024-02516-y. Epub 2024 Nov 27.

Integrated Molecular Modeling and Machine Learning for Drug Design.

J Chem Theory Comput. 2023 Nov 14;19(21):7478-7495. doi: 10.1021/acs.jctc.3c00814. Epub 2023 Oct 26.

Open-Source Machine Learning in Computational Chemistry.

J Chem Inf Model. 2023 Aug 14;63(15):4505-4532. doi: 10.1021/acs.jcim.3c00643. Epub 2023 Jul 19.

TB-IECS: an accurate machine learning-based scoring function for virtual screening.

J Cheminform. 2023 Jul 4;15(1):63. doi: 10.1186/s13321-023-00731-x.

Comprehensive Survey of Consensus Docking for High-Throughput Virtual Screening.

Molecules. 2022 Dec 25;28(1):175. doi: 10.3390/molecules28010175.
