LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening.

Affiliation

Laboratoire d'Innovation Thérapeutique, UMR 7200 CNRS-Université de Strasbourg, 67400 Illkirch, France.

Publication information

J Chem Inf Model. 2020 Sep 28;60(9):4263-4273. doi: 10.1021/acs.jcim.0c00155. Epub 2020 Apr 23.

DOI:10.1021/acs.jcim.0c00155

Abstract

Comparative evaluation of virtual screening methods requires rigorous benchmarking on diverse, realistic, and unbiased data sets. Recent investigations from numerous research groups demonstrate that the artificially constructed ligand sets classically used by the community (e.g., DUD, DUD-E, MUV) suffer from both obvious and hidden chemical biases, and therefore lead to overestimating the true accuracy of virtual screening methods. We present a novel data set (LIT-PCBA) specifically designed for virtual screening and machine learning. LIT-PCBA relies on 149 dose-response PubChem bioassays that were further processed to remove false positives and assay artifacts and to keep active and inactive compounds within similar molecular property ranges. To ensure that the data set is suited to both ligand-based and structure-based virtual screening, target sets were restricted to single protein targets for which at least one X-ray structure is available in complex with ligands of the same phenotype (e.g., inhibitor, inverse agonist) as that of the PubChem active compounds. Preliminary virtual screening on the 21 remaining target sets with state-of-the-art orthogonal methods (2D fingerprint similarity, 3D shape similarity, molecular docking) enabled us to select 15 target sets for which at least one of the three screening methods enriches the top 1% of ranked compounds in true actives by at least a factor of 2. The corresponding ligand sets (training, validation) were finally unbiased with the recently described asymmetric validation embedding (AVE) procedure to afford the LIT-PCBA data set, consisting of 15 targets with 7844 confirmed active and 407,381 confirmed inactive compounds. The data set mimics experimental screening decks in terms of hit rate (ratio of active to inactive compounds) and potency distribution. It is available online at http://drugdesign.unistra.fr/LIT-PCBA for download and for benchmarking novel virtual screening methods, notably those relying on machine learning.
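The two quantitative criteria in the abstract, enrichment of the top 1% of ranked compounds and the hit rate, can be illustrated with a minimal Python sketch. It assumes one common definition of the enrichment factor (fraction of actives in the top-ranked subset divided by the fraction of actives in the whole set) and assumes that higher scores mean better-ranked compounds; the abstract does not spell out the exact formula the authors used. The function names and the toy data are hypothetical and serve only as illustration.

```python
def enrichment_factor(scores, labels, top_fraction=0.01):
    """Enrichment factor in the top-ranked fraction of a screened library.

    Assumed definition (not taken from the paper):
        EF = (actives in top subset / size of top subset)
             / (actives in whole set / size of whole set)
    scores: one score per compound, higher = ranked earlier (assumption).
    labels: 1 for a confirmed active, 0 for a confirmed inactive.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Size of the top-ranked subset, never smaller than one compound.
    n_top = max(1, int(round(n_total * top_fraction)))
    # Rank compounds by decreasing score and count actives in the top subset.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (n_actives / n_total)


def hit_rate(n_actives, n_inactives):
    """Hit rate as described in the abstract: ratio of actives to inactives."""
    return n_actives / n_inactives


# Toy library (hypothetical): 200 compounds, 10 actives.
# Two actives receive the two best scores, the other eight the worst scores.
toy_scores = list(range(200))
toy_labels = [0] * 200
for index in (198, 199):
    toy_labels[index] = 1
for index in range(8):
    toy_labels[index] = 1

toy_ef = enrichment_factor(toy_scores, toy_labels, top_fraction=0.01)
print(f"EF1% on the toy library: {toy_ef:.1f}")

# Overall counts reported in the abstract for the full data set.
overall = hit_rate(7844, 407381)
print(f"Active-to-inactive ratio: {overall:.4f}")
```

Under this definition, a target set would pass the selection criterion described above when at least one screening method yields an enrichment factor of 2 or more at `top_fraction=0.01`.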

Similar articles

MILCDock: Machine Learning Enhanced Consensus Docking for Virtual Screening in Drug Discovery.

J Chem Inf Model. 2022 Nov 28;62(22):5342-5350. doi: 10.1021/acs.jcim.2c00705. Epub 2022 Nov 7.

TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions.

J Med Chem. 2022 Jun 9;65(11):7918-7932. doi: 10.1021/acs.jmedchem.2c00460. Epub 2022 Jun 1.

FRAGSITE: A Fragment-Based Approach for Virtual Ligand Screening.

J Chem Inf Model. 2021 Apr 26;61(4):2074-2089. doi: 10.1021/acs.jcim.0c01160. Epub 2021 Mar 16.

True Accuracy of Fast Scoring Functions to Predict High-Throughput Screening Data from Docking Poses: The Simpler the Better.

J Chem Inf Model. 2021 Jun 28;61(6):2788-2797. doi: 10.1021/acs.jcim.1c00292. Epub 2021 Jun 10.

Docking Score ML: Target-Specific Machine Learning Models Improving Docking-Based Virtual Screening in 155 Targets.

J Chem Inf Model. 2024 Jul 22;64(14):5413-5426. doi: 10.1021/acs.jcim.4c00072. Epub 2024 Jul 3.

Accuracy or novelty: what can we gain from target-specific machine-learning-based scoring functions in virtual screening?

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbaa410.

Protein-Ligand Docking in the Machine-Learning Era.

Molecules. 2022 Jul 18;27(14):4568. doi: 10.3390/molecules27144568.

Toward a benchmarking data set able to evaluate ligand- and structure-based virtual screening using public HTS data.

J Chem Inf Model. 2015 Feb 23;55(2):343-53. doi: 10.1021/ci5005465. Epub 2015 Jan 28.

Accurate prediction of protein-ligand interactions by combining physical energy functions and graph-neural networks.

J Cheminform. 2024 Nov 4;16(1):121. doi: 10.1186/s13321-024-00912-2.

Cited by

Multiscale topology-enabled structure-to-sequence transformer for protein-ligand interaction predictions.

Nat Mach Intell. 2024 Jul;6(7):799-810. doi: 10.1038/s42256-024-00855-1. Epub 2024 Jun 21.

Benchmarking 3D Structure-Based Molecule Generators.

J Chem Inf Model. 2025 Aug 11;65(15):8006-8021. doi: 10.1021/acs.jcim.5c01020. Epub 2025 Jul 25.

HERGAI: an artificial intelligence tool for structure-based prediction of hERG inhibitors.

J Cheminform. 2025 Jul 24;17(1):110. doi: 10.1186/s13321-025-01063-8.

Simpatico: accurate and ultra-fast virtual drug screening with atomic embeddings.

bioRxiv. 2025 Jun 8:2025.06.08.658499. doi: 10.1101/2025.06.08.658499.

ColdstartCPI: Induced-fit theory-guided DTI predictive model with improved generalization performance.

Nat Commun. 2025 Jul 11;16(1):6436. doi: 10.1038/s41467-025-61745-7.

Generative Deep Learning for de Novo Drug Design: A Chemical Space Odyssey.

J Chem Inf Model. 2025 Jul 28;65(14):7352-7372. doi: 10.1021/acs.jcim.5c00641. Epub 2025 Jul 9.

Multimodal fusion with relational learning for molecular property prediction.

Commun Chem. 2025 Jul 5;8(1):200. doi: 10.1038/s42004-025-01586-z.

AI-Driven Drug Discovery: A Comprehensive Review.

ACS Omega. 2025 Jun 6;10(23):23889-23903. doi: 10.1021/acsomega.5c00549. eCollection 2025 Jun 17.

Advancing active compound discovery for novel drug targets: insights from AI-driven approaches.

Acta Pharmacol Sin. 2025 Jun 17. doi: 10.1038/s41401-025-01591-x.

Assessing the Robustness and Scalability of Machine Learning Methods to Accelerate Ultralarge High-Throughput Docking Campaigns.

ACS Omega. 2025 Apr 7;10(15):15598-15609. doi: 10.1021/acsomega.5c00829. eCollection 2025 Apr 22.
