LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening.

Affiliation

Laboratoire d'Innovation Thérapeutique, UMR 7200 CNRS-Université de Strasbourg, 67400 Illkirch, France.

Publication information

J Chem Inf Model. 2020 Sep 28;60(9):4263-4273. doi: 10.1021/acs.jcim.0c00155. Epub 2020 Apr 23.

Abstract

Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations from numerous research groups demonstrate that the artificially constructed ligand sets classically used by the community (e.g., DUD, DUD-E, MUV) suffer from both obvious and hidden chemical biases, and therefore lead to overestimating the true accuracy of virtual screening methods. We herewith present a novel data set (LIT-PCBA) specifically designed for virtual screening and machine learning. LIT-PCBA relies on 149 dose-response PubChem bioassays that were additionally processed to remove false positives and assay artifacts and to keep active and inactive compounds within similar molecular property ranges. To ascertain that the data set is suited to both ligand-based and structure-based virtual screening, target sets were restricted to single protein targets for which at least one X-ray structure is available in complex with ligands of the same phenotype (e.g., inhibitor, inverse agonist) as that of the PubChem active compounds. Preliminary virtual screening on the 21 remaining target sets with state-of-the-art orthogonal methods (2D fingerprint similarity, 3D shape similarity, molecular docking) enabled us to select 15 target sets for which at least one of the three screening methods is able to enrich the top 1%-ranked compounds in true actives by at least a factor of 2. The corresponding ligand sets (training, validation) were finally debiased with the recently described asymmetric validation embedding (AVE) procedure to afford the LIT-PCBA data set, consisting of 15 targets and 7844 confirmed active and 407,381 confirmed inactive compounds. The data set mimics experimental screening decks in terms of hit rate (ratio of active to inactive compounds) and potency distribution. It is available online at http://drugdesign.unistra.fr/LIT-PCBA for download and for benchmarking novel virtual screening methods, notably those relying on machine learning.
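
The selection criterion quoted in the abstract (enriching the top 1%-ranked compounds in true actives by at least a factor of 2) can be illustrated with a short Python sketch. It uses the common definition of the enrichment factor: the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole set. The function names, the convention that higher scores mean "more likely active", and the rounding used to size the top 1% are assumptions made for illustration; the abstract does not specify how the authors computed it.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    labels: Sequence[int],
    fraction: float = 0.01,
) -> float:
    """Enrichment factor in the top `fraction` of a ranked list.

    Assumptions (not taken from the paper): higher score = predicted more
    likely active, labels are 1 for active and 0 for inactive, and the
    top-ranked subset holds round(fraction * N) compounds (at least 1).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(labels)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")

    n_top = max(1, round(fraction * n_total))
    # Rank compounds from best to worst score and keep the top subset.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    actives_in_top = sum(label for _, label in ranked[:n_top])

    hit_rate_top = actives_in_top / n_top
    hit_rate_all = n_actives / n_total
    return hit_rate_top / hit_rate_all


def active_to_inactive_ratio(n_actives: int, n_inactives: int) -> float:
    """Hit rate as defined in the abstract: actives divided by inactives."""
    return n_actives / n_inactives


if __name__ == "__main__":
    # Overall ratio for the full data set, using the counts from the abstract.
    print(active_to_inactive_ratio(7844, 407_381))
```

Under this definition, a target set would pass the abstract's threshold when `enrichment_factor(scores, labels, 0.01) >= 2` for at least one of the three screening methods. With the counts given in the abstract, the overall active-to-inactive ratio is a little under 2%.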

