Papyrus: a large-scale curated dataset aimed at bioactivity predictions.

Authors

Béquignon O J M, Bongers B J, Jespers W, IJzerman A P, van der Water B, van Westen G J P

Affiliation

Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands.

Publication information

J Cheminform. 2023 Jan 6;15(1):3. doi: 10.1186/s13321-022-00672-x.

DOI: 10.1186/s13321-022-00672-x
PMID: 36609528
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9824924/
Abstract

With the ongoing rapid growth of publicly available ligand-protein bioactivity data, there is a trove of valuable data that can be used to train a plethora of machine-learning algorithms. However, not all data is equal in terms of size and quality and a significant portion of researchers' time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own. To meet these challenges, we have constructed the Papyrus dataset. Papyrus is comprised of around 60 million data points. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways and also perform some examples of quantitative structure-activity relationship analyses and proteochemometric modelling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing an accessible data source for research.
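The filtering and normalisation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the Papyrus tooling or schema: the field names (`compound`, `target`, `value_nM`, `quality`), the quality labels, the identifiers, and the helper functions are all hypothetical and chosen only for illustration. The one fixed fact it relies on is the standard conversion of a nanomolar activity to a negative log10 molar value, 9 - log10(value in nM).

```python
import math

# Hypothetical bioactivity records; field names and labels are
# illustrative assumptions, not the actual Papyrus column names.
records = [
    {"compound": "C1", "target": "P29274", "value_nM": 10.0, "quality": "high"},
    {"compound": "C2", "target": "P29274", "value_nM": 2500.0, "quality": "low"},
    {"compound": "C3", "target": "P30542", "value_nM": 100.0, "quality": "high"},
    {"compound": "C4", "target": "P29274", "value_nM": 1000.0, "quality": "medium"},
]


def to_p_activity(value_nm):
    """Convert a nanomolar activity to -log10 of the molar value."""
    if value_nm <= 0:
        raise ValueError("activity must be positive")
    return 9.0 - math.log10(value_nm)


def filter_records(rows, target, allowed_quality):
    """Keep rows for one target and the allowed quality labels,
    adding a normalised 'p_activity' field to each kept row."""
    return [
        dict(row, p_activity=to_p_activity(row["value_nM"]))
        for row in rows
        if row["target"] == target and row["quality"] in allowed_quality
    ]


# Select one target and drop the low-quality measurements.
selected = filter_records(records, "P29274", {"high", "medium"})
for row in selected:
    print(row["compound"], round(row["p_activity"], 2))
```

Putting activities on a single logarithmic scale before filtering is what makes measurements from different sources comparable as regression targets. A real workflow would read the published dataset files rather than an in-memory list.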

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f36/9824924/362b5e13f4c2/13321_2022_672_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f36/9824924/66073159755f/13321_2022_672_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f36/9824924/b60c1caf3424/13321_2022_672_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f36/9824924/5a472b34de54/13321_2022_672_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f36/9824924/5976e7f7a0df/13321_2022_672_Fig4_HTML.jpg