

Scaling up data curation using deep learning: An application to literature triage in genomic variation resources.

Affiliations

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America.

Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.

Publication information

PLoS Comput Biol. 2018 Aug 13;14(8):e1006390. doi: 10.1371/journal.pcbi.1006390. eCollection 2018 Aug.

DOI:10.1371/journal.pcbi.1006390
PMID:30102703
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6107285/
Abstract

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine-learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
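
The comparison described in the abstract (precision of a query-based triage versus a classifier that scores and ranks publications) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the document IDs, the scores, the 0.5 threshold, and the helper names `precision_recall` and `triage_by_score` are hypothetical, and the scores stand in for the output of a trained model rather than reproducing the authors' convolutional neural network.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a retrieved set against a gold-standard set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def triage_by_score(scores: Dict[str, float], threshold: float) -> List[str]:
    """Rank documents by descending score and keep those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score >= threshold]


# Toy gold standard: publications a curator would accept (hypothetical IDs).
relevant = {"d1", "d2", "d3", "d4"}

# Toy query-based triage: a broad keyword query returns many irrelevant hits
# and misses d4.
query_hits = {"d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"}

# Toy classifier scores (stand-ins for a trained model's output).
model_scores = {
    "d1": 0.95, "d2": 0.90, "d3": 0.80, "d4": 0.75, "d5": 0.60,
    "d6": 0.20, "d7": 0.10, "d8": 0.05, "d9": 0.02,
}

selected = triage_by_score(model_scores, threshold=0.5)
query_p, query_r = precision_recall(query_hits, relevant)
model_p, model_r = precision_recall(set(selected), relevant)

# Ratio of precisions, the kind of figure the abstract reports as "times higher".
ratio = model_p / query_p

print(f"query-based: precision={query_p:.3f}, recall={query_r:.3f}")
print(f"model-based: precision={model_p:.3f}, recall={model_r:.3f}")
print(f"precision ratio: {ratio:.2f}")
```

On this toy data the ranked list also recovers a relevant document that the query missed, which mirrors the abstract's remark about additional relevant publications; the numbers themselves carry no meaning beyond the illustration.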


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49fc/6107285/a79ffa2ef2a5/pcbi.1006390.g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/49fc/6107285/dc18b803285d/pcbi.1006390.g002.jpg
