A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.

Affiliations

Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark.

Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

Publication information

PLoS Comput Biol. 2018 Feb 15;14(2):e1005962. doi: 10.1371/journal.pcbi.1005962. eCollection 2018 Feb.

DOI:10.1371/journal.pcbi.1005962

PMID:29447159

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5831415/

Abstract

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
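
The abstract does not describe the extraction pipeline in detail, so the following is only a toy sketch of the general idea it names: tag entities with a dictionary, collect pairs that co-occur in a sentence, and score them against a gold standard. The dictionary, the example sentences, the gold set, and all function names are illustrative assumptions, not the authors' system or data.

```python
import re
from itertools import combinations

# Hypothetical toy dictionary mapping surface forms to entity identifiers.
DICTIONARY = {"tp53": "TP53", "p53": "TP53", "mdm2": "MDM2", "brca1": "BRCA1"}


def tag_entities(sentence):
    """Return the set of dictionary entities mentioned in one sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return {DICTIONARY[t] for t in tokens if t in DICTIONARY}


def cooccurring_pairs(text):
    """Collect unordered entity pairs that share a sentence."""
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = sorted(tag_entities(sentence))
        pairs.update(combinations(entities, 2))
    return pairs


def precision_recall(predicted, gold):
    """Score predicted pairs against a gold-standard set of pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: the "full text" contains one association the abstract omits.
abstract = "We show that p53 is regulated by MDM2."
full_text = abstract + " In addition, BRCA1 binds TP53 in damaged cells."
gold = {("BRCA1", "TP53"), ("MDM2", "TP53")}

abstract_pairs = cooccurring_pairs(abstract)
full_text_pairs = cooccurring_pairs(full_text)

print("abstract :", precision_recall(abstract_pairs, gold))
print("full text:", precision_recall(full_text_pairs, gold))
```

In this contrived case the full text recovers both gold pairs while the abstract recovers only one, which mirrors the direction of the reported result but says nothing about its size.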
