Automatic document classification of biological literature.

Author information

Chen David, Müller Hans-Michael, Sternberg Paul W

Affiliation

Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, USA.

Publication information

BMC Bioinformatics. 2006 Aug 7;7:370. doi: 10.1186/1471-2105-7-370.

DOI: 10.1186/1471-2105-7-370

PMID: 16893465

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1559726/

Abstract

BACKGROUND

Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

RESULTS

We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first applies a classifier trained with a support vector machine, then a novel phrase-based clustering algorithm. The clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters 21578), the clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49) while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
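To make the idea of phrase-based clustering with self-generated labels more concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not specify; it only illustrates the general technique of grouping documents by a shared phrase and using that phrase as the cluster label. The function names (`phrases`, `phrase_clusters`, `f_value`), the choice of word bigrams, and the toy documents are all assumptions made for illustration. The F-value shown is the usual harmonic mean of precision and recall; the abstract does not say which variant was used for the Reuters comparison. The support vector machine step is omitted.

```python
from collections import defaultdict
import re


def phrases(text, n=2):
    """Return the set of word n-grams (bigrams by default) in a document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_clusters(docs, n=2, min_docs=2):
    """Group document ids by shared phrase; the phrase doubles as the label.

    A document may land in several clusters, one per phrase it shares
    with at least `min_docs - 1` other documents.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in phrases(text, n):
            index[phrase].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}


def f_value(cluster, truth):
    """Harmonic mean of precision and recall of one cluster against one class."""
    hits = len(cluster & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(cluster)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus, invented for this example.
docs = {
    "d1": "RNA interference screen in C. elegans",
    "d2": "An RNA interference protocol for nematodes",
    "d3": "Vulval development requires Ras signaling",
    "d4": "Ras signaling during vulval development",
}

clusters = phrase_clusters(docs)
for label, members in sorted(clusters.items()):
    print(label, sorted(members))

# Score a made-up cluster {d1, d2, d3} against a made-up true class {d1, d2}.
score = f_value({"d1", "d2", "d3"}, {"d1", "d2"})
print(round(score, 2))
```

Because the label is simply the phrase that the member documents share, each cluster comes with a human-readable description at no extra cost. A real system would also need to rank, merge, and prune overlapping clusters, which this sketch does not attempt.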

CONCLUSION

We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
