Automatic document classification of biological literature.

Authors

Chen David, Müller Hans-Michael, Sternberg Paul W

Affiliation

Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, USA.

Publication

BMC Bioinformatics. 2006 Aug 7;7:370. doi: 10.1186/1471-2105-7-370.

Abstract

BACKGROUND

Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

RESULTS

We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first applies a classifier trained with a support vector machine, then a novel, phrase-based clustering algorithm. The clustering step autonomously creates cluster labels that are descriptive and understandable to humans. On a standard test set (Reuters-21578), the clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
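The abstract does not spell out how the phrase-based clustering step works, so the following is only a minimal sketch of the general idea that a phrase shared by several documents can serve as both the grouping criterion and a human-readable cluster label. It is not the authors' algorithm: the function names (`ngrams`, `phrase_clusters`), the input format, the choice of two-word phrases, and the `min_docs` threshold are all assumptions made for illustration, and the support vector machine step is omitted entirely.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return the set of contiguous n-word phrases in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrase_clusters(docs, n=2, min_docs=2):
    """Group documents that share a phrase; the phrase doubles as the label.

    docs: dict mapping a document id to its text (hypothetical input format).
    Returns a dict mapping each phrase label to a sorted list of document ids.
    """
    phrase_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for phrase in ngrams(tokens, n):
            phrase_to_docs[phrase].add(doc_id)
    # Keep only phrases that occur in at least min_docs documents.
    return {
        phrase: sorted(ids)
        for phrase, ids in phrase_to_docs.items()
        if len(ids) >= min_docs
    }


# Toy documents invented for this example, not taken from the paper's corpus.
docs = {
    "d1": "vulval development requires lin-3 signaling",
    "d2": "lin-3 signaling controls vulval development",
    "d3": "dauer formation depends on insulin signaling",
}
clusters = phrase_clusters(docs)
for label, members in sorted(clusters.items()):
    print(label, members)
```

In this toy run, documents `d1` and `d2` end up under the labels "lin-3 signaling" and "vulval development", while `d3` shares no two-word phrase with the others and stays unclustered. A real system would also need to merge overlapping clusters and rank candidate labels, which this sketch does not attempt.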

CONCLUSION

We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/187d/1559726/aee47ca52211/1471-2105-7-370-1.jpg
