Exploring supervised and unsupervised methods to detect topics in biomedical text.

Author information

Lee Minsuk, Wang Weiqing, Yu Hong

Affiliation

Department of Biomedical Informatics, Columbia University, 622 West 168th Street, VC-5, New York, NY 10032, USA.

Publication information

BMC Bioinformatics. 2006 Mar 16;7:140. doi: 10.1186/1471-2105-7-140.

Abstract

BACKGROUND

Topic detection is the task of automatically identifying topics (e.g., "biochemistry" and "protein structure") in scientific articles based on their information content. Topic detection will benefit many other natural language processing tasks, including information retrieval, text summarization, and question answering, and it is a necessary step towards building an information system that gives biologists an efficient way to find information in an ocean of literature.

RESULTS

We explored two approaches: Topic Spotting, a text categorization task that applies the supervised machine-learning technique naïve Bayes to automatically assign a document to one or more predefined topics; and Topic Clustering, which applies unsupervised hierarchical clustering algorithms to aggregate documents into clusters such that each cluster represents a topic. We applied both methods to detect the topics of more than fifteen thousand articles that represent over sixteen thousand entries in the Online Mendelian Inheritance in Man (OMIM) database. As features we used bag of words and, in addition, semantic features: the Medical Subject Headings (MeSH) assigned to the MEDLINE records and the Unified Medical Language System (UMLS) semantic types that correspond to those MeSH terms. Our results indicate that incorporating the MeSH terms and the UMLS semantic types as additional features enhances the performance of topic detection, and that naïve Bayes has the highest accuracy, 66.4%, for predicting the topic of an OMIM article as one of twenty-five topics.
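
The Topic Spotting setup described above can be illustrated with a minimal sketch of a multinomial naïve Bayes classifier over bag-of-words features extended with MeSH terms. This is not the authors' implementation: the function names (features, train, predict), the "MESH=" feature prefix, the add-one smoothing, and the four toy documents are all assumptions made for illustration, and the sketch assigns a single topic per document.

```python
import math
from collections import Counter, defaultdict

def features(text, mesh_terms=()):
    """Bag-of-words tokens plus MeSH terms as extra, prefixed features.

    The "MESH=" prefix is a hypothetical convention that keeps semantic
    features separate from ordinary words.
    """
    tokens = text.lower().split()
    return tokens + ["MESH=" + m.lower() for m in mesh_terms]

def train(docs):
    """docs: list of (feature_list, topic) pairs. Returns the count tables."""
    topic_counts = Counter()            # number of documents per topic
    feat_counts = defaultdict(Counter)  # feature counts per topic
    vocab = set()
    for feats, topic in docs:
        topic_counts[topic] += 1
        feat_counts[topic].update(feats)
        vocab.update(feats)
    return topic_counts, feat_counts, vocab

def predict(model, feats):
    """Return the topic with the highest log posterior (add-one smoothing)."""
    topic_counts, feat_counts, vocab = model
    n_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, n in topic_counts.items():
        total = sum(feat_counts[topic].values())
        score = math.log(n / n_docs)    # log prior
        for f in feats:
            if f in vocab:              # ignore features never seen in training
                score += math.log((feat_counts[topic][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Toy, made-up training data (not from OMIM or MEDLINE).
training = [
    (features("protein folds into a helix structure", ["Protein Conformation"]), "protein structure"),
    (features("crystal structure of the protein domain", ["Protein Conformation"]), "protein structure"),
    (features("enzyme catalyzes the metabolic reaction", ["Enzymes"]), "biochemistry"),
    (features("metabolic pathway and enzyme kinetics", ["Enzymes"]), "biochemistry"),
]
model = train(training)
predicted = predict(model, features("kinetics of a metabolic enzyme", ["Enzymes"]))
print(predicted)
```

Dropping the mesh_terms argument gives the plain bag-of-words baseline, so the same functions can be used to compare feature sets on a labelled collection.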

CONCLUSION

Our results indicate that the supervised topic spotting methods outperformed unsupervised topic clustering; on the other hand, the unsupervised topic clustering methods have the advantage of being robust and applicable in real-world settings.
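
For contrast with the supervised approach, the unsupervised Topic Clustering idea can be sketched as agglomerative hierarchical clustering that needs no labelled training data. The abstract does not state which linkage or similarity measure was used, so the average linkage, the cosine similarity over raw term counts, the function names, and the toy documents below are assumptions for illustration only.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the documents of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)

def cluster(texts, n_clusters):
    """Merge the two most similar clusters until n_clusters remain."""
    vectors = [vectorize(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with one document per cluster
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_link(clusters[x], clusters[y], vectors)
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Toy, made-up documents (not from OMIM or MEDLINE).
texts = [
    "protein helix structure",
    "protein crystal structure",
    "enzyme metabolic reaction",
    "metabolic enzyme kinetics",
]
groups = cluster(texts, 2)
print(groups)
```

Each resulting group of document indices stands for one candidate topic; unlike the classifier, the clusters carry no topic names, which have to be assigned afterwards.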

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6afd/1472693/dd275f4f59a1/1471-2105-7-140-1.jpg