Using discourse analysis to improve text categorization in MEDLINE.

Authors

Ruch Patrick, Geissbühler Antoine, Gobeill Julien, Lisacek Frederic, Tbahriti Imad, Veuthey Anne-Lise, Aronson Alan R

Affiliation

Medical Informatics Service, University and Hospital of Geneva, Geneva, Switzerland.

Publication

Stud Health Technol Inform. 2007;129(Pt 1):710-5.

Abstract

PROBLEM

Automatic keyword assignment has been extensively studied in medical informatics in the context of the MEDLINE database, both to support searching in MEDLINE and to provide an indicative "gist" of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been approached with different methods or combinations of methods, including machine learning (naïve Bayes, neural networks, etc.), linguistically motivated methods (syntactic parsing, semantic tagging), and information retrieval.

METHODS

In the present study, we evaluate whether the argumentative structure of scientific articles can improve the effectiveness of a categorizer that combines linguistically motivated and information retrieval methods. Our argumentative classifier, which uses representation levels inherited from the field of discourse analysis, classifies each sentence of an abstract into one of four classes: PURPOSE, METHODS, RESULTS, and CONCLUSION. For the evaluation, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the output of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking of the MeSH categorizer.
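The abstract does not say how the argumentative labels modify the ranking. The following is only a minimal sketch of one plausible scheme, assuming each candidate MeSH term has a base score and a list of the sentences that support it. The function name `rerank`, the data layout, and the weight values are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical weights for illustration only; these are NOT the values
# tuned in the study. METHODS is boosted strongly, RESULTS and
# CONCLUSION moderately, PURPOSE is left unchanged.
CLASS_WEIGHTS: Dict[str, float] = {
    "PURPOSE": 1.0,
    "METHODS": 2.0,
    "RESULTS": 1.25,
    "CONCLUSION": 1.25,
}


def rerank(
    base_scores: Dict[str, float],
    term_sentences: Dict[str, List[int]],
    sentence_classes: List[str],
    weights: Dict[str, float] = CLASS_WEIGHTS,
) -> List[Tuple[str, float]]:
    """Re-rank candidate MeSH terms using per-sentence argumentative labels.

    base_scores: score assigned to each term by the (assumed) MeSH categorizer.
    term_sentences: indices of the sentences that support each term.
    sentence_classes: argumentative class of each sentence in the abstract.
    """
    reranked: Dict[str, float] = {}
    for term, score in base_scores.items():
        supporting = term_sentences.get(term, [])
        if supporting:
            # Average the class weights of the supporting sentences.
            factor = sum(weights[sentence_classes[i]] for i in supporting) / len(supporting)
        else:
            # No sentence-level evidence: keep the base score unchanged.
            factor = 1.0
        reranked[term] = score * factor
    # Highest adjusted score first.
    return sorted(reranked.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    classes = ["PURPOSE", "METHODS", "RESULTS", "CONCLUSION"]
    scores = {"Hypertension": 0.60, "Double-Blind Method": 0.50, "Humans": 0.40}
    evidence = {"Hypertension": [0], "Double-Blind Method": [1], "Humans": []}
    for term, score in rerank(scores, evidence, classes):
        print(f"{term}: {score:.2f}")
```

In this toy input, a term supported only by a METHODS sentence overtakes a term with a higher base score that is supported only by a PURPOSE sentence, which is the kind of reordering the argumentative weighting is meant to produce.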

RESULTS

The most effective combination (+2%, p<0.003) strongly overweights the METHODS section and moderately overweights the RESULTS and CONCLUSION sections.

CONCLUSION

Although modest, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries.
