Using discourse analysis to improve text categorization in MEDLINE

Ruch Patrick, Geissbühler Antoine, Gobeill Julien, Lisacek Frederic, Tbahriti Imad, Veuthey Anne-Lise, Aronson Alan R

Medical Informatics Service, University and Hospital of Geneva, Geneva, Switzerland.

Stud Health Technol Inform. 2007;129(Pt 1):710-5.

PROBLEM

Automatic keyword assignment has been widely studied in medical informatics in the context of the MEDLINE database, both to support searching MEDLINE and to provide an indicative "gist" of an article's content. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been approached with different methods or combinations of methods, including machine learning (naïve Bayes, neural networks, etc.), linguistically motivated methods (syntactic parsing, semantic tagging), and information retrieval.

METHODS

In the present study, we evaluate whether the argumentative structure of scientific articles can improve the effectiveness of a categorizer that combines linguistically motivated and information retrieval methods. Our argumentative classifier, which uses representation levels inherited from the field of discourse analysis, classifies each sentence of an abstract into one of four classes: PURPOSE, METHODS, RESULTS, and CONCLUSION. For the evaluation, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the output of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking of the MeSH categorizer.
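
The re-ranking step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rerank`, the multiplicative boost, the rule of taking the highest-weighted sentence in which a term appears, and the weight values are all assumptions made for illustration, since the abstract does not specify how the argumentative labels are combined with the base scores.

```python
# Hypothetical per-class weights; the abstract does not report the actual values.
SECTION_WEIGHTS = {
    "PURPOSE": 1.0,
    "METHODS": 2.0,
    "RESULTS": 1.5,
    "CONCLUSION": 1.5,
}


def rerank(base_scores, sentence_labels, sentence_terms, weights=SECTION_WEIGHTS):
    """Re-rank candidate MeSH terms using argumentative sentence labels.

    base_scores:     dict mapping each candidate term to the base categorizer score
    sentence_labels: argumentative class of each sentence, in order
    sentence_terms:  for each sentence, the set of candidate terms it supports
    Returns the candidate terms sorted by adjusted score, best first.
    """
    boost = {}
    for label, terms in zip(sentence_labels, sentence_terms):
        weight = weights.get(label, 1.0)
        for term in terms:
            # Assumption: a term keeps the largest weight among its sentences.
            boost[term] = max(boost.get(term, 0.0), weight)

    # Terms that no sentence supports keep their original score (boost of 1.0).
    adjusted = {t: s * boost.get(t, 1.0) for t, s in base_scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Toy example with made-up scores and labels.
base = {"Neoplasms": 0.50, "Mice": 0.45, "Algorithms": 0.40}
labels = ["PURPOSE", "METHODS", "RESULTS"]
terms = [{"Neoplasms"}, {"Algorithms"}, {"Mice"}]
ranking = rerank(base, labels, terms)
print(ranking)
```

In this toy run, the term supported by the METHODS sentence moves from last to first place, which shows how overweighting one argumentative class can change the original ranking.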

RESULTS

The most effective combination (+2%, p<0.003) strongly overweights the METHODS section and moderately overweights the RESULTS and CONCLUSION sections.

CONCLUSION

Although modest, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries.

Similar articles

Using discourse analysis to improve text categorization in MEDLINE.

Stud Health Technol Inform. 2007;129(Pt 1):710-5.

Using argumentation to retrieve articles with similar citations: an inquiry into improving related articles search in the MEDLINE digital library.

Int J Med Inform. 2006 Jun;75(6):488-95. doi: 10.1016/j.ijmedinf.2005.06.007. Epub 2005 Sep 13.

Automatic assignment of biomedical categories: toward a generic approach.

Bioinformatics. 2006 Mar 15;22(6):658-64. doi: 10.1093/bioinformatics/bti783. Epub 2005 Nov 15.

Using argumentation to extract key sentences from biomedical abstracts.

Int J Med Inform. 2007 Feb-Mar;76(2-3):195-200. doi: 10.1016/j.ijmedinf.2006.05.002. Epub 2006 Jul 11.

Exploring supervised and unsupervised methods to detect topics in biomedical text.

BMC Bioinformatics. 2006 Mar 16;7:140. doi: 10.1186/1471-2105-7-140.

Ranking the whole MEDLINE database according to a large training set using text indexing.

BMC Bioinformatics. 2005 Mar 24;6:75. doi: 10.1186/1471-2105-6-75.

Enhancing the MeSH thesaurus to retrieve French online health resources in a quality-controlled gateway.

Health Info Libr J. 2004 Dec;21(4):253-61. doi: 10.1111/j.1471-1842.2004.00526.x.

Reflective random indexing for semi-automatic indexing of the biomedical literature.

J Biomed Inform. 2010 Oct;43(5):694-700. doi: 10.1016/j.jbi.2010.04.001. Epub 2010 Apr 9.

Combination of text-mining algorithms increases the performance.

Bioinformatics. 2006 Sep 1;22(17):2151-7. doi: 10.1093/bioinformatics/btl281. Epub 2006 Jun 9.

The NLM Indexing Initiative's Medical Text Indexer.

Stud Health Technol Inform. 2004;107(Pt 1):268-72.

Cited by

From episodes of care to diagnosis codes: automatic text categorization for medico-economic encoding.

AMIA Annu Symp Proc. 2008 Nov 6;2008:636-40.
