Ruch Patrick, Geissbühler Antoine, Gobeill Julien, Lisacek Frederic, Tbahriti Imad, Veuthey Anne-Lise, Aronson Alan R
Medical Informatics Service, University and Hospital of Geneva, Geneva, Switzerland.
Stud Health Technol Inform. 2007;129(Pt 1):710-5.
Automatic keyword assignment has been extensively studied in medical informatics in the context of the MEDLINE database, both to support searching in MEDLINE and to provide an indicative "gist" of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been approached with different methods or combinations of methods, including machine learning (naïve Bayes, neural networks, etc.), linguistically motivated methods (syntactic parsing, semantic tagging), and information retrieval.
In the present study, we evaluate whether the argumentative structure of scientific articles can improve the effectiveness of a categorizer that combines linguistically motivated and information retrieval methods. Our argumentative classifier, which uses representation levels inherited from the field of discourse analysis, assigns each sentence of an abstract to one of four classes: PURPOSE, METHODS, RESULTS, and CONCLUSION. For the evaluation, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the output of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking produced by the MeSH categorizer.
The most effective combination (+2%, p<0.003) strongly overweights the METHODS section and moderately overweights the RESULTS and CONCLUSION sections.
Although modest, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries.
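The reweighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the argumentative labels modify the ranking, so this sketch assumes that each sentence carries its own per-term scores and that a class weight simply scales them. The function name `rerank_mesh_terms`, the weight values, and the example scores are all hypothetical; only the direction of the weighting (METHODS strongly, RESULTS and CONCLUSION moderately) comes from the abstract.

```python
from collections import defaultdict

# Hypothetical weights. The abstract only states that METHODS is strongly
# overweighted and RESULTS/CONCLUSION moderately; the numbers are illustrative.
CLASS_WEIGHTS = {
    "PURPOSE": 1.0,
    "METHODS": 3.0,
    "RESULTS": 1.5,
    "CONCLUSION": 1.5,
}


def rerank_mesh_terms(labeled_sentences, weights=CLASS_WEIGHTS):
    """Combine per-sentence MeSH term scores into one ranked list.

    labeled_sentences: list of (argumentative_class, {mesh_term: score}).
    Each score is multiplied by the weight of the sentence's class, then
    summed per term. Terms are returned best first, ties broken by name.
    """
    totals = defaultdict(float)
    for label, term_scores in labeled_sentences:
        weight = weights.get(label, 1.0)  # unknown classes stay unweighted
        for term, score in term_scores.items():
            totals[term] += weight * score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores for a three-sentence abstract (illustration only).
example = [
    ("PURPOSE", {"Hypertension": 0.6, "Aged": 0.2}),
    ("METHODS", {"Double-Blind Method": 0.5, "Aged": 0.1}),
    ("RESULTS", {"Hypertension": 0.2}),
]

for term, score in rerank_mesh_terms(example):
    print(f"{term}: {score:.2f}")
```

With uniform weights, "Hypertension" ranks first in this toy example; with the illustrative weights above, the term that appears in the METHODS sentence moves to the top, which is the kind of ranking change the abstract describes.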