Tsatsaronis George, Macari Natalia, Torge Sunna, Dietze Heiko, Schroeder Michael
Biotechnology Center (BIOTEC), Technische Universität Dresden, 01307 Dresden, Germany.
J Biomed Semantics. 2012 Apr 24;3 Suppl 1(Suppl 1):S2. doi: 10.1186/2041-1480-3-S1-S2.
The growing volume of scientific literature on the Web and the lack of efficient tools for classifying and searching documents are the two most important factors affecting the speed of search and the quality of its results. Previous studies have shown that ontologies make it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and is a step towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).
The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, is robust to term ambiguity, and performs very well even when only a very small number of training documents is used. More precisely, the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) over the full range of explored terms (4,078 MeSH terms). Its performance is resilient to term ambiguity: it achieved an average F-measure of 92.42% (precision 99.32%, recall 86.87%) on the explored MeSH terms that are ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the suggested methodology with Naive Bayes and Decision Trees classification approaches and show that the Maximum Entropy based approach achieved a higher F-measure on both ambiguous and monosemous MeSH terms.
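The relation between the reported precision, recall and F-measure figures can be illustrated with a minimal sketch. It assumes the balanced F-measure (harmonic mean of precision and recall) and assumes that "average" means averaging per-term scores over MeSH terms; the abstract states neither explicitly. The function names and the per-term scores below are hypothetical and are not taken from the paper. Under these assumptions, the F-measure computed directly from the averaged precision and recall (about 92.7% for 99.41% and 86.77%) need not equal the average of the per-term F-measures, which would be consistent with the reported 92.4%.

```python
from statistics import mean


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(per_term):
    """Average precision, recall and F-measure over terms.

    per_term is a list of (precision, recall) pairs, one per term.
    The F-measure is computed per term first and then averaged.
    """
    precisions = [p for p, _ in per_term]
    recalls = [r for _, r in per_term]
    f_scores = [f_measure(p, r) for p, r in per_term]
    return mean(precisions), mean(recalls), mean(f_scores)


# Hypothetical per-term scores for two terms (illustration only).
scores = [(1.0, 0.5), (1.0, 1.0)]
avg_p, avg_r, avg_f = macro_average(scores)

# F-measure of the averaged precision and recall, for comparison
# with the average of the per-term F-measures.
f_of_averages = f_measure(avg_p, avg_r)

# F-measure computed directly from the averages reported in the abstract.
f_reported_averages = f_measure(0.9941, 0.8677)
```

In this toy case the average of per-term F-measures is lower than the F-measure of the averaged precision and recall, because the harmonic mean penalises the term with low recall more strongly than the arithmetic averages do.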