Kavuluru Ramakanth, Rios Anthony
Division of Biomedical Informatics, Department of Biostatistics, University of Kentucky; Department of Computer Science, University of Kentucky.
Department of Computer Science, University of Kentucky.
AMIA Annu Symp Proc. 2015 Nov 5;2015:697-706. eCollection 2015.
Assigning labels from a hierarchical vocabulary is a well-known special case of multi-label classification, often modeled to maximize micro F1-score. However, building accurate binary classifiers for poorly performing labels in the hierarchy can improve both micro and macro F1-scores. In this paper, we propose and evaluate classification strategies that use descendant node instances to build better binary classifiers for non-leaf labels, with the use case of assigning Medical Subject Headings (MeSH) to biomedical articles. Librarians at the National Library of Medicine tag each biomedical article to be indexed by their PubMed information system with terms from the MeSH terminology, a biomedical conceptual hierarchy with over 27,000 terms. Human indexers look at each article's full text to assign the set of most suitable MeSH terms for indexing it. Several recent automated attempts have focused on using the article title and abstract text to identify MeSH terms for the corresponding article. Despite these attempts, assigning MeSH terms that correspond to certain non-leaf nodes of the MeSH hierarchy remains particularly challenging. Non-leaf nodes matter because they constitute one third of the total number of MeSH terms. Here, we demonstrate that exploiting training examples of descendant terms of non-leaf nodes improves the performance of conventional classifiers for the corresponding non-leaf MeSH terms. Specifically, we focus on reducing the false positives (FPs) caused by descendant instances in traditional classifiers. Our methods achieve a relative improvement of 7.5% in macro-F1 score while also increasing the micro-F1 score by 1.6% for a set of 500 non-leaf terms in the MeSH hierarchy. These results strongly indicate the critical role of incorporating hierarchical information in MeSH term prediction. To our knowledge, our effort is the first to demonstrate the role of hierarchical information in improving binary classifiers for non-leaf MeSH terms.
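The distinction between micro and macro F1-scores drawn in the abstract can be illustrated with a minimal sketch. The label names and counts below are hypothetical and are not taken from the paper; the sketch only shows, under the standard definitions, why a poorly performing label barely moves micro-F1 but pulls macro-F1 down sharply.

```python
from typing import Dict, Tuple

# Hypothetical per-label counts: (true positives, false positives, false negatives).
# These numbers are invented for illustration only.
Counts = Dict[str, Tuple[int, int, int]]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Counts) -> float:
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: Counts) -> float:
    """F1 over counts pooled across labels: frequent labels dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


counts: Counts = {
    "frequent_label": (90, 10, 10),  # well-classified, many instances
    "rare_label": (1, 4, 5),         # poorly classified, few instances
}

print(f"macro-F1 = {macro_f1(counts):.3f}")
print(f"micro-F1 = {micro_f1(counts):.3f}")
```

In this toy setting the pooled (micro) score stays high because the frequent label dominates the counts, while the macro score is dragged down by the rare label. That is why improving classifiers for weak labels shows up most clearly in macro-F1.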