State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, PR China.
J Am Med Inform Assoc. 2011 Sep-Oct;18(5):660-7. doi: 10.1136/amiajnl-2010-000055. Epub 2011 May 25.
Because manually curating key information from the scientific literature is costly, automated methods that assist this process are highly desirable. Here, we report a novel approach to facilitate MeSH indexing, the challenging task of assigning MeSH terms to MEDLINE citations for archiving and retrieval.
Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem in which relevant MeSH headings are ranked higher than irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from those neighbors, and rank the MeSH main headings using ListNet, a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested it on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents.
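The pipeline described above (pool candidate headings from neighbor documents, then rank them with a listwise objective) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear scoring model, and the feature vectors are all assumptions introduced here, and the abstract does not specify which features were used. The loss shown is the standard ListNet top-one cross entropy.

```python
import math
from collections import Counter


def candidate_headings(neighbor_headings):
    """Pool MeSH main headings from neighbor documents.

    neighbor_headings: list of heading lists, one per neighbor document
    (hypothetical input format). Returns how many neighbors carry each heading.
    """
    counts = Counter()
    for headings in neighbor_headings:
        counts.update(set(headings))  # count each heading once per neighbor
    return counts


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top_one_loss(predicted_scores, relevance_labels):
    """ListNet top-one loss: cross entropy between the top-one probability
    distribution induced by the labels and the one induced by the scores."""
    p_true = softmax(relevance_labels)
    p_pred = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


def rank_candidates(features, weights):
    """Score each candidate heading with a linear model (an assumption made
    for this sketch) and return headings sorted from highest to lowest score."""
    scored = {
        heading: sum(w * x for w, x in zip(weights, feats))
        for heading, feats in features.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

In such a setup, training would adjust `weights` to minimize `listnet_top_one_loss` over the candidate lists of the training documents, and `rank_candidates` would then order the pooled headings for a new document.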
Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set.
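For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute precision, recall, and mean average precision for ranked heading lists. The function names and the cutoff parameter `k` are assumptions made for illustration; the abstract does not state the cutoff or the exact normalization used in the evaluation.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall of the top-k ranked headings against the gold set."""
    top = ranked[:k]
    hits = sum(1 for heading in top if heading in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant heading
    appears, normalized by the number of relevant headings."""
    hits = 0
    total = 0.0
    for rank, heading in enumerate(ranked, start=1):
        if heading in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, gold set) pairs,
    one pair per document."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, with the ranking `["A", "B", "C", "D"]` and gold headings `{"A", "C"}`, relevant items appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.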
Experimental results show that our approach makes the most accurate MeSH predictions to date, which suggests that it could have a practical impact on MeSH indexing. Furthermore, as discussed, the proposed learning framework is robust and can be adapted to many similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing.