Department of Computer Science, University of Virginia, Charlottesville, VA, USA.
Department of Information and Communication Engineering, Beijing University of Technology, Beijing, China.
Bioinformatics. 2019 Oct 1;35(19):3794-3802. doi: 10.1093/bioinformatics/btz142.
MEDLINE is the primary bibliographic database maintained by the National Library of Medicine (NLM). MEDLINE citations are indexed with Medical Subject Headings (MeSH), a controlled vocabulary curated by NLM experts, which greatly facilitates biomedical research and knowledge discovery. Currently, MeSH indexing is performed manually by human experts. To reduce the time and monetary cost of manual annotation, many automatic MeSH indexing systems have been proposed to assist annotators, including DeepMeSH and NLM's official model, Medical Text Indexer (MTI). However, existing models usually rely on the intermediate results of other models and suffer from efficiency issues. We propose an end-to-end framework, MeSHProbeNet (formerly named xgx), which uses deep learning and self-attentive MeSH probes to index MeSH terms. Each MeSH probe enables the model to extract one specific aspect of biomedical knowledge from an input article, so different MeSH probes together extract comprehensive biomedical information, and interpretability is achieved at the word level. MeSH terms are finally recommended with a unified classifier, making MeSHProbeNet both time efficient and space efficient.
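The probe mechanism described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the encoder, the scoring function, or any dimensions, so the dot-product attention, the random stand-in word states, and all names (`probe_attention`, `predict_mesh`, `W`, `b`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def probe_attention(hidden, probes):
    """Attend over word states with each probe.

    hidden: (T, d) array, one state per word (assumed to come from some encoder).
    probes: (n, d) array, one learnable vector per probe (assumed shape).
    Returns attention weights (n, T) and per-probe context vectors (n, d).
    """
    scores = probes @ hidden.T          # (n, T): probe-to-word similarity
    weights = softmax(scores, axis=1)   # each probe distributes attention over words
    contexts = weights @ hidden         # (n, d): one summary vector per probe
    return weights, contexts

def predict_mesh(hidden, probes, W, b):
    # Concatenate all probe contexts and score every MeSH term with a single
    # linear layer plus sigmoid (a stand-in for the "unified classifier").
    weights, contexts = probe_attention(hidden, probes)
    features = contexts.reshape(-1)               # (n * d,)
    logits = W @ features + b                     # (num_terms,)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, weights

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
T, d, n, num_terms = 6, 4, 3, 5
hidden = rng.normal(size=(T, d))
probes = rng.normal(size=(n, d))
W = rng.normal(size=(num_terms, n * d))
b = np.zeros(num_terms)

probs, weights = predict_mesh(hidden, probes, W, b)
```

Each row of `weights` sums to 1 over the words of the article, which is what would make a probe's behaviour inspectable at the word level: the largest entries in a row indicate the words that probe focused on.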
MeSHProbeNet won first place in the latest batch of Task A in the 2018 BioASQ challenge. The result on the last test set of the challenge is reported in this paper. Compared with other state-of-the-art models, such as MTI and DeepMeSH, MeSHProbeNet achieves the highest scores on all the F-measures, including Example Based F-Measure, Macro F-Measure, Micro F-Measure, Hierarchical F-Measure and Lowest Common Ancestor F-Measure. We also show intuitively how MeSHProbeNet extracts comprehensive biomedical knowledge from an input article.
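To make the flat F-measures named above concrete, the following sketch computes example-based, micro and macro F1 on a toy multi-label example. It follows the common textbook definitions; BioASQ's official evaluation may differ in detail (for instance in how the macro score is aggregated), and the two hierarchical measures are not covered. The function names and the toy labels are illustrative, not taken from the paper.

```python
def f1(tp, fp, fn):
    # F1 from raw counts; defined as 0.0 when there is nothing to score.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def example_based_f(gold, pred):
    # Average of per-article F1 scores.
    scores = [f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)]
    return sum(scores) / len(scores)

def micro_f(gold, pred):
    # Pool true/false positives and false negatives over all articles and labels.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return f1(tp, fp, fn)

def macro_f(gold, pred):
    # Average of per-label F1 scores (each label weighted equally).
    labels = set().union(*gold, *pred)
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy data: two articles with gold and predicted MeSH term sets.
gold = [{"Humans", "Neoplasms"}, {"Humans"}]
pred = [{"Humans"}, {"Humans", "Mice"}]

ebf = example_based_f(gold, pred)
mif = micro_f(gold, pred)
maf = macro_f(gold, pred)
```

On this toy data the macro score is lower than the other two because the rare labels, which the predictions get wrong, count as much as the frequent label that is predicted correctly.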