School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China.
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
Bioinformatics. 2020 Mar 1;36(5):1533-1541. doi: 10.1093/bioinformatics/btz756.
With the rapid growth of the biomedical literature, automatically indexing biomedical articles with Medical Subject Headings (MeSH), known as MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. In recent years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer, MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these methods is limited because they use only the title and abstract of each article.
We propose FullMeSH, a large-scale MeSH indexing method that takes advantage of the recent increase in the availability of full-text articles. Compared to DeepMeSH and other state-of-the-art methods, FullMeSH has three novelties: (i) Instead of treating the full text as a whole, FullMeSH segments it into several sections with normalized titles, so that the contribution of each section to the overall performance can be distinguished. (ii) FullMeSH integrates the evidence from different sections in a 'learning to rank' framework by combining sparse and deep semantic representations. (iii) FullMeSH trains an Attention-based Convolutional Neural Network for each section, which achieves better performance on infrequent MeSH headings. FullMeSH was developed and trained on the entire set of 1.4 million full-text articles in the PubMed Central Open Access subset. It achieved a Micro F-measure of 66.76% on a test set of 10 000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively. Furthermore, FullMeSH demonstrated an average improvement of 4.7% over DeepMeSH for indexing Check Tags, the set of most frequently indexed MeSH headings.
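For readers unfamiliar with the reported metric, the following is a minimal sketch of how a micro-averaged F-measure is commonly computed for multi-label indexing. It is not taken from the FullMeSH software; the function name `micro_f_measure` and the toy MeSH headings are illustrative assumptions. True positives, predicted labels and gold labels are pooled over all articles before precision and recall are computed, so frequent headings weigh more than rare ones.

```python
from typing import List, Set, Tuple


def micro_f_measure(
    gold: List[Set[str]], pred: List[Set[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), micro-averaged over all articles.

    gold[i] and pred[i] are the annotated and predicted MeSH headings
    of article i. Hypothetical helper, not part of FullMeSH.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same articles")

    # Pool counts over every article before dividing (micro-averaging).
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (assumed for illustration only).
gold_labels = [{"Humans", "Mice"}, {"Animals"}]
pred_labels = [{"Humans"}, {"Animals", "Mice"}]
precision, recall, f1 = micro_f_measure(gold_labels, pred_labels)
```

In this toy case two of the three predicted headings are correct and two of the three gold headings are recovered, so precision, recall and the F-measure all equal 2/3.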
The software is available upon request.
Supplementary data are available at Bioinformatics online.