
BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text.

Affiliations

School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China.

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Japan.

Publication information

Bioinformatics. 2021 May 5;37(5):684-692. doi: 10.1093/bioinformatics/btaa837.

Abstract

MOTIVATION

With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only previous method for large-scale MeSH indexing with full text, suffers from three major drawbacks: it (i) uses Learning To Rank, which is time-consuming, (ii) captures only certain pre-defined sections in full text and (iii) ignores the whole MEDLINE database.

RESULTS

We propose BERTMeSH, a computationally lighter, deep-learning-based MeSH indexing method that uses full text and is flexible with respect to section organization. BERTMeSH combines two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which allows BERTMeSH to capture the deep semantics of full text; and (ii) a transfer learning strategy that uses both full text in PubMed Central (PMC) and titles and abstracts (without full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, on 20 K test articles from PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Moreover, predicting the 20 K test articles took 5 min with BERTMeSH but more than 10 h with FullMeSH, demonstrating BERTMeSH's computational efficiency.
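The Micro F-measure reported above aggregates true positives, false positives and false negatives across all MeSH labels and articles before computing precision and recall, so frequent labels weigh more heavily than rare ones. A minimal illustrative sketch of this metric (not the authors' evaluation code; the example label sets are invented):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label predictions.

    gold, pred: parallel lists of label sets, one per article.
    Counts are pooled over all articles before computing P and R.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two articles, predicted vs. gold MeSH term sets
gold = [{"Humans", "Neoplasms"}, {"Mice", "Liver"}]
pred = [{"Humans", "Neoplasms", "Adult"}, {"Mice"}]
print(micro_f1(gold, pred))  # → 0.75
```

Here tp = 3, fp = 1 and fn = 1 over the pooled counts, giving precision = recall = 0.75, hence F1 = 0.75.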

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

