

Use of deep learning-based NLP models for full-text data elements extraction for systematic literature review tasks.

Author information

Du Jingcheng, Wang Dong, Lin Bin, He Long, Huang Liang-Chin, Wang Jingqi, Manion Frank J, Li Yeran, Cossrow Nicole, Yao Lixia

Affiliations

Intelligent Medical Objects, Houston, TX, USA.

Merck & Co., Inc., Rahway, NJ, USA.

Publication information

Sci Rep. 2025 Jun 3;15(1):19379. doi: 10.1038/s41598-025-03979-5.

Abstract

Systematic literature review (SLR) is an important tool for Health Economics and Outcomes Research (HEOR) evidence synthesis. SLRs involve identifying and selecting pertinent publications and extracting relevant data elements from full-text articles, which can be a labor-intensive manual process. Previously, we developed machine learning models to automatically identify relevant publications based on pre-specified inclusion and exclusion criteria. This study investigates the feasibility of applying Natural Language Processing (NLP) approaches to automatically extract data elements from the relevant scientific literature. First, 239 full-text articles were collected and annotated for 12 important variables, including study cohort, lab technique, and disease type, to support SLR summaries of Human papillomavirus (HPV) Prevalence, Pneumococcal Epidemiology, and Pneumococcal Economic Burden. The three resulting annotated corpora are shared publicly at https://github.com/Merck/NLP-SLR-corpora to provide training data and a benchmark baseline for the NLP community to further research this challenging task. We then compared three classic Named Entity Recognition (NER) algorithms, namely Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT) models, on the data element extraction task. The annotated corpora contain 4,498, 579, and 252 annotated entity mentions for the HPV Prevalence, Pneumococcal Epidemiology, and Pneumococcal Economic Burden tasks, respectively. Deep learning algorithms outperformed conventional machine learning algorithms in recognizing the targeted SLR data elements. LSTM models achieved micro-averaged F1 scores of 0.890, 0.646, and 0.615 on the three tasks, respectively. CRF models could not provide comparable performance on most of the elements of interest. Although BERT-based models are known to generally achieve superior performance on many NLP tasks, we did not observe an improvement from them on our three tasks. Overall, deep learning algorithms outperformed conventional machine learning models on multiple SLR data element extraction tasks. The LSTM model in particular is preferable for deployment in support of HEOR SLR data element extraction because of its better performance, generalizability, and scalability, as it is cost-effective on our SLR benchmark datasets.
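
For readers unfamiliar with the metric, a micro-averaged F1 score pools true positives, false positives, and false negatives across all entity types before computing precision and recall. The following is a minimal Python sketch, not the authors' evaluation code. It assumes BIO-tagged token sequences and exact-span matching; the abstract does not state the matching criterion, and the label names (COHORT, LAB, DISEASE) are hypothetical.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence to a set of (label, start, end) spans.

    `end` is exclusive. An I- tag that does not continue the current
    entity is treated leniently as the start of a new span (an assumption).
    """
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def micro_f1(gold_seqs, pred_seqs):
    """Entity-level micro-averaged F1 with exact span and label matching."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)   # predicted spans that match a gold span exactly
        fp += len(p - g)   # predicted spans with no gold match
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels: 2 matches, 1 spurious, 1 missed.
gold = [["B-COHORT", "I-COHORT", "O", "B-LAB"], ["O", "B-DISEASE", "O"]]
pred = [["B-COHORT", "I-COHORT", "O", "O"], ["O", "B-DISEASE", "B-LAB"]]
score = micro_f1(gold, pred)
```

Because the counts are pooled, frequent entity types dominate the score, which is worth keeping in mind when comparing a corpus with 4,498 mentions to one with 252.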


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/22b3/12134170/091bfdec8bd4/41598_2025_3979_Fig1_HTML.jpg
