Performance of Natural Language Processing for Information Extraction From Electronic Health Records Within Cancer: Systematic Review.

Authors

Dahl Simon, Bøgsted Martin, Sagi Tomer, Vesteghem Charles

Affiliations

Center for Clinical Data Science, Department of Clinical Medicine, Aalborg University, Selma Lagerløfs Vej 249, Gistrup, 9260, Denmark, +45 99407244.

Center for Clinical Data Science, Research, Education and Innovation, Aalborg University Hospital, Aalborg, Denmark.

Publication information

JMIR Med Inform. 2025 Sep 12;13:e68707. doi: 10.2196/68707.

Abstract

BACKGROUND

Over the last decade, natural language processing (NLP) has provided various solutions for information extraction (IE) from textual clinical data. In recent years, the use of NLP in cancer research has gained considerable attention, with numerous studies exploring the effectiveness of various NLP techniques for identifying and extracting cancer-related entities from clinical text data.

OBJECTIVE

We aimed to summarize the performance differences between various NLP models for IE within the context of cancer to provide an overview of the relative performance of existing models.

METHODS

This systematic literature review searched 3 databases (PubMed, Scopus, and Web of Science) for articles extracting cancer-related entities from clinical texts. In total, 33 articles were eligible for inclusion. We extracted the NLP models and their performance as measured by F1-score. Each model was assigned to one of the following categories: rule-based, traditional machine learning, conditional random field-based, neural network, and bidirectional transformer (BT). The average performance difference for each pair of categories was calculated across all articles.
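The averaging step described above could be sketched as follows. This is only an illustration, not the authors' code: the records are hypothetical, and keeping the best F1-score per category within each article is an assumption, since the abstract does not say how multiple models from one category were handled.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical (article, category, F1) records; not data from the review.
records = [
    ("article_A", "rule-based", 0.70),
    ("article_A", "BT", 0.90),
    ("article_B", "CRF", 0.80),
    ("article_B", "BT", 0.85),
    ("article_B", "rule-based", 0.60),
]


def average_pairwise_differences(records):
    """Average F1 difference for each category pair, across articles."""
    # Assumption: keep only the best F1 per category within each article.
    best = defaultdict(dict)
    for article, category, f1 in records:
        best[article][category] = max(f1, best[article].get(category, 0.0))

    # Collect the difference for every category pair reported in the same article.
    diffs = defaultdict(list)
    for scores in best.values():
        for a, b in combinations(sorted(scores), 2):
            diffs[(a, b)].append(scores[a] - scores[b])

    # Average each pair's differences over the articles that reported both.
    return {pair: mean(values) for pair, values in diffs.items()}


result = average_pairwise_differences(records)
for pair, value in sorted(result.items()):
    print(pair, round(value, 4))
```

Only articles that report both categories of a pair contribute to that pair's average, so each comparison is made on the same data set and task.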

RESULTS

The articles covered various scenarios, with the best F1-score in each article ranging from 0.355 to 0.985. In terms of overall relative performance, the BT category outperformed every other category (average F1-score difference between 0.0439 and 0.2335). The percentage of articles implementing BTs has increased over the years.

CONCLUSIONS

NLP has demonstrated the ability to identify and extract cancer-related entities from unstructured textual data. Generally, more advanced models outperform less advanced ones. The BT category performed the best.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0a59/12431712/be6f81fd9a92/medinform-v13-e68707-g001.jpg
