Development and Validation of a Natural Language Processing Algorithm for Extracting Clinical and Pathological Features of Breast Cancer From Pathology Reports.

Author information

Division of Medical Senology, European Institute of Oncology IRCCS, Milan, Italy.

Division of Early Drug Development for Innovative Therapies, European Institute of Oncology IRCCS, Milan, Italy.

Publication information

JCO Clin Cancer Inform. 2024 Aug;8:e2400034. doi: 10.1200/CCI.24.00034.

Abstract

PURPOSE

Electronic health records (EHRs) are rich repositories of real-world data that can support clinical research on breast cancer (BC). The objective of this study was to develop a natural language processing (NLP) model that extracts structured data from BC pathology reports written in natural language.

METHODS

In the first phase, the development cohort comprised 193 pathology reports from 116 patients with BC from 2012 to 2016. A rule-based NLP algorithm was applied to extract 26 variables, and its output was compared with manual extraction performed by both a data entry specialist and an oncologist. In the second phase, the data set was expanded to 513 reports, and a named entity recognition (NER)-NLP model was trained and evaluated using K-fold cross-validation.
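As a rough illustration of what a rule-based extraction step of this kind can look like, the sketch below pulls four of the biomarkers named in this abstract out of free text with regular expressions. The abstract does not describe the study's actual rules, so the patterns, the `PATTERNS` and `extract_biomarkers` names, and the sample report are all hypothetical; the sample is written in English for readability and is not taken from the study data.

```python
import re

# Hypothetical patterns for illustration only; the study's real rules are not
# given in the abstract. Each pattern captures the value that follows the
# biomarker name within a short window on the same line.
PATTERNS = {
    "ER": re.compile(
        r"\b(?:ER|estrogen receptor)\b[^\d\n]{0,40}?(\d{1,3})\s*%", re.IGNORECASE
    ),
    "PR": re.compile(
        r"\b(?:PR|PgR|progesterone receptor)\b[^\d\n]{0,40}?(\d{1,3})\s*%",
        re.IGNORECASE,
    ),
    "Ki-67": re.compile(r"\bKi-?67\b[^\d\n]{0,40}?(\d{1,3})\s*%", re.IGNORECASE),
    "HER2": re.compile(r"\bHER-?2\b[^\n]{0,40}?\b(0|1\+|2\+|3\+)", re.IGNORECASE),
}


def extract_biomarkers(report: str) -> dict:
    """Return one value per biomarker, or None when no rule matches."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report)
        if match is None:
            result[name] = None
        elif name == "HER2":
            # HER2 is reported as an immunohistochemistry score, kept as text.
            result[name] = match.group(1)
        else:
            # ER, PR, and Ki-67 are reported as percentages of stained cells.
            result[name] = int(match.group(1))
    return result


# Invented sample report, not from the study.
sample = (
    "Invasive ductal carcinoma, grade 2. "
    "Estrogen receptor: positive (95%). "
    "Progesterone receptor: positive (60%). "
    "HER2: 2+. Ki-67: 25%."
)
result = extract_biomarkers(sample)
print(result)
```

Rules like these are easy to audit but brittle when report wording varies, which is one plausible reason to move to a trained NER model in a second phase.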

RESULTS

In the first approach, concordance analysis showed 82.9% agreement between the algorithm and the oncologist, whereas agreement between the data entry specialist and the oncologist was 90.8%. In the second approach, the NER-NLP model reached an accuracy of 97.8%. Performance was strongest for estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and Ki-67 (F1-score 1.0).
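The two kinds of metric reported above can be sketched as follows. The abstract does not state how concordance or the F1-score was computed, so this assumes simple percent agreement over paired extracted values and an entity-level F1 over exact-match (label, value) pairs. The function names and all example values are hypothetical and do not reproduce the study's figures.

```python
def percent_agreement(reference, candidate):
    """Share of paired values that are identical, as a percentage."""
    if len(reference) != len(candidate):
        raise ValueError("both extractions must cover the same fields")
    if not reference:
        raise ValueError("at least one field is required")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return 100.0 * matches / len(reference)


def f1_score(gold, predicted):
    """Entity-level F1 over exact-match (label, value) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented values for five fields of one report, not from the study.
oncologist = ["positive", "positive", "negative", "2+", "25"]
algorithm = ["positive", "negative", "negative", "2+", "25"]
agreement = percent_agreement(oncologist, algorithm)

# Invented gold and predicted entities for one report.
gold_entities = [("ER", "95"), ("PR", "60"), ("HER2", "2+")]
predicted_entities = [("ER", "95"), ("PR", "60"), ("HER2", "3+")]
partial_f1 = f1_score(gold_entities, predicted_entities)
perfect_f1 = f1_score(gold_entities, gold_entities)

print(agreement, partial_f1, perfect_f1)
```

Under these assumed definitions, an F1-score of 1.0 for a biomarker means every gold-standard entity was recovered with no spurious predictions.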

CONCLUSION

The present study aligns with the rapidly evolving field of artificial intelligence (AI) applications in oncology and seeks to expedite the development of complex cancer databases and registries. The model outputs are currently being postprocessed into tabular structures to facilitate their use in real-world clinical and research settings.
