A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study.

Affiliations

Department of Machine Learning, H Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States.

Department of Medicine, Faculty of Medicine & Dentistry, and the Alberta Machine Intelligence Institute, University of Alberta, Edmonton, AB, Canada.

Publication information

J Med Internet Res. 2022 Mar 23;24(3):e27210. doi: 10.2196/27210.

PMID: 35319481

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8987958/

Abstract

BACKGROUND

Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more.

OBJECTIVE

The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports.

METHODS

We pursued three specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search for the answers to two predefined questions in each pathology report: "What organ contains the tumor?" and "What is the kind of tumor or carcinoma?" This involved supervised training on 8197 pathology reports, each with ground truth answers to these 2 questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes and another to predict histology codes. Our final system includes a network of 3 BERT-based models. We call this the CancerBERT network (caBERTnet). We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars.
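The Q&A step described above is an extractive task. A BERT question-answering head typically scores every token as a possible answer start and as a possible answer end, and the answer is the highest-scoring valid span. The following is a minimal sketch of that span-selection step only. The tokens, the scores, and the `best_span` helper are hypothetical illustrations and are not taken from the caBERTnet implementation.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_len: int = 8) -> Tuple[int, int]:
    """Return (start, end) token indices maximizing start + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Hypothetical report fragment and made-up scores for the question
# "What organ contains the tumor?" (not real model output).
tokens = ["biopsy", "of", "left", "upper", "lobe", "of", "lung",
          "shows", "adenocarcinoma"]
start_logits = [0.1, 0.0, 2.5, 1.0, 0.5, 0.0, 1.2, 0.0, 0.3]
end_logits = [0.0, 0.1, 0.2, 0.4, 1.5, 0.0, 2.8, 0.1, 0.9]

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

In a system like the one described, a phrase extracted this way would then be passed to a separate classifier that maps it to a standardized code.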

RESULTS

caBERTnet's accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.6% (1993/2042), respectively. The top 5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training data set were 92.95% (1794/1930) and 96.01% (1853/1930), respectively.
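The reported percentages follow directly from the counts in parentheses, and "top 5 accuracy" conventionally means that the true code appears among the 5 highest-ranked predictions. The sketch below recomputes the percentages and illustrates the metric on a toy example. The function names and the toy ranked lists are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int = 5) -> float:
    """Fraction of cases whose true code is among the k top-ranked predictions."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


def percent(correct: int, total: int) -> float:
    """Accuracy as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)


# Counts taken from the Results paragraph.
reported = {
    "group-level site": percent(1895, 2026),
    "group-level histology": percent(1993, 2042),
    "top-5 fine-grained site": percent(1794, 1930),
    "top-5 fine-grained histology": percent(1853, 1930),
}

# Toy example: made-up ranked predictions using ICD-O-3-style site codes.
ranked = [["C34.1", "C34.9", "C34.3"], ["C50.9", "C50.4"], ["C18.7", "C18.9"]]
truth = ["C34.9", "C50.2", "C18.7"]
toy_top1 = top_k_accuracy(ranked, truth, k=1)
toy_top2 = top_k_accuracy(ranked, truth, k=2)

for name, value in reported.items():
    print(f"{name}: {value}%")
```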

CONCLUSIONS

We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ced6/8987958/5bcdb892d0e8/jmir_v24i3e27210_fig1.jpg
