Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity.

Authors

Park Briton, Altieri Nicholas, DeNero John, Odisho Anobel Y, Yu Bin

Affiliations

Department of Statistics, University of California, Berkeley, California, USA.

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA.

Publication information

JAMIA Open. 2021 Sep 30;4(3):ooab085. doi: 10.1093/jamiaopen/ooab085. eCollection 2021 Jul.

DOI:10.1093/jamiaopen/ooab085

PMID:34604711

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8484934/

Abstract

OBJECTIVE

We develop natural language processing (NLP) methods that accurately classify tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer-to-cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods exploit shared information between cancers and auxiliary class features, respectively. Both boost performance using enriched annotations, which give location-based information as well as document-level labels for each pathology report.

MATERIALS AND METHODS

Our data consist of 250 pathology reports each for kidney, colon, and lung cancer, dated 2002 to 2019, from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity, trained on enriched annotations. We compare the HCTC and ZSS methods to the state of the art, including conventional machine learning methods as well as deep learning methods.
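
To make the string similarity idea concrete, here is a minimal sketch of zero-shot classification by matching class names against report text. It is not the authors' ZSS method, whose details are not given in this abstract: it assumes a character-level similarity from Python's `difflib`, and the function names, the example sentence, and the label list are all hypothetical. The point it illustrates is that a class with no labeled training examples can still be predicted when its name resembles a phrase in the report.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case (assumed stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_window_score(label: str, text: str) -> float:
    """Highest similarity between the label and any word window of the same length."""
    words = text.split()
    n = max(1, len(label.split()))
    # Slide a window of n words over the text; always produce at least one window.
    windows = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return max(similarity(label, w) for w in windows)


def zero_shot_classify(text: str, labels: list[str]) -> tuple[str, dict[str, float]]:
    """Pick the label whose name best matches some phrase in the text.

    No training examples are used: the class names themselves act as features.
    """
    scores = {label: best_window_score(label, text) for label in labels}
    return max(scores, key=scores.get), scores


# Hypothetical report line and histology labels, for illustration only.
line = "Histologic type: clear cell renal cell carcinoma"
labels = [
    "clear cell renal cell carcinoma",
    "papillary renal cell carcinoma",
    "chromophobe renal cell carcinoma",
]
predicted, scores = zero_shot_classify(line, labels)
print(predicted)
```

A similarity score of this kind could also serve as a prior that is combined with a trained classifier, rather than being used alone.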

RESULTS

For our HCTC method, we see an improvement of up to 0.1 micro-F1 and 0.04 macro-F1, averaged across cancers and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1, averaged across cancers and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations.
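
The two metrics reported above weight classes differently, which the following sketch illustrates for single-label classification. It uses the standard definitions (micro-F1 pools counts over all classes, macro-F1 averages per-class F1), but the function name and the toy grade labels are hypothetical, and the exact evaluation conventions used in the paper are not stated in this abstract.

```python
from collections import Counter


def micro_macro_f1(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data: a rare class ("G4") that the classifier never predicts.
y_true = ["G2"] * 8 + ["G4"] * 2
y_pred = ["G2"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

Because macro-F1 gives every class equal weight, it penalizes a model that ignores rare classes far more than micro-F1 does, which is why both are worth reporting when labeled examples are scarce.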

CONCLUSIONS

Methods based on transfer learning across cancers and on augmenting information extraction methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/683d/8484934/764ed1165941/ooab085f1.jpg
