Park Briton, Altieri Nicholas, DeNero John, Odisho Anobel Y, Yu Bin
Department of Statistics, University of California, Berkeley, California, USA.
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA.
JAMIA Open. 2021 Sep 30;4(3):ooab085. doi: 10.1093/jamiaopen/ooab085. eCollection 2021 Jul.
We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer-to-cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively. Both boost performance by using enriched annotations, which give location-based information as well as document-level labels for each pathology report.
Our data consist of 250 pathology reports each for kidney, colon, and lung cancer, dated from 2002 to 2019 and drawn from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare the HCTC and ZSS methods with state-of-the-art approaches, including both conventional machine learning methods and deep learning methods.
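To make the idea of a string similarity prior concrete, the following is a minimal sketch of zero-shot classification by matching class names against report text. It is not the ZSS method itself, whose details are not given in this abstract: the function names, the word-window matching, the use of Python's `difflib.SequenceMatcher`, and the example report text are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def best_window_similarity(text, label):
    """Highest similarity between the label and any word window of the
    same length in the text (hypothetical helper, for illustration only)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    label_words = re.findall(r"[a-z0-9]+", label.lower())
    target = " ".join(label_words)
    n = len(label_words)
    best = 0.0
    # Slide a window of n words across the report and keep the best match.
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        best = max(best, SequenceMatcher(None, window, target).ratio())
    return best


def zero_shot_classify(text, class_names):
    """Pick the class whose name is most similar to some span of the text.
    No labeled examples of any class are needed."""
    scores = {name: best_window_similarity(text, name) for name in class_names}
    return max(scores, key=scores.get), scores


# Made-up report text and class names, not taken from the study data.
report = "Specimen: right kidney, radical nephrectomy. Tumor confined to kidney."
procedures = ["radical nephrectomy", "partial nephrectomy", "biopsy"]
predicted, scores = zero_shot_classify(report, procedures)
```

A class never seen in training can still be scored this way because the only input it requires is its name, which is what makes a similarity-based prior useful when labeled examples are scarce.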
For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations.
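The two metrics reported above behave differently when classes are imbalanced: micro-F1 pools counts over all classes, while macro-F1 averages per-class F1 scores so that rare classes weigh as much as common ones. The sketch below uses the standard definitions for single-label classification; the function names and the toy labels are illustrative assumptions, not data from the study.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1_from_counts(total_tp, total_fp, total_fn)  # pooled counts
    macro = sum(per_class) / len(per_class)               # unweighted class mean
    return micro, macro


# Toy example: three reports of grade "a", one of grade "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

In this toy case, class "a" has an F1 of 0.8 and class "b" an F1 of 2/3, so macro-F1 is about 0.733, while micro-F1 is 0.75, which equals the accuracy for single-label classification.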
Methods that use transfer learning across cancers, and methods that augment information extraction with string similarity priors, can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.