Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA.
School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA.
J Am Med Inform Assoc. 2022 Jun 14;29(7):1208-1216. doi: 10.1093/jamia/ocac040.
Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer-domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of a customized cancer-related vocabulary on the performance of the CancerBERT models.
A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities for 8 cancer phenotypes in 200 pathology reports and 50 clinical notes for fine-tuning and evaluation. We continued pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (constructed with both a term frequency-based method and a manually reviewed method) to obtain the CancerBERT models. The CancerBERT models were evaluated and compared with baseline models on the cancer phenotype extraction task.
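To illustrate the vocabulary-customization step described above, the following Python sketch (not the authors' released code) shows how a BERT tokenizer and embedding matrix can be extended with cancer-specific terms using the Hugging Face transformers library before continued pretraining; the checkpoint identifier and the example terms are assumptions for illustration only.

```python
# Minimal sketch: extend a BlueBERT tokenizer with cancer-specific terms and
# resize the embedding matrix before continued masked-language-model pretraining.
from transformers import BertTokenizer, BertForMaskedLM

# Assumed BlueBERT checkpoint id; the study's actual starting checkpoint may differ.
base = "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12"
tokenizer = BertTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Hypothetical customized vocabulary, e.g. selected by corpus term frequency
# or by manual review as described in the abstract.
cancer_terms = ["her2", "erbb2", "lobular", "ki-67"]
tokenizer.add_tokens(cancer_terms)

# New tokens need embedding rows; these are then learned during continued
# pretraining on the cancer corpus.
model.resize_token_embeddings(len(tokenizer))
```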
All CancerBERT models outperformed the baseline models on the cancer phenotype NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT model with the original BERT vocabulary. The CancerBERT model with the manually reviewed customized vocabulary achieved the best performance, with macro F1 scores of 0.876 (95% CI, 0.873-0.879) for exact match and 0.904 (95% CI, 0.902-0.906) for lenient match.
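For readers unfamiliar with the two matching criteria, the sketch below shows one common way exact and lenient span matching are scored in NER evaluation; it is illustrative only and is not the paper's scoring script.

```python
# Minimal sketch of exact vs. lenient span matching for NER evaluation.
def match(pred, gold, lenient=False):
    """pred/gold are (start, end, label) tuples; end is exclusive."""
    if pred[2] != gold[2]:
        return False  # labels must agree in both settings
    if lenient:
        # Lenient match: any span overlap with the same label counts.
        return pred[0] < gold[1] and gold[0] < pred[1]
    # Exact match: identical span boundaries and label.
    return (pred[0], pred[1]) == (gold[0], gold[1])

print(match((3, 6, "HER2_STATUS"), (3, 6, "HER2_STATUS")))                 # True (exact)
print(match((3, 7, "HER2_STATUS"), (3, 6, "HER2_STATUS"), lenient=True))   # True (overlap)
```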
The CancerBERT models were developed to extract cancer phenotypes from clinical notes and pathology reports. The results validate that a customized vocabulary may further improve the performance of domain-specific BERT models on clinical NLP tasks. The CancerBERT models developed in this study could further support clinical decision making.