Department of Biomedical Informatics, University of Utah, Salt Lake City, UT.
Huntsman Cancer Institute, University of Utah, Salt Lake City, UT.
JCO Clin Cancer Inform. 2023 Jan;7:e2200131. doi: 10.1200/CCI.22.00131.
Histopathologic features are critical for studying risk factors of colorectal polyps, but remain deeply embedded within unstructured pathology reports, requiring costly and time-consuming manual abstraction for research. In this study, we developed and evaluated a natural language processing (NLP) pipeline to automatically extract histopathologic features of colorectal polyps from pathology reports, with an emphasis on individual polyp size. These data were then linked with structured electronic health record (EHR) data, creating an analysis-ready epidemiologic data set.
We obtained 24,584 pathology reports from colonoscopies performed at the University of Utah's Gastroenterology Clinic. Two investigators annotated 350 reports to determine inter-rater agreement, develop an annotation scheme, and create a reference standard for performance evaluation. The pipeline was then developed, and its performance in extracting polyp location, histology, size, shape, dysplasia, and the number of polyps was evaluated against the reference standard. Finally, the pipeline was applied to 24,225 unseen reports, and the NLP-extracted data were linked with structured EHR data.
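To illustrate the kind of extraction step described above, the following is a minimal sketch of pulling polyp sizes out of report text and normalizing them to centimeters. The abstract does not describe how the pipeline is implemented, so the regular expression, the function name `extract_sizes_cm`, and the sample report sentence are all hypothetical and are not the authors' method.

```python
import re

# Hypothetical pattern (an assumption, not the authors' rule): a number,
# optionally with a decimal part or a bare leading decimal point,
# followed by a "cm" or "mm" unit.
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?|\.\d+)\s*(cm|mm)\b", re.IGNORECASE)


def extract_sizes_cm(text):
    """Return every size mentioned in `text`, converted to centimeters."""
    sizes = []
    for value, unit in SIZE_PATTERN.findall(text):
        size = float(value)
        if unit.lower() == "mm":
            size /= 10.0  # convert millimeters to centimeters
        sizes.append(round(size, 2))
    return sizes


# Made-up report fragment for illustration only.
sample = "Sigmoid colon polyp, 0.4 cm, tubular adenoma. Rectal polyp, 6 mm."
print(extract_sizes_cm(sample))
```

A sketch like this captures only a single measurement attached directly to a unit. Multi-dimensional sizes such as "0.3 x 0.2 cm", and the assignment of each size to the correct individual polyp, would need additional rules.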
Across all features, our pipeline achieved a precision of 98.9%, a recall of 98.0%, and an F1-score of 98.4%. In patients with polyps, the pipeline correctly extracted 95.6% of sizes, 97.2% of polyp locations, 97.8% of histology, 98.3% of shapes, and 98.3% of dysplasia levels. When applied to unseen data, the pipeline classified 12,889 patients as having polyps and 4,907 patients as having no polyps, and extracted the features of 28,387 polyps. Tubular adenomas were the most common subtype (55.9%), 8.1% of polyps were advanced adenomas, and the mean polyp size was 0.57 (±0.4) cm.
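The reported F1-score is the harmonic mean of precision and recall, and it can be checked from the two figures above. The short sketch below does this with the standard formula; the function name `f1_score` is ours and is not taken from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported across all features.
f1 = f1_score(0.989, 0.980)
print(f"{f1:.1%}")  # prints 98.4%
```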
Our pipeline extracted histopathologic features of colorectal polyps from colonoscopy pathology reports, most notably individual polyp sizes, with high accuracy (F1-score of 98.4% across all features; 95.6% of sizes correctly extracted). This study demonstrates the utility of NLP for extracting polyp features and linking them with EHR data to create an epidemiologic data set for studying colorectal polyp risk factors and outcomes.
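As a rough illustration of the linkage step, the sketch below attaches NLP-extracted polyp records to structured patient records through a shared identifier. The abstract does not say how linkage was performed, so the `patient_id` key, the field names, and the sample values are hypothetical.

```python
from collections import defaultdict


def link_polyps_to_ehr(polyps, ehr_records):
    """Attach each patient's extracted polyps to that patient's EHR record.

    Both inputs are lists of dicts sharing a hypothetical "patient_id" key.
    """
    polyps_by_patient = defaultdict(list)
    for polyp in polyps:
        polyps_by_patient[polyp["patient_id"]].append(polyp)

    linked = []
    for record in ehr_records:
        row = dict(record)  # copy so the input records are not modified
        row["polyps"] = polyps_by_patient.get(record["patient_id"], [])
        linked.append(row)
    return linked


# Made-up records for illustration only.
polyps = [
    {"patient_id": "P1", "histology": "tubular adenoma", "size_cm": 0.4},
    {"patient_id": "P1", "histology": "hyperplastic", "size_cm": 0.3},
]
ehr = [{"patient_id": "P1", "age": 61}, {"patient_id": "P2", "age": 54}]
linked = link_polyps_to_ehr(polyps, ehr)
print(linked[0]["polyps"])
```

Keeping one row per patient with a nested list of polyps is only one possible layout; an analysis focused on individual polyps might instead use one row per polyp.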