Department of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA.
Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA.
Pancreatology. 2024 Jun;24(4):572-578. doi: 10.1016/j.pan.2024.03.016. Epub 2024 Mar 26.
Screening for pancreatic ductal adenocarcinoma (PDAC) is considered in high-risk individuals (HRIs) with established PDAC risk factors, such as family history and germline mutations in PDAC susceptibility genes. Accurate assessment of risk factor status depends on provider knowledge and requires extensive manual chart review by experts. Natural Language Processing (NLP) has shown promise in automated data extraction from the electronic health record (EHR). We aimed to use NLP for automated extraction of PDAC risk factors from unstructured clinical notes in the EHR.
We first developed rule-based NLP algorithms to extract PDAC risk factors at the document level, using an annotated corpus of 2091 clinical notes. Next, we further improved the NLP algorithms using a cohort of 1138 patients through patient-level training, validation, and testing, with comparison against a pre-specified reference standard. To minimize false-negative results, we prioritized algorithm recall.
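As an illustration of what a document-level rule and a patient-level roll-up might look like, the following is a minimal sketch only. The term lists, the 60-character window, the function names, and the "any positive note makes the patient positive" aggregation are all assumptions made for this example; the abstract does not describe the actual rules used.

```python
import re

# Hypothetical term lists; the study's actual lexicon is not given in the abstract.
RELATIVE = r"(?:mother|father|sister|brother|aunt|uncle|grandmother|grandfather|cousin|son|daughter)"
CANCER = r"pancrea\w*\s+(?:cancer|adenocarcinoma|carcinoma)"

# A relative term and a pancreatic cancer term within the same sentence,
# in either order, separated by at most 60 characters (an arbitrary window).
PATTERN = re.compile(
    rf"\b{RELATIVE}\b[^.]{{0,60}}?\b{CANCER}"
    rf"|\b{CANCER}\b[^.]{{0,60}}?\b{RELATIVE}\b",
    re.IGNORECASE,
)


def note_mentions_family_history(text: str) -> bool:
    """Document-level rule: does this note link a relative to pancreatic cancer?

    Negation ("mother did not have pancreatic cancer") is deliberately not
    handled, so this simple rule over-calls, trading precision for recall.
    """
    return bool(PATTERN.search(text))


def patient_has_family_history(notes: list[str]) -> bool:
    """Patient-level roll-up (assumed): positive if any single note is positive."""
    return any(note_mentions_family_history(note) for note in notes)


if __name__ == "__main__":
    notes = [
        "Patient presents for follow-up of hypertension.",
        "Her mother died of pancreatic cancer at age 62.",
    ]
    print(patient_has_family_history(notes))
```

Flagging a patient on any single positive note is one simple way to favor recall over precision, which matches the stated priority of minimizing false negatives.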
In the test set (n = 807), the NLP algorithms achieved a recall of 0.933, precision of 0.790, and F-score of 0.856 for family history of PDAC. For germline genetic mutations, the algorithm had a high recall of 0.851, while precision and F-score were lower at 0.350 and 0.496, respectively. Most false positives for germline mutations resulted from erroneous recognition of tissue mutations.
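The reported F-scores can be reproduced from the reported precision and recall. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall). The abstract does not state the weighting, but the published figures agree with F1 to three decimal places. The function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported for the test set (n = 807).
reported = {
    "family history of PDAC": (0.790, 0.933),
    "germline mutations": (0.350, 0.851),
}

if __name__ == "__main__":
    for risk_factor, (precision, recall) in reported.items():
        print(f"{risk_factor}: F1 = {f_score(precision, recall):.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, which is why the germline result sits near 0.5 despite a recall above 0.85.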
Rule-based NLP algorithms applied to unstructured clinical notes are highly sensitive for automated identification of PDAC risk factors. Further validation in a large primary-care patient population is warranted to assess real-world utility in identifying HRIs for pancreatic cancer screening.