Identification of pancreatic cancer risk factors from clinical notes using natural language processing.

Affiliations

Department of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA.

Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA.

Publication information

Pancreatology. 2024 Jun;24(4):572-578. doi: 10.1016/j.pan.2024.03.016. Epub 2024 Mar 26.

Abstract

OBJECTIVES

Screening for pancreatic ductal adenocarcinoma (PDAC) is considered in high-risk individuals (HRIs) with established PDAC risk factors, such as family history and germline mutations in PDAC susceptibility genes. Accurate assessment of risk factor status depends on provider knowledge and requires extensive manual chart review by experts. Natural language processing (NLP) has shown promise in automated data extraction from the electronic health record (EHR). We aimed to use NLP for automated extraction of PDAC risk factors from unstructured clinical notes in the EHR.

METHODS

We first developed rule-based NLP algorithms to extract PDAC risk factors at the document level, using an annotated corpus of 2091 clinical notes. Next, we refined the NLP algorithms using a cohort of 1138 patients through patient-level training, validation, and testing, with comparison against a pre-specified reference standard. To minimize false-negative results, we prioritized algorithm recall.
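
To make the two-stage design concrete, the following is a minimal sketch of a document-level rule followed by patient-level aggregation. It is not the study's algorithm: the patterns, the negation cues, and the function names `note_mentions_family_history` and `patient_has_family_history` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; the study's actual rules
# are not given in the abstract.
CANCER = re.compile(
    r"pancrea\w*\s+(?:ductal\s+)?(?:cancer|adenocarcinoma|carcinoma)",
    re.IGNORECASE,
)
FAMILY = re.compile(
    r"\b(?:family history|mother|father|sister|brother|aunt|uncle|"
    r"grandmother|grandfather|cousin|daughter|son)\b",
    re.IGNORECASE,
)
# Very small, assumed list of negation cues.
NEGATION = re.compile(r"\b(?:no|denies|negative for|without)\b", re.IGNORECASE)


def note_mentions_family_history(note: str) -> bool:
    """Document-level rule: a sentence names a relative (or 'family history')
    together with a pancreatic cancer term, with no negation cue before it."""
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        cancer = CANCER.search(sentence)
        if cancer is None or FAMILY.search(sentence) is None:
            continue
        # Skip the sentence if a negation cue precedes the cancer term.
        if NEGATION.search(sentence[: cancer.start()]):
            continue
        return True
    return False


def patient_has_family_history(notes: list[str]) -> bool:
    """Patient-level rule (assumed): positive if any note is positive.
    This 'any' aggregation favors recall over precision."""
    return any(note_mentions_family_history(n) for n in notes)
```

Flagging a patient when any single note is positive trades precision for recall, which matches the stated priority of minimizing false negatives. A real system would need far richer vocabularies and context handling than this sketch.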

RESULTS

In the test set (n = 807), the NLP algorithms achieved a recall of 0.933, precision of 0.790, and F-score of 0.856 for family history of PDAC. For germline genetic mutations, the algorithm had a high recall of 0.851, while precision and F-score were lower at 0.350 and 0.496, respectively. Most false positives for germline mutations resulted from erroneous recognition of tissue mutations.
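
The reported F-scores can be cross-checked against the reported precision and recall. The sketch below assumes the F-score is the balanced F1 measure, the harmonic mean of precision and recall; the abstract does not name the variant, and the helper `f_score` is introduced here only for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values are taken from the RESULTS paragraph.
family_history = f_score(precision=0.790, recall=0.933)
germline = f_score(precision=0.350, recall=0.851)

print(f"family history F1: {family_history:.3f}")
print(f"germline mutation F1: {germline:.3f}")
```

Under that assumption, both values round to the figures given in the text (0.856 and 0.496), so the germline F-score is pulled down by the low precision rather than by recall.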

CONCLUSIONS

Rule-based NLP algorithms applied to unstructured clinical notes are highly sensitive for automated identification of PDAC risk factors. Further validation in a large primary-care patient population is warranted to assess real-world utility in identifying HRIs for pancreatic cancer screening.
