Shi Jianlin, Morgan Keaton L, Bradshaw Richard L, Jung Se-Hee, Kohlmann Wendy, Kaphingst Kimberly A, Kawamoto Kensaku, Del Fiol Guilherme
Veterans Affairs Informatics and Computing Infrastructure, Department of Veterans Affairs Salt Lake City Health Care System, Salt Lake City, UT, United States.
Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT, United States.
JMIR Med Inform. 2022 Aug 11;10(8):e37842. doi: 10.2196/37842.
Family health history has been recognized as an essential factor for cancer risk assessment and is an integral part of many cancer screening guidelines, including genetic testing for personalized clinical management strategies. However, manually identifying eligible candidates for genetic testing is labor intensive.
The aim of this study was to develop a natural language processing (NLP) pipeline and assess its contribution to identifying patients who meet genetic testing criteria for hereditary cancers based on family health history data in the electronic health record (EHR). We compared an algorithm that uses structured data alone with structured data augmented using NLP.
Algorithms were developed based on the National Comprehensive Cancer Network (NCCN) guidelines for genetic testing for hereditary breast, ovarian, pancreatic, and colorectal cancers. The NLP-augmented algorithm uses both structured family health history data and the associated unstructured free-text comments. The algorithms were compared with a reference standard of 100 patients with a family health history in the EHR.
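The difference between the two algorithms can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `FamilyHistoryEntry` record, the regular expression standing in for the NLP pipeline, and the toy eligibility rule are hypothetical and do not reproduce the study's pipeline or the actual NCCN criteria.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class FamilyHistoryEntry:
    """Hypothetical family health history record: structured fields plus a free-text comment."""
    relative: str
    condition: str
    age_at_onset: Optional[int]  # often left empty in the structured field
    comment: str = ""


# Toy stand-in for an NLP step: pull an age out of phrases such as "dx at 42" or "aged 42".
AGE_PATTERN = re.compile(r"\b(?:at|age|aged)\s+(\d{1,3})\b", re.IGNORECASE)


def augment_with_comment(entry: FamilyHistoryEntry) -> FamilyHistoryEntry:
    """Fill a missing age at onset from the comment; structured values are never overwritten."""
    if entry.age_at_onset is not None:
        return entry
    match = AGE_PATTERN.search(entry.comment)
    if match is None:
        return entry
    return replace(entry, age_at_onset=int(match.group(1)))


def meets_toy_rule(entries: List[FamilyHistoryEntry], max_age: int = 50) -> bool:
    """Illustrative rule only (not an NCCN criterion): a relative with early-onset breast cancer."""
    return any(
        e.condition == "breast cancer"
        and e.age_at_onset is not None
        and e.age_at_onset <= max_age
        for e in entries
    )


entries = [FamilyHistoryEntry("mother", "breast cancer", None, "dx at 42, treated elsewhere")]

structured_only = meets_toy_rule(entries)
nlp_augmented = meets_toy_rule([augment_with_comment(e) for e in entries])
print(structured_only, nlp_augmented)
```

In this sketch the structured-only check misses the patient because the age at onset appears only in the comment, whereas the augmented check recovers it.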
On the reference standard, the NLP-augmented algorithm identified patients meeting the NCCN criteria with significantly higher recall than the structured data algorithm, 0.95 (95% CI 0.90-0.99) versus 0.29 (95% CI 0.19-0.40), and with a precision of 0.99 (95% CI 0.96-1.00) versus 0.81 (95% CI 0.65-0.95). On the whole data set, the NLP-augmented algorithm extracted 33.6% more entities, resulting in 53.8% more patients meeting the NCCN criteria.
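For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed against a reference standard. The patient identifiers are invented for illustration and are not the study's data; the function name `precision_recall` is likewise hypothetical, and confidence intervals are not computed here.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted patient IDs against reference-standard IDs."""
    true_positives = len(predicted & reference)
    # Precision: share of flagged patients who truly meet the criteria.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of truly eligible patients who were flagged.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up example: the algorithm flags four patients, five are truly eligible.
predicted_ids = {"p1", "p2", "p3", "p4"}
reference_ids = {"p1", "p2", "p3", "p5", "p6"}

precision, recall = precision_recall(predicted_ids, reference_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A low recall, as reported for the structured data algorithm, means that many eligible patients are never flagged even when most of the flagged patients are correct.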
Compared with the structured data algorithm, the NLP-augmented algorithm based on both structured and unstructured family health history data in the EHR increased the number of patients identified as meeting the NCCN criteria for genetic testing for hereditary breast or ovarian and colorectal cancers.