Oh Inez Y, Schindler Suzanne E, Ghoshal Nupur, Lai Albert M, Payne Philip R O, Gupta Aditi
Institute for Informatics, Washington University School of Medicine, St. Louis, Missouri, USA.
Department of Neurology, Washington University School of Medicine, St. Louis, Missouri, USA.
JAMIA Open. 2023 Feb 24;6(1):ooad014. doi: 10.1093/jamiaopen/ooad014. eCollection 2023 Apr.
There is considerable interest in using clinical data to develop prediction models for Alzheimer's disease (AD) risk, progression, and outcomes. Existing studies have mostly relied on curated research registries, image analysis, and structured electronic health record (EHR) data. However, much critical information resides in unstructured clinical notes within the EHR, which are relatively inaccessible.
We developed a natural language processing (NLP)-based pipeline to extract AD-related clinical phenotypes, documenting strategies for success and assessing the utility of mining unstructured clinical notes. We evaluated the pipeline against gold-standard manual annotations performed by 2 clinical dementia experts for AD-related clinical phenotypes including medical comorbidities, biomarkers, neurobehavioral test scores, behavioral indicators of cognitive decline, family history, and neuroimaging findings.
Documentation rates for each phenotype differed between the structured and unstructured EHR. Interannotator agreement for each phenotype was high (Cohen's kappa = 0.72-1) and was positively correlated with the performance of the NLP-based phenotype extraction pipeline on that phenotype (average F1-score = 0.65-0.99).
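For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Cohen's kappa (agreement between 2 annotators) and an F1-score (pipeline output versus gold-standard labels) can be computed for a single binary phenotype. The function names and the toy label lists are illustrative assumptions and are not taken from the study; the authors' actual evaluation code is not described in this abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def f1_score(gold, predicted, positive=1):
    """F1-score of predictions against gold labels for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = phenotype documented in the note, 0 = not.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0]
pipeline_output = [1, 1, 0, 1, 1, 0, 0, 0]

kappa = cohens_kappa(annotator_1, annotator_2)
f1 = f1_score(annotator_1, pipeline_output)
print(f"kappa={kappa:.2f}, F1={f1:.2f}")
```

In this toy case one annotator's labels stand in for the gold standard; in practice the gold standard would be the adjudicated annotations, and the two metrics would be computed separately for each phenotype.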
We developed an automated NLP-based pipeline to extract informative phenotypes that may improve the performance of eventual machine learning predictive models for AD. In the process, we examined documentation practices for each phenotype relevant to the care of AD patients and identified factors for success.
The success of our NLP-based phenotype extraction pipeline depended on domain-specific knowledge and on focusing on a specific clinical domain rather than maximizing generalizability.