Institute of Health Informatics, University College London, London, UK.
Health Data Research UK, London, UK.
BMC Med Inform Decis Mak. 2021 Dec 8;21(1):343. doi: 10.1186/s12911-021-01693-6.
Alzheimer's disease (AD) is a highly heterogeneous disease with diverse trajectories and outcomes observed in clinical populations. Understanding this heterogeneity can enable better treatment, prognosis and disease management. Studies to date have mainly used imaging or cognition data and have been limited in terms of data breadth and sample size. Here we use electronic health records (EHR) to examine the clinical heterogeneity of Alzheimer's disease patients, applying multiple clustering methods to identify and characterise disease subgroups that are clinically actionable.
We identified AD patients in primary care EHR from the Clinical Practice Research Datalink (CPRD) using a previously validated rule-based phenotyping algorithm. We extracted and included a range of comorbidities, symptoms and demographic features as patient features. We evaluated four different clustering methods (k-means, kernel k-means, affinity propagation and latent class analysis) to cluster Alzheimer's disease patients. We compared clusters on clinically relevant outcomes and evaluated each method using measures of cluster structure, stability, efficiency of outcome prediction and replicability in external data sets.
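As an illustration of the clustering and evaluation steps described above, the following is a minimal sketch and not the study's implementation. It runs plain Lloyd's k-means on synthetic patients and scores the result with the mean silhouette coefficient, which is assumed here as one possible measure of cluster structure; the abstract does not name the measures used. The three features (scaled age at onset, depression flag, anxiety flag), the data and all function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means. Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        dists = {c: [] for c in clusters}
        for j, q in enumerate(points):
            if i != j:
                dists[labels[j]].append(squared_distance(p, q) ** 0.5)
        own = dists[labels[i]]
        if not own:
            continue  # singleton cluster contributes a silhouette of 0
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(d) / len(d) for c, d in dists.items() if c != labels[i] and d
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic data (an assumption for illustration only):
# [scaled age at onset, depression flag, anxiety flag]
rng = random.Random(42)
younger = [[rng.uniform(0.0, 0.3), 1.0, 1.0] for _ in range(20)]
older = [[rng.uniform(0.7, 1.0), 0.0, 0.0] for _ in range(20)]
points = younger + older

labels, centroids = kmeans(points, k=2)
score = mean_silhouette(points, labels)
print(labels, round(score, 3))
```

In practice the same scoring function could be applied to the labels produced by each clustering method, so that the methods are compared on a common measure.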
We identified 7,913 AD patients with a mean age of 82 years, of whom 66.2% were female. We included 21 features in our analysis. We observed 5, 2, 5 and 6 clusters with k-means, kernel k-means, affinity propagation and latent class analysis, respectively. K-means produced the most consistent results across the four evaluative measures. Three of the four methods recovered a consistent cluster composed predominantly of female patients with younger disease onset (43% between ages 42 and 73) who were diagnosed with depression and anxiety and who progressed more quickly than the average across other clusters.
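One way to check whether a cluster recurs across methods is to match clusters by the overlap of their members. The sketch below is an assumption for illustration, since the abstract does not state how consistency across methods was assessed. It pairs each cluster from one method with the cluster from another method that has the highest Jaccard index; the toy assignments and function names are hypothetical.

```python
def cluster_members(labels):
    """Map each cluster label to the set of patient indices assigned to it."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, set()).add(idx)
    return members


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


def best_matches(labels_a, labels_b):
    """For each cluster in A, find the cluster in B with the highest Jaccard overlap."""
    clusters_a = cluster_members(labels_a)
    clusters_b = cluster_members(labels_b)
    matches = {}
    for name_a, set_a in clusters_a.items():
        name_b, score = max(
            ((nb, jaccard(set_a, sb)) for nb, sb in clusters_b.items()),
            key=lambda pair: pair[1],
        )
        matches[name_a] = (name_b, score)
    return matches


# Hypothetical toy assignments for 10 patients from two clustering methods.
method_a = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
method_b = ["x", "x", "x", "y", "y", "y", "y", "z", "z", "z"]

matches = best_matches(method_a, method_b)
for name_a, (name_b, score) in matches.items():
    print(f"cluster {name_a} best matches cluster {name_b} (Jaccard {score:.2f})")
```

A cluster that keeps a high overlap score against its best match in several methods would be a candidate for the kind of consistent subgroup described above.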
Each clustering approach produced substantially different clusters, and k-means performed best of the four methods on the four evaluative criteria. However, the consistent appearance of one particular cluster across three of the four methods suggests the possible presence of a distinct disease subtype that merits further exploration. Our study underlines the variability of the results obtained from different clustering approaches and the importance of systematically evaluating different approaches for identifying disease subtypes in complex EHR.