Noshin Kazi, Boland Mary Regina, Hou Bojian, Lu Victoria, Manning Carol, Shen Li, Zhang Aidong
Department of Computer Science, University of Virginia, VA 22903, USA.
Pac Symp Biocomput. 2025;30:631-646.
Alzheimer's Disease and Related Dementias (ADRD) affect almost 7 million people in the United States alone. Most ADRD research is conducted using post-mortem samples of brain tissue or carefully recruited clinical trial patients. While these resources are excellent, they suffer from a lack of sex/gender and racial/ethnic inclusiveness. Electronic Health Records (EHR) data have the potential to bridge this gap by including real-world ADRD patients treated during routine clinical care. In this study, we use EHR data from a cohort of 70,420 ADRD patients diagnosed and treated at Penn Medicine. Our goal is to uncover important risk features for three types of neurodegenerative disorders (NDD): Alzheimer's Disease (AD), Parkinson's Disease (PD), and Other Dementias (OD). We employ a variety of machine learning (ML) methods, including univariate and multivariate approaches, and compare accuracies across them. We also examine the types of features identified by each method, including the features shared across methods and those unique to one method, to highlight the advantages and disadvantages of each approach for specific NDD types. Our study is relevant to those studying ADRD and NDD in EHRs because it highlights the strengths and limitations of popular approaches employed in the ML community. We found that the univariate approach was able to uncover features that were important and rare for specific types of NDD (AD, PD, OD), which matters from a clinical perspective. Features identified by all methods are the most robust.