Sam Khinda, Senior Project Director, IQVIA Project Leadership, 500 Brook Drive, Green Park, Reading, Berks RG2 6UU, UK.
J Prev Alzheimers Dis. 2019;6(3):185-191. doi: 10.14283/jpad.2019.10.
Recruiting patients for clinical trials of potential therapies for Alzheimer's disease (AD) remains a major challenge, with demand for trial participants at an all-time high. The AD treatment R&D pipeline includes around 112 agents. In the United States alone, 150 clinical trials are seeking 70,000 participants. Most people with early cognitive impairment consult primary care providers, who may lack the time, diagnostic skills and awareness of local clinical trials needed to refer them. Machine learning and predictive analytics could boost enrollment by predicting which patients have prodromal AD and which will go on to develop AD.
The authors set out to develop a machine learning predictive model that identifies prodromal AD patients in the general population, to aid early AD detection by primary care physicians and timely referral to expert sites for biomarker confirmation of diagnosis and clinical trial enrollment.
The authors used a classification machine learning algorithm to extract patterns from healthcare claims and prescription data recorded three years prior to AD diagnosis or AD drug initiation.
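One possible reading of the three-year offset is a cutoff date: only records dated at least three years before the first AD diagnosis or AD drug start are kept as model input. The sketch below illustrates that reading only. The function name, the `(service_date, code)` record layout, the placeholder codes, and the 1,095-day approximation of three years are all assumptions, not details given in the source.

```python
from datetime import date, timedelta

# Assumption: "three years" approximated as 3 * 365 days.
OFFSET = timedelta(days=3 * 365)


def records_before_offset(records, index_date, offset=OFFSET):
    """Keep only records dated on or before index_date minus the offset.

    records: list of (service_date, code) tuples (hypothetical layout).
    index_date: date of first AD diagnosis or AD drug initiation.
    """
    cutoff = index_date - offset
    return [(d, code) for d, code in records if d <= cutoff]


# Made-up patient history with placeholder codes.
history = [
    (date(2012, 3, 1), "DX_A"),
    (date(2014, 6, 15), "PROC_B"),
    (date(2016, 1, 10), "RX_C"),
]
usable = records_before_offset(history, index_date=date(2016, 1, 10))
print(usable)
```

With this cutoff, anything recorded in the three years leading up to the index date is hidden from the model, so it cannot simply learn from the diagnostic work-up itself.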
The study focused on subjects included within proprietary IQVIA US data assets (claims and prescription databases). Patient information was extracted from January 2010 to July 2018, for cohorts aged between 50 and 85 years.
A total of 88,298,289 subjects aged between 50 and 85 years were identified. For the positive cohort, 667,288 subjects were identified who had 24 months of medical history and at least one record of AD or AD treatment. For the negative cohort, 3,670,254 patients were selected who had a similar length of medical history and who were matched to positive cohort subjects based on the prevalence rate. The scoring cohort was selected based on the availability of 2 to 5 years of recent medical data and included 72,670,283 subjects between the ages of 50 and 85 years. Intervention (if any): None.
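The cohort sizes above imply a class balance that is useful context when reading precision-recall metrics. The short calculation below uses only the reported counts; treating the positive and negative cohorts together as the modelling sample is an assumption, since the source does not state the exact composition of the training and test sets.

```python
positives = 667_288      # positive cohort size reported above
negatives = 3_670_254    # negative cohort size reported above

# Negatives selected per positive subject.
ratio = negatives / positives

# Share of positives if both cohorts form the modelling sample (assumption).
positive_share = positives / (positives + negatives)

print(f"negatives per positive: {ratio:.2f}")
print(f"positive share of modelling sample: {positive_share:.3f}")
```

This gives roughly 5.5 negatives per positive and a positive share of about 0.154. A classifier with no skill has an area under the precision-recall curve close to the positive share of the evaluation data, so that figure serves as a rough baseline if the test split preserves this ratio.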
A list of clinically relevant and interpretable predictors was generated and extracted from the data sets for each subject, including pharmacological treatments (NDC/product), office/specialist visits (specialty), tests and procedures (HCPCS and CPT), and diagnoses (ICD). The positive cohort was defined as patients with an AD diagnosis or AD treatment, with a 3-year offset applied as an estimate of the time of prodromal AD diagnosis. Supervised ML techniques were used to develop algorithms to predict the occurrence of prodromal AD cases. The sample dataset was divided randomly into a training dataset and a test dataset. The classification models were trained and executed in the PySpark framework. LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier were trained and evaluated using PySpark's MLlib library. The area under the precision-recall curve (AUCPR) was used to compare the results of the various models.
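To make the comparison metric concrete, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve, in plain Python. It is not the authors' PySpark pipeline; the function name and the toy labels and scores are illustrative assumptions.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of predicted scores, higher = more likely positive.
    Ties in scores are broken by input order, a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Each positive raises recall by 1/total_positives;
            # weight that step by the precision at this rank.
            area += (true_positives / rank) / total_positives
    return area


# Toy example: positives sit at ranks 1, 3 and 6 of 8.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ap = average_precision(toy_labels, toy_scores)
print(f"average precision: {ap:.3f}")
```

In PySpark, `BinaryClassificationEvaluator` with `metricName="areaUnderPR"` reports an area under the precision-recall curve; the source does not say whether the authors used it, and its interpolation may give slightly different values from the step-wise sum shown here.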
The AUCPRs were 0.426, 0.157, 0.436, and 0.440 for LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier, respectively, meaning that GBTClassifier (gradient-boosted trees) narrowly outperformed the other three classifiers. The GBT model identified 222,721 subjects in the prodromal AD stage with 80% precision. Some 76% of identified prodromal AD patients were in the primary care setting.
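The source does not say how the 80% precision operating point was chosen. One common approach, sketched here as an assumption, is to pick the lowest score cutoff on labelled held-out data whose precision still meets the target, then flag every subject in the scoring cohort at or above that cutoff. The function name and toy data are hypothetical.

```python
def threshold_for_precision(labels, scores, target=0.80):
    """Return (threshold, n_flagged) for the lowest score cutoff whose
    precision on the labelled data is at least `target`, or None."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # A cutoff cannot split a group of tied scores, so only
        # evaluate precision at the last member of each tie group.
        if rank < len(ranked) and ranked[rank][0] == score:
            continue
        if true_positives / rank >= target:
            best = (score, rank)
    return best


# Toy held-out set, already sorted by descending score.
toy_labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]
cutoff = threshold_for_precision(toy_labels, toy_scores, target=0.80)
print(cutoff)
```

Lowering the cutoff flags more subjects but usually reduces precision, so the target sets the trade-off between the number of referrals and the proportion expected to be true prodromal cases.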
Applying the developed predictive model to 72,670,283 U.S. residents, 222,721 prodromal AD patients were identified, the majority of whom were in the primary care setting. This could drive major advances in AD research by enabling more accurate and earlier prodromal AD diagnosis at the primary care physician level, which would facilitate timely referral to expert sites for in-depth assessment and potential enrollment in clinical trials.
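The reported figures can be put in proportion with some simple arithmetic. The calculation below uses only numbers stated in the abstract and assumes that the 80% precision measured during evaluation carries over unchanged to the scoring cohort, which the source does not establish.

```python
scored = 72_670_283        # scoring cohort size reported above
flagged = 222_721          # subjects flagged as prodromal AD
precision = 0.80           # reported precision
primary_care_share = 0.76  # reported share in the primary care setting

flag_rate = flagged / scored
expected_true = round(flagged * precision)
in_primary_care = round(flagged * primary_care_share)

print(f"flag rate: {flag_rate:.2%}")
print(f"expected true prodromal cases: {expected_true:,}")
print(f"flagged subjects in primary care: {in_primary_care:,}")
```

Under that assumption, about 0.31% of the scored population is flagged, roughly 178,000 of the flagged subjects would be true prodromal cases, and roughly 169,000 flagged subjects would be found in primary care.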