Li Ruowang, Benz Luke, Duan Rui, Denny Joshua C, Hakonarson Hakon, Mosley Jonathan D, Smoller Jordan W, Wei Wei-Qi, Lumley Thomas, Ritchie Marylyn D, Moore Jason H, Chen Yong
Department of Computational Biomedicine, Cedars-Sinai Medical Center.
Department of Biostatistics, Harvard T.H. Chan School of Public Health.
medRxiv. 2024 Dec 4:2024.01.09.24301073. doi: 10.1101/2024.01.09.24301073.
In cross-cohort studies, integrating diverse datasets, such as electronic health records (EHRs), is both essential and challenging due to cohort-specific variations, distributed data storage, and data privacy concerns. Traditional methods often require data pooling or complex data harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed EHR datasets via summary statistics. Unlike existing approaches, mixWAS preserves cohort-specific covariate associations and supports simultaneous mixed-outcome analyses. Simulations demonstrate that mixWAS outperforms conventional methods in accuracy and efficiency across various scenarios. Applied to EHR data from seven cohorts in the US, mixWAS identified 4,534 significant cross-cohort genetic associations among traits such as blood lipids, BMI, and circulatory diseases. Validation with an independent UK EHR dataset confirmed 97.7% of these associations, underscoring the algorithm's robustness. By enabling lossless cross-cohort integration, mixWAS improves the precision of multi-outcome analyses and expands the potential for actionable insights in healthcare research.
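The idea of one-shot integration through summary statistics can be illustrated with a minimal sketch. This is not the mixWAS implementation: it assumes a single continuous outcome, a linear null model fitted separately at each site, and a score test for the variant effect, none of which the abstract specifies. All function names (`site_summary`, `combine`, `simulate_site`) and the simulated data are hypothetical. Each site sends only two numbers, so no individual-level data are pooled, and each site keeps its own covariate effects.

```python
import math
import numpy as np

def site_summary(y, g, X):
    """Run locally at one site; returns (score, variance) for the variant effect.

    Assumption: linear null model y ~ X with site-specific coefficients.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # site-specific covariate effects
    resid = y - X @ beta                             # outcome residuals under the null
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # site-specific residual variance
    gamma, *_ = np.linalg.lstsq(X, g, rcond=None)
    g_resid = g - X @ gamma                          # genotype adjusted for covariates
    score = g_resid @ resid / sigma2
    variance = g_resid @ g_resid / sigma2
    return score, variance

def combine(summaries):
    """Run once at the coordinating center on the per-site (score, variance) pairs."""
    total_score = sum(s for s, _ in summaries)
    total_var = sum(v for _, v in summaries)
    stat = total_score ** 2 / total_var
    # Upper tail of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical simulated cohorts with different covariate effects
rng = np.random.default_rng(0)

def simulate_site(n, covariate_effect, variant_effect):
    g = rng.binomial(2, 0.3, size=n).astype(float)   # additive genotype coding
    age = rng.normal(size=n)
    X = np.column_stack([np.ones(n), age])           # intercept plus one covariate
    y = covariate_effect * age + variant_effect * g + rng.normal(size=n)
    return y, g, X

sites = [simulate_site(500, 0.5, 0.2), simulate_site(800, -1.0, 0.2)]
stat, p_value = combine([site_summary(*s) for s in sites])
print(f"score statistic = {stat:.3f}, p = {p_value:.3g}")
```

Under these assumptions the combined statistic matches the score test computed on the stacked data with site-specific covariate effects and variances, which is the sense in which such an aggregation loses no information. Handling mixed binary and continuous outcomes jointly, as the abstract describes, would need additional machinery not shown here.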