Smith Johanna L, Wong Quenna, Hornsby Whitney, Conomos Matthew P, Heavner Benjamin D, Kullo Iftikhar J, Psaty Bruce M, Rich Stephen S, Stilp Adrienne M, Tayo Bamidele, Zhang Yuji, Natarajan Pradeep, Nelson Sarah C
Cardiovascular Medicine, Mayo Clinic, Rochester, MN 55902, USA.
Biostatistics, University of Washington, Seattle, WA 98195, USA.
Am J Hum Genet. 2025 Jul 3. doi: 10.1016/j.ajhg.2025.06.004.
Sharing diverse genomic and other biomedical datasets is critical to advancing scientific discoveries and their equitable translation to improve human health. However, data sharing remains challenging in the context of legacy datasets, evolving policies, multi-institutional consortium science, and international stakeholders. The NIH-funded Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium was established to improve the performance of polygenic risk estimates for a broad range of health and disease outcomes with global impacts. Improving polygenic risk score performance across genetically diverse populations requires access to large, diverse cohorts. We report on the design and implementation of data-sharing policies and procedures developed in PRIMED to aggregate and analyze data from multiple heterogeneous sources while adhering to pre-existing data-sharing policies for each integrated dataset and respecting participant preferences and informed consent. Specifically, we describe two primary data-sharing mechanisms, coordinated dbGaP applications and a Consortium Data Sharing Agreement, and provide alternatives (e.g., federated analyses) for when individual-level data cannot be shared within the Consortium. We also describe technical implementation of Consortium data sharing in the NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL) cloud platform to share derived individual-level data, genomic summary results, and methods workflows with appropriate permissions. As a consortium making secondary use of pre-existing data sources, we also discuss challenges and propose solutions for release of individual- and summary-level data products to the broader scientific community. We make recommendations for ongoing and future policymaking with the goal of informing future consortia and other research activities.