Khattab Ahmed, Chen Shang-Fu, Wineinger Nathan, Torkamani Ali
Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, USA.
Scripps Research Translational Institute, 3344 North Torrey Pines Court, Suite 300, La Jolla, CA, 92037, USA.
BMC Genomics. 2025 May 22;26(1):521. doi: 10.1186/s12864-025-11693-9.
The All of Us (AoU) Research Program provides a comprehensive genomic dataset to accelerate health research and medical breakthroughs. Despite its potential, researchers face significant challenges, including high costs and inefficiencies associated with data extraction and analysis. AoUPRS addresses these challenges by offering a versatile and cost-effective tool for calculating polygenic risk scores (PRS), enabling both experienced and novice researchers to leverage the AoU dataset for large-scale genomic discoveries.
We evaluated three PRS models from the PGS Catalog (coronary artery disease, atrial fibrillation, and type 2 diabetes) using two distinct approaches in the Hail framework: MatrixTable (MT), a dense representation, and Variant Dataset (VDS), a sparse representation optimized for large-scale genomic data. We compared computational cost, resource usage, and processing time between the two approaches. To assess whether they yield similar PRS performance, we compared odds ratios (ORs) and area under the curve (AUC). Lin's concordance correlation coefficient (CCC) was also computed to quantify agreement between the scores generated by MT and VDS.
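To illustrate why a dense and a sparse genotype representation can give the same score, the following minimal sketch computes a PRS as a weighted sum of effect-allele counts under both layouts. It does not use Hail and is not the AoUPRS implementation; the variant identifiers, weights, and function names are hypothetical, and it assumes the effect allele is always the non-reference allele, so a variant absent from the sparse record contributes zero.

```python
from typing import Dict, List

# Hypothetical effect weights for a three-variant score (illustrative values only).
WEIGHTS: Dict[str, float] = {"var1": 0.12, "var2": -0.05, "var3": 0.30}


def prs_dense(dosages: List[int], variant_order: List[str]) -> float:
    """Dense layout: one effect-allele count (0, 1, or 2) is stored for every variant."""
    return sum(WEIGHTS[v] * d for v, d in zip(variant_order, dosages))


def prs_sparse(non_ref_calls: Dict[str, int]) -> float:
    """Sparse layout: only non-reference calls are stored.

    Assumption: a variant missing from the record is homozygous reference
    and therefore carries zero effect alleles.
    """
    return sum(WEIGHTS[v] * d for v, d in non_ref_calls.items() if v in WEIGHTS)


variant_order = ["var1", "var2", "var3"]
dense_row = [2, 0, 1]                  # every variant is represented explicitly
sparse_row = {"var1": 2, "var3": 1}    # the homozygous-reference call is omitted

dense_score = prs_dense(dense_row, variant_order)
sparse_score = prs_sparse(sparse_row)
print(dense_score, sparse_score)
```

When the effect allele is the reference allele, or when a missing entry means "no call" rather than homozygous reference, the sparse path needs extra handling; this is one reason the two layouts may agree closely without being identical.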
The VDS approach reduced computational costs by up to 99.51% (e.g., from $32 to $0.036 for a 51-SNP score) while producing PRS estimates highly similar to those obtained with the MT approach. Across all three PRS models, AUC comparisons showed minimal differences between MT and VDS, indicating that both approaches yield consistent PRS performance. Agreement between the scores calculated by the two approaches was further supported by Lin's CCC values ranging from 0.9199 to 0.9944, indicating strong concordance. Empirical cumulative distribution function (ECDF) plots further illustrated the near-identical distribution of PRS values across methods.
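For readers unfamiliar with Lin's CCC, the sketch below shows one common way to compute it, using population (1/n) variances and covariance. The function name and the toy inputs are illustrative assumptions; the abstract does not state which estimator variant the authors used.

```python
from typing import Sequence


def lins_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient for paired measurements.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    with variances and covariance computed using a 1/n denominator.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov_xy / denominator


# Toy paired scores (not data from the study).
identical = lins_ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = lins_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(identical, shifted)
```

Unlike Pearson correlation, the CCC penalizes a systematic shift between the two sets of scores: the shifted pair above is perfectly correlated but its CCC falls below 1.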
AoUPRS enables efficient and cost-effective PRS computation within AoU, providing substantial cost savings while maintaining highly consistent PRS estimates. These findings support the use of AoUPRS for large-scale genomic risk assessment, making the AoU dataset more accessible and practical for diverse research applications. The tool's open-source availability on GitHub, coupled with detailed documentation and tutorials, ensures accessibility and ease of use for the scientific community.