Khattab Ahmed, Chen Shang-Fu, Wineinger Nathan, Torkamani Ali
Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, USA.
Scripps Research Translational Institute, La Jolla, California, 92037, USA.
bioRxiv. 2024 Jul 16:2024.07.11.603165. doi: 10.1101/2024.07.11.603165.
The All of Us (AoU) Research Program provides a comprehensive genomic dataset to accelerate health research and medical breakthroughs. Despite its potential, researchers face significant challenges, including high costs and inefficiencies associated with data extraction and analysis. AoUPRS addresses these challenges by offering a versatile and cost-effective tool for calculating polygenic risk scores (PRS), enabling both experienced and novice researchers to leverage the AoU dataset for significant genomic discoveries.
AoUPRS is implemented in Python and utilizes the Hail framework for genomic data analysis. It offers two distinct approaches for PRS calculation: the Hail MatrixTable (MT) and the Hail Variant Dataset (VDS). The MT approach provides a dense representation of genotype data, while the VDS approach offers a sparse representation, significantly reducing computational costs. In performance evaluations, the VDS approach demonstrated a cost reduction of up to 99.51% for smaller scores and 85% for larger scores compared to the MT approach. Both approaches yielded similar predictive power, as shown by logistic regression analyses of PRS for coronary artery disease, atrial fibrillation, and type 2 diabetes. The empirical cumulative distribution functions (ECDFs) for PRS values further confirmed the consistency between the two methods.
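The dense versus sparse distinction can be illustrated with a minimal sketch in plain Python. This is not AoUPRS code and does not use the Hail API. The variant IDs, weights, sample names, and the functions `prs_dense` and `prs_sparse` are hypothetical. The sketch also assumes the effect allele is always the alternate allele, so homozygous-reference genotypes contribute nothing to the score and can be skipped.

```python
from typing import Dict, List, Tuple

# Hypothetical toy inputs (not from the paper): per-variant effect weights,
# with the effect allele assumed to be the alternate allele.
weights: Dict[str, float] = {
    "chr1:1000:A:G": 0.12,
    "chr2:2000:C:T": -0.05,
    "chr3:3000:G:A": 0.30,
}
samples: List[str] = ["s1", "s2", "s3"]

# Dense layout: one alternate-allele dosage per (variant, sample), zeros included.
dense: Dict[str, List[int]] = {
    "chr1:1000:A:G": [0, 1, 2],
    "chr2:2000:C:T": [0, 0, 1],
    "chr3:3000:G:A": [1, 0, 0],
}

# Sparse layout: only non-reference calls are stored as (variant, sample, dosage).
sparse: List[Tuple[str, str, int]] = [
    (variant, sample, dosage)
    for variant, dosages in dense.items()
    for sample, dosage in zip(samples, dosages)
    if dosage > 0
]


def prs_dense(dense: Dict[str, List[int]], weights: Dict[str, float],
              samples: List[str]) -> Dict[str, float]:
    """Sum weight * dosage over every cell of the dense matrix."""
    scores = {s: 0.0 for s in samples}
    for variant, dosages in dense.items():
        w = weights[variant]
        for sample, dosage in zip(samples, dosages):
            scores[sample] += w * dosage
    return scores


def prs_sparse(sparse: List[Tuple[str, str, int]], weights: Dict[str, float],
               samples: List[str]) -> Dict[str, float]:
    """Sum weight * dosage over stored non-reference calls only."""
    scores = {s: 0.0 for s in samples}
    for variant, sample, dosage in sparse:
        scores[sample] += weights[variant] * dosage
    return scores


dense_scores = prs_dense(dense, weights, samples)
sparse_scores = prs_sparse(sparse, weights, samples)
```

Both functions return the same per-sample scores, but the sparse version touches 4 stored calls instead of 9 matrix cells. In this toy setting the saving grows with the share of homozygous-reference genotypes, which is one plausible reason a sparse representation could be cheaper. The actual cost figures reported above come from the authors' evaluation, not from this sketch.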
AoUPRS addresses the high costs and inefficiencies of PRS calculation on the AoU dataset. By offering both dense (MT) and sparse (VDS) data processing approaches, it lets researchers choose the one best suited to their needs, facilitating genomic discoveries. The tool is open source on GitHub and comes with detailed documentation and tutorials, making it accessible and easy to use for the scientific community.