Sng Letitia M F, Kaphle Anubhav, O'Brien Mitchell J, Hosking Brendan, Reguant Roc, Verjans Johan, Jain Yatish, Twine Natalie A, Bauer Denis C
Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, New South Wales, Australia.
Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Melbourne, Victoria, Australia.
Sci Rep. 2025 Mar 25;15(1):10335. doi: 10.1038/s41598-025-95286-2.
We conducted the first comprehensive association analysis of a coronary artery disease (CAD) cohort within the recently released UK Biobank (UKB) whole genome sequencing dataset. We employed the fine-mapping tool PolyFun and pinpointed rs10757274 as the most likely causal single-nucleotide variant (SNV) within the 9p21.3 CAD risk locus. Notably, we show that the machine-learning (ML) approaches REGENIE and VariantSpark exhibited greater sensitivity than traditional single-SNV logistic regression, uncovering rs28451064, a known risk locus in 21q22.11. Our findings underscore the utility of advanced computational techniques and cloud-based resources for mega-biobank analyses. In line with the paradigm shift of bringing compute to data, we demonstrate a 44% cost reduction and a 94% speedup through compute architecture optimisation on UK Biobank's Research Analysis Platform using our RAPpoet approach. We discuss three considerations for researchers implementing novel workflows for datasets hosted on cloud platforms, to pave the way for harnessing mega-biobank-sized data through scalable, cost-effective cloud computing solutions.