Marshall Business School, University of Southern California, Los Angeles, CA 90089, United States.
Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, CB3 0WB, United Kingdom.
Biometrics. 2024 Jul 1;80(3). doi: 10.1093/biomtc/ujae060.
The increasing availability and scale of biobanks and "omic" datasets open new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool for discovering genetic architectures from genome-wide association study (GWAS) summary data. It is based on a linear structural equation model in which traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of "signal" genes with those of "noise" genes. From the estimated genetic component, it then extracts genetic pathways via principal component and factor analysis, leveraging low-rank and sparse structure. In addition, we provide a bootstrap aggregating ("bagging") algorithm to improve stability under data perturbation and hyperparameter tuning. Applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene-trait clusters and suggests multiple new hypotheses for future investigation.
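The contrast-then-decompose idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, for illustration only, that the genetic component can be approximated by the difference between the second-moment matrices of the "signal" and "noise" association estimates, and that pathways correspond to its leading eigenvectors. The function names, the simulated data, and the omission of the sparse factor analysis and bagging steps are all assumptions of this sketch.

```python
import numpy as np


def estimate_genetic_component(beta_signal, beta_noise):
    """Contrast signal and noise associations (hypothetical helper).

    beta_signal, beta_noise: arrays of shape (n_variants, n_traits) holding
    GWAS summary coefficients. Assumption: the noise variants carry only the
    environmental component, so subtracting their second-moment matrix
    leaves an estimate of the genetic component.
    """
    cov_signal = beta_signal.T @ beta_signal / beta_signal.shape[0]
    cov_noise = beta_noise.T @ beta_noise / beta_noise.shape[0]
    return cov_signal - cov_noise


def top_eigenpairs(sym_matrix, rank):
    """Return the `rank` largest eigenvalues and their eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(sym_matrix)  # ascending order
    order = np.argsort(eigvals)[::-1][:rank]
    return eigvals[order], eigvecs[:, order]


def simulate_summary_stats(n_snps=5000, n_traits=6, rank=2, seed=0):
    """Simulate toy summary statistics with a low-rank genetic part."""
    rng = np.random.default_rng(seed)
    loadings = rng.normal(size=(n_traits, rank))           # trait loadings on pathways
    mixing = 0.3 * rng.normal(size=(n_traits, n_traits))   # correlated environmental part
    latent = rng.normal(size=(n_snps, rank))               # variant effects on pathways
    env_signal = rng.normal(size=(n_snps, n_traits)) @ mixing.T
    env_noise = rng.normal(size=(n_snps, n_traits)) @ mixing.T
    beta_signal = latent @ loadings.T + env_signal
    beta_noise = env_noise
    return beta_signal, beta_noise, loadings


def subspace_alignment(basis_a, basis_b):
    """Smallest cosine of the principal angles between two column spaces."""
    q_a, _ = np.linalg.qr(basis_a)
    q_b, _ = np.linalg.qr(basis_b)
    return np.linalg.svd(q_a.T @ q_b, compute_uv=False).min()


beta_signal, beta_noise, loadings = simulate_summary_stats()
genetic = estimate_genetic_component(beta_signal, beta_noise)
eigvals, pathways = top_eigenpairs(genetic, rank=2)
print("alignment with true loadings:", subspace_alignment(pathways, loadings))
```

In this toy setting an alignment value close to 1 indicates that the leading eigenvectors span roughly the same subspace as the simulated loadings. Real GWAS summary data would additionally require a rule for choosing signal and noise variants and for selecting the rank, neither of which is specified here.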