Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae290.
Large-sample datasets are regarded as the primary basis for new discoveries and as the solution to missing heritability in genome-wide association studies. However, their computational complexity prevents analyses from considering all effects and all polygenic backgrounds, which reduces the effectiveness of large datasets. To address these challenges, we included all effects and polygenic backgrounds in a mixed logistic model for binary traits and compressed its four variance components into two. We combined the compressed model with three computational algorithms, tailored to sample size, computational speed, and reduced memory requirements, to develop a new method for large-scale data analysis, called FastBiCmrMLM. To mine additional genes, linkage disequilibrium markers were replaced by bin-based haplotypes, which were then analyzed by FastBiCmrMLM; this approach is named FastBiCmrMLM-Hap. Simulation studies showed that FastBiCmrMLM outperformed GMMAT, SAIGE and fastGWA-GLMM in identifying dominant variants, variants with small α (allele substitution effect), and rare variants. In a UK Biobank-scale dataset, FastBiCmrMLM detected variants as small as 0.03% and with α ≈ 0. In re-analyses of seven diseases in the WTCCC datasets, 29 candidate genes with both functional and TWAS evidence were found around 36 variants that were identified only by the new methods, which strongly supports the validity of these methods. The methods offer a new way to decipher the genetic architecture of binary traits and to address the challenges outlined above.