School of Computers, Guangdong University of Technology, Guangzhou, China.
Genes Genomics. 2021 Oct;43(10):1143-1155. doi: 10.1007/s13258-021-01057-4. Epub 2021 Jun 7.
Population stratification modeling is essential in Genome-Wide Association Studies.
In this paper, we aim to build a fine-scale population stratification model to efficiently infer individual genetic ancestry.
Kernel principal component analysis (kernel PCA) and random forest are used to build the population stratification model, together with parameter optimization. We compare standard PCA and kernel PCA for extracting relevant features from genotype data converted by vcf2geno, a pipeline from the LASER software. The extracted features are fed into a random forest for ensemble learning. Parameter tuning jointly selects the number of principal components, the kernel function for PCA, and the parameters of the random forest.
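The workflow described above (dimensionality reduction, a random forest on the extracted components, and a joint parameter search) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name an implementation, so scikit-learn is assumed here; the genotype matrix is simulated rather than produced by vcf2geno from HGDP data; and the function names and grid values are hypothetical, not the authors' configuration.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def make_toy_genotypes(n_per_pop=40, n_snps=200, seed=0):
    """Simulate 0/1/2 genotype counts for three hypothetical populations.

    Stand-in for a real genotype matrix; not HGDP data.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for label in range(3):
        # Each population gets its own allele frequency per SNP.
        freqs = rng.uniform(0.05, 0.95, size=n_snps)
        blocks.append(rng.binomial(2, freqs, size=(n_per_pop, n_snps)))
        labels.append(np.full(n_per_pop, label))
    return np.vstack(blocks).astype(float), np.concatenate(labels)


def tune_ancestry_model(X, y, cv=3):
    """Jointly search the reducer, its settings, and the forest size."""
    pipeline = Pipeline([
        ("reduce", PCA()),  # placeholder step, swapped by the grid below
        ("forest", RandomForestClassifier(random_state=0)),
    ])
    # Illustrative grid only; the values are assumptions for this sketch.
    param_grid = [
        {
            "reduce": [PCA()],
            "reduce__n_components": [2, 5, 10],
            "forest__n_estimators": [50, 100],
        },
        {
            "reduce": [KernelPCA()],
            "reduce__kernel": ["sigmoid", "rbf"],  # "rbf" is the Gaussian kernel
            "reduce__n_components": [2, 5, 10],
            "forest__n_estimators": [50, 100],
        },
    ]
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search


if __name__ == "__main__":
    X, y = make_toy_genotypes()
    search = tune_ancestry_model(X, y)
    print(search.best_params_)
    print(round(search.best_score_, 3))
```

Placing the reducer inside the pipeline means the principal components are refit on each training fold, so the cross-validated accuracy is not inflated by information from held-out individuals.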
Experiments on the HGDP dataset show that kernel PCA with a sigmoid or Gaussian kernel function achieves higher prediction accuracy than standard PCA. Compared with standard PCA using the first two principal components, KPCA-Sigmoid with the optimal number of principal components improves accuracy by around 100% and 200% for East Asian and European populations, respectively.
With the optimal parameter configuration for both PCA and the random forest, our proposed method infers individual genetic ancestry from an individual's variants more accurately.