Towards fine-scale population stratification modeling based on kernel principal component analysis and random forest.

Affiliation

School of Computers, Guangdong University of Technology, Guangzhou, China.

Publication information

Genes Genomics. 2021 Oct;43(10):1143-1155. doi: 10.1007/s13258-021-01057-4. Epub 2021 Jun 7.

Abstract

BACKGROUND

Population stratification modeling is essential in Genome-Wide Association Studies.

OBJECTIVE

In this paper, we aim to build a fine-scale population stratification model to efficiently infer individual genetic ancestry.

METHODS

Kernel principal component analysis (kernel PCA, KPCA) and random forest are combined to build the population stratification model, together with parameter optimization. We explore different PCA methods, including standard PCA and kernel PCA, to extract relevant features from genotype data converted by vcf2geno, a pipeline from the LASER software. The extracted features are fed into a random forest for ensemble learning. Parameter tuning jointly searches for the optimal number of principal components, the kernel function for PCA, and the parameters of the random forest.
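
The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the use of scikit-learn, the synthetic genotype matrix, the parameter grid, and all variable names are assumptions made for the example, and the vcf2geno conversion step is not reproduced. The Gaussian kernel is represented by scikit-learn's "rbf" kernel.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a genotype matrix (assumption, not HGDP data):
# rows are individuals, columns are variants coded as 0/1/2 allele counts.
rng = np.random.default_rng(0)
n_per_pop, n_variants = 40, 200
pop_freqs = rng.uniform(0.1, 0.9, size=(3, n_variants))  # one frequency vector per population
X = np.vstack(
    [rng.binomial(2, f, size=(n_per_pop, n_variants)) for f in pop_freqs]
).astype(float)
y = np.repeat(np.arange(3), n_per_pop)  # population label of each individual

# Feature extraction followed by a random forest classifier.
pipe = Pipeline([
    ("reduce", PCA()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Joint search over the PCA variant, kernel, number of components,
# and forest size. The grid values are illustrative only.
param_grid = [
    {
        "reduce": [PCA()],
        "reduce__n_components": [2, 5, 10],
        "forest__n_estimators": [50, 100],
    },
    {
        "reduce": [KernelPCA()],
        "reduce__kernel": ["rbf", "sigmoid"],
        "reduce__n_components": [2, 5, 10],
        "forest__n_estimators": [50, 100],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # winning configuration on this toy data
best_score = search.best_score_    # its mean cross-validated accuracy
print(best_params, best_score)
```

Putting both steps in one pipeline lets the cross-validation refit the projection on each training fold, so the number of components, the kernel, and the forest settings are scored together rather than tuned one at a time.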

RESULTS

Experiments on the HGDP dataset show that kernel PCA with a Sigmoid or Gaussian kernel function achieves higher prediction accuracy than standard PCA. Compared with standard PCA using two principal components, KPCA-Sigmoid with the optimal number of principal components improves accuracy by around 100% and 200% for East Asian and European populations, respectively.

CONCLUSION

With the optimal parameter configuration for both PCA and random forest, our proposed method can infer an individual's genetic ancestry more accurately from their variants.
