
Towards fine-scale population stratification modeling based on kernel principal component analysis and random forest.

Affiliation

School of Computers, Guangdong University of Technology, Guangzhou, China.

Publication information

Genes Genomics. 2021 Oct;43(10):1143-1155. doi: 10.1007/s13258-021-01057-4. Epub 2021 Jun 7.

DOI: 10.1007/s13258-021-01057-4
PMID: 34097252
Abstract

BACKGROUND

Population stratification modeling is essential in Genome-Wide Association Studies.

OBJECTIVE

In this paper, we aim to build a fine-scale population stratification model to efficiently infer individual genetic ancestry.

METHODS

Kernel principal component analysis (kernel PCA) and random forest are used to build the population stratification model, together with parameter optimization. We explore different PCA methods, including standard PCA and kernel PCA, to extract relevant features from genotype data converted by vcf2geno, a pipeline from the LASER software. The extracted features are fed into a random forest for ensemble learning. Parameter tuning jointly searches for the optimal number of principal components, the kernel function for PCA, and the parameters of the random forest.
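The workflow described above (dimensionality reduction, then a random forest, with joint parameter tuning) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the synthetic data stand in for a genotype matrix, and the feature scaling step, the grid values, and the 3-fold cross-validation are all assumptions made for illustration. The vcf2geno conversion is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a (samples x variants) genotype matrix with
# three population labels. This is NOT the HGDP data used in the paper.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scale -> reduce -> classify. The scaling step is an assumption of this
# sketch; the abstract does not say how features were preprocessed.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", KernelPCA(random_state=0)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Joint search over the reduction method, kernel function, number of
# components, and forest size. All grid values are illustrative.
param_grid = [
    {   # standard PCA baseline
        "reduce": [PCA()],
        "reduce__n_components": [2, 5, 10],
        "forest__n_estimators": [50, 100],
    },
    {   # kernel PCA with Gaussian (rbf) and sigmoid kernels
        "reduce": [KernelPCA(random_state=0)],
        "reduce__kernel": ["rbf", "sigmoid"],
        "reduce__n_components": [2, 5, 10],
        "forest__n_estimators": [50, 100],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Searching the reduction step and the classifier in one grid mirrors the joint tuning the abstract describes; which configuration wins on synthetic data says nothing about the paper's results.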

RESULTS

Experiments on the HGDP dataset show that kernel PCA with a sigmoid or Gaussian kernel function achieves higher prediction accuracy than standard PCA. Compared with standard PCA using two principal components, KPCA-Sigmoid with the optimal number of principal components improves accuracy by around 100% and 200% for East Asian and European populations, respectively.
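The improvement figures are easiest to read as relative gains over the baseline. That reading is an assumption, since the abstract does not define the measure, and the symbols $a_{\text{base}}$ (standard PCA accuracy) and $a_{\text{new}}$ (KPCA-Sigmoid accuracy) are introduced here only for illustration:

```latex
\Delta_{\text{rel}} = \frac{a_{\text{new}} - a_{\text{base}}}{a_{\text{base}}},
\qquad
\Delta_{\text{rel}} = 1 \;\Rightarrow\; a_{\text{new}} = 2\,a_{\text{base}},
\qquad
\Delta_{\text{rel}} = 2 \;\Rightarrow\; a_{\text{new}} = 3\,a_{\text{base}}.
```

Because accuracy cannot exceed 1, this reading would imply a baseline accuracy of at most 1/2 for the group with the 100% gain and at most 1/3 for the group with the 200% gain.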

CONCLUSION

With the optimal parameter configuration for both PCA and random forest, our proposed method can infer an individual's genetic ancestry more accurately from their variants.

Cited by

1
Machine Learning and Causal Approaches to Predict Readmissions and Its Economic Consequences Among Canadian Patients With Heart Disease: Retrospective Study.
JMIR Form Res. 2023 May 26;7:e41725. doi: 10.2196/41725.

References

1
Population Stratification in Genetic Association Studies.
Curr Protoc Hum Genet. 2017 Oct 18;95:1.22.1-1.22.23. doi: 10.1002/cphg.48.