On rare variants in principal component analysis of population stratification.

Authors

Ma Shengqing, Shi Gang

Affiliation

State Key Laboratory of Integrated Services Networks, Xidian University, 2 South Taibai Road, Xi'an, 710071, Shaanxi, China.

Publication

BMC Genet. 2020 Mar 17;21(1):34. doi: 10.1186/s12863-020-0833-x.

Abstract

BACKGROUND

Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. Principal component analysis (PCA) is widely applied to the analysis of population structure with common variants. However, how well it performs when rare variants are used remains unclear.

RESULTS

We derive the mathematical expectation of the genetic relationship matrix. The variance and covariance elements of the expected matrix depend explicitly on the allele frequencies of the genetic markers used in the PCA. We show that inter-population variance is contained solely in K principal components (PCs), and mostly in the largest K-1 PCs, where K is the number of populations in the sample. As measures of population divergence, we propose F, the ratio of inter-population variance to intra-population variance in the K population-informative PCs, and d, the sum of squared distances among populations. We show analytically that as allele frequencies become small, the ratio F falls, the population distance d decreases, and the proportion of variance explained by the K PCs diminishes. These results are validated in an analysis of the 1000 Genomes Project data. With common variants of allele frequencies between 0.4 and 0.5, the ratio F is 93.85, the population distance d is 444.38, and the variance explained by the largest five PCs is 17.09%. With rare variants of frequencies between 0.0001 and 0.01, the ratio, distance and percentage decrease to 1.83, 17.83 and 0.74%, respectively.
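The quantities named above can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: genotypes are standardized as (x - 2p) / sqrt(2p(1 - p)), a common convention for the genetic relationship matrix; F is read as between-population over within-population sum of squares on the top K PC scores; and d is read as the sum of squared Euclidean distances between population centroids in those PCs. The function names and the toy genotype matrix are hypothetical and are not taken from the paper.

```python
import numpy as np


def genetic_relationship_matrix(genotypes):
    """Build a GRM from an (n_samples, n_markers) array of 0/1/2 allele counts.

    Assumption: markers are standardized by their sample allele frequency,
    which may differ from the normalization used in the paper.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                      # sample allele frequency
    keep = (p > 0.0) & (p < 1.0)                  # drop monomorphic markers
    x, p = x[:, keep], p[keep]
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ z.T / z.shape[1]


def pca_summary(grm, labels, k):
    """Return (variance explained, F, d) for the top k PCs.

    Assumption: F and d follow one plausible reading of the abstract,
    not necessarily the paper's exact definitions.
    """
    labels = np.asarray(labels)
    eigvals, eigvecs = np.linalg.eigh(grm)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # remove tiny negative round-off
    eigvecs = eigvecs[:, order]

    scores = eigvecs[:, :k] * np.sqrt(eigvals[:k])
    explained = eigvals[:k].sum() / eigvals.sum()

    pops = np.unique(labels)
    centroids = np.array([scores[labels == g].mean(axis=0) for g in pops])
    sizes = np.array([(labels == g).sum() for g in pops])
    grand_mean = scores.mean(axis=0)

    between = (sizes[:, None] * (centroids - grand_mean) ** 2).sum()
    within = sum(
        ((scores[labels == g] - centroids[i]) ** 2).sum()
        for i, g in enumerate(pops)
    )
    f_ratio = between / within

    diffs = centroids[:, None, :] - centroids[None, :, :]
    d = (diffs ** 2).sum() / 2.0                  # count each population pair once
    return explained, f_ratio, d


# Toy data invented for illustration: 6 samples, 4 markers, 2 populations.
genotypes = np.array([
    [2, 2, 0, 1],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 2, 2],
    [1, 0, 1, 0],
])
labels = np.array(["A", "A", "A", "B", "B", "B"])

grm = genetic_relationship_matrix(genotypes)
explained, f_ratio, d = pca_summary(grm, labels, k=2)
print(f"explained={explained:.3f}  F={f_ratio:.3f}  d={d:.3f}")
```

Running the same two functions on marker sets drawn from different allele-frequency bins would give the kind of comparison the abstract reports, although a toy matrix of this size says nothing about the reported values.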

CONCLUSIONS

PCA of population stratification performs worse with rare variants than with common ones. When analyzing population stratification with sequencing data, it is necessary to restrict the selection to common variants.
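Restricting a marker set to common variants can be sketched as a minor allele frequency (MAF) filter applied before PCA. The 0.05 threshold, the function name `filter_by_maf`, and the toy genotypes are assumptions made for illustration; the abstract does not specify a cutoff.

```python
import numpy as np


def filter_by_maf(genotypes, min_maf=0.05):
    """Keep markers whose minor allele frequency is at least min_maf.

    genotypes: (n_samples, n_markers) array of 0/1/2 allele counts.
    Assumption: min_maf=0.05 is an illustrative default, not the paper's value.
    """
    x = np.asarray(genotypes, dtype=float)
    freq = x.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)   # fold so the rarer allele is counted
    keep = maf >= min_maf
    return x[:, keep], keep


# Toy data invented for illustration: 20 samples, 4 markers.
genotypes = np.zeros((20, 4), dtype=int)
genotypes[0, 0] = 1        # marker 0: one heterozygote, frequency 1/40 (rare)
genotypes[:10, 1] = 1      # marker 1: ten heterozygotes, frequency 10/40 (common)
genotypes[:, 2] = 2        # marker 2: nearly fixed for the counted allele...
genotypes[0, 2] = 1        # ...so its minor allele frequency is 1/40 (rare)
# marker 3 stays all zeros (monomorphic)

common, keep = filter_by_maf(genotypes)
print(keep, common.shape)
```

Folding the frequency with `np.minimum` matters here: marker 2 has a counted-allele frequency near 1, yet it is as uninformative for this purpose as marker 0.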
