Graffelman Jan, Galván Femenía Iván, de Cid Rafael, Barceló Vidal Carles
Department of Statistics and Operations Research, Technical University of Catalonia, Barcelona, Spain.
Department of Biostatistics, University of Washington, Seattle, WA, United States.
Front Genet. 2019 Apr 24;10:341. doi: 10.3389/fgene.2019.00341. eCollection 2019.
The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele-sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores that allele-sharing data across individuals are in reality higher-dimensional, and it does not account for the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are particularly useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied iteratively, acting as a magnifying glass for more remote relationships that are harder to classify. Datasets from the 1000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with that of the classical plots in a simulation study. In a non-inbred homogeneous population, the classification rate of the log-ratio principal component approach exceeds that of the classical graphics across the whole allele frequency spectrum, using only identity-by-state. Under these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree.
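As a rough illustration of the log-ratio principal component analysis named in the abstract, the following minimal sketch applies a centred log-ratio (clr) transform to compositional counts and then extracts principal components by singular value decomposition. It is not the authors' implementation. It assumes that each pair of individuals is summarised by the number of variants sharing 0, 1, or 2 alleles identical by state; the function names, the pseudocount used to handle zero counts, and the count values themselves are hypothetical and chosen only for illustration.

```python
import numpy as np


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of each row of a count matrix.

    The pseudocount is an assumption made here to avoid log(0);
    the abstract does not say how zero counts are handled.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    # Subtract each row's mean log, i.e. divide by the row's geometric mean.
    return logx - logx.mean(axis=1, keepdims=True)


def logratio_pca(counts, pseudocount=0.5):
    """Log-ratio PCA: clr transform, column centring, then SVD."""
    z = clr(counts, pseudocount)
    zc = z - z.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(zc, full_matrices=False)
    scores = u * s                    # row (pair) coordinates for a biplot
    loadings = vt.T                   # column (count category) directions
    explained = s**2 / np.sum(s**2)   # share of variance per component
    return scores, loadings, explained


# Made-up counts for five hypothetical pairs of individuals.
# Columns: variants sharing 0, 1, 2 alleles IBS; each row sums to 35,000.
ibs_counts = np.array([
    [1200, 14800, 19000],
    [   0, 15500, 19500],
    [ 300, 12000, 22700],
    [1500, 15000, 18500],
    [ 900, 14000, 20100],
])

scores, loadings, explained = logratio_pca(ibs_counts)
print(scores[:, :2])   # first two principal coordinates of each pair
print(explained)       # proportion of variance per component
```

Because clr-transformed rows sum to zero, a composition with three parts yields at most two non-trivial components, so the last entry of `explained` is numerically zero. Classifying relationship degrees, as described in the abstract, would additionally require a discriminant analysis on the scores, which is not shown here.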