College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
Hum Genomics. 2018 May 9;12(1):25. doi: 10.1186/s40246-018-0156-4.
The analysis of population structure has many applications in medical and population genetic research. It provides insight into the underlying genetic substructure of a population and is a crucial prerequisite for any analysis of genetic data. The analysis involves grouping individuals into subpopulations based on shared genetic variation. The most widely used markers for studying DNA sequence variation between populations are single nucleotide polymorphisms (SNPs). Data preprocessing is a necessary step to assess the quality of the data and to determine which markers and individuals can reasonably be included in the analysis. After preprocessing, several methods can be used to uncover population substructure. These fall into two broad approaches: parametric and nonparametric. Parametric approaches use statistical models to infer population structure and assign individuals to subpopulations. However, they have several drawbacks that make them impractical for large datasets. Nonparametric approaches do not share these drawbacks, which makes them more viable for analyzing large datasets, and they are increasingly used to reveal population substructure. This paper therefore reviews and discusses the nonparametric approaches available for population structure analysis, along with some implications for resolving the remaining challenges.
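The workflow outlined in the abstract (preprocess the markers, then group individuals without fitting a statistical model) can be illustrated with a minimal sketch. It is not taken from the paper. It assumes genotypes coded as minor-allele counts (0, 1, 2), uses a minor allele frequency filter as the only preprocessing step, and uses allele sharing distance as the dissimilarity measure. The function names, the 0.05 threshold, and the toy genotypes are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def filter_by_maf(genotypes, min_maf=0.05):
    """Drop SNP columns whose minor allele frequency is below min_maf."""
    n_individuals = len(genotypes)
    keep = []
    for j in range(len(genotypes[0])):
        # Each individual carries two alleles, hence the factor of 2.
        freq = sum(row[j] for row in genotypes) / (2 * n_individuals)
        if min(freq, 1 - freq) >= min_maf:
            keep.append(j)
    return [[row[j] for j in keep] for row in genotypes]


def allele_sharing_distance(a, b):
    """Mean fraction of alleles not shared by two individuals (0 to 1)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * len(a))


def split_into_two_groups(genotypes):
    """Seed two groups with the most distant pair of individuals, then
    assign every individual to the nearer seed. No model is fitted."""
    pairs = combinations(range(len(genotypes)), 2)
    seed_a, seed_b = max(
        pairs,
        key=lambda p: allele_sharing_distance(genotypes[p[0]], genotypes[p[1]]),
    )
    labels = []
    for row in genotypes:
        d_a = allele_sharing_distance(row, genotypes[seed_a])
        d_b = allele_sharing_distance(row, genotypes[seed_b])
        labels.append(0 if d_a <= d_b else 1)
    return labels


# Made-up toy data: 6 individuals x 5 SNPs. The last SNP is monomorphic
# and carries no information, so the filter should remove it.
toy_genotypes = [
    [0, 0, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [0, 0, 0, 1, 2],
    [2, 2, 1, 2, 2],
    [2, 1, 2, 2, 2],
    [2, 2, 2, 1, 2],
]

filtered = filter_by_maf(toy_genotypes)
groups = split_into_two_groups(filtered)
print(len(filtered[0]), groups)  # prints: 4 [0, 0, 0, 1, 1, 1]
```

On real data the quality control step would normally also cover missingness per marker and per individual, and the grouping step would use a more robust procedure than two fixed seeds. The sketch only shows the shape of a distance-based, model-free analysis.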