Santafé Guzmán, Lozano Jose A, Larrañaga Pedro
Computer Science and Artificial Intelligence Department, University of the Basque Country, San Sebastian, Spain.
J Comput Biol. 2008 Mar;15(2):207-20. doi: 10.1089/cmb.2007.0051.
The analysis of population structure on the basis of genetic data is essential in population genetics. It is used, for instance, to study the evolution of species or to correct for population stratification in association studies. These genetic data, normally based on DNA polymorphisms, may contain irrelevant information that biases the inference of population structure. In this paper we adapt a recently proposed algorithm, named multi-start EMA, to the inference of population structure. This algorithm is able to deal with irrelevant information when obtaining the (probabilistic) population partition. Additionally, we present a marker selection test that identifies the markers most relevant to recovering that population partition. The proposed algorithm is compared with the widely used STRUCTURE software on the basis of the F(ST) metric and the log-likelihood score. It is shown that the proposed algorithm improves the recovery of the population structure. Moreover, information about relevant markers obtained by the multi-start EMA can be used to improve the results obtained by other methods, to correct for population stratification, or even to reduce the economic cost of sequencing new samples. The software presented in this paper is available online at http://www.sc.ehu.es/ccwbayes/members/guzman.
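As an illustration of the F(ST) metric mentioned above, the following is a minimal sketch of the textbook heterozygosity-based definition, F_ST = (H_T - H_S) / H_T, for a single biallelic marker. The abstract does not state which F(ST) estimator the paper uses, so this should be read as an assumption rather than the authors' method; the function name `fst_biallelic` and its parameters are hypothetical.

```python
from typing import Sequence


def fst_biallelic(freqs: Sequence[float], sizes: Sequence[int]) -> float:
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic marker.

    freqs: frequency of one allele in each subpopulation (assumed known).
    sizes: sample size of each subpopulation, used as weights.
    """
    total = sum(sizes)
    weights = [n / total for n in sizes]

    # Pooled allele frequency across all subpopulations.
    p_bar = sum(w * p for w, p in zip(weights, freqs))

    # Expected heterozygosity of the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)

    # Weighted mean expected heterozygosity within subpopulations.
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))

    # A marker that is fixed everywhere carries no differentiation signal.
    if h_t == 0:
        return 0.0
    return (h_t - h_s) / h_t
```

Under this definition, a marker with identical allele frequencies in every subpopulation yields 0, and a marker fixed for different alleles in different subpopulations yields 1, which is why higher values are read as stronger population differentiation.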