Department of Plant Breeding/Universidad Autónoma Agraria Antonio Narro, Saltillo, Coahuila, Mexico.
International Maize and Wheat Improvement Center (CIMMYT), Mexico City, Mexico.
PLoS One. 2018 Feb 28;13(2):e0193346. doi: 10.1371/journal.pone.0193346. eCollection 2018.
Germplasm banks are growing in importance, in number of accessions, and in the amount of characterization data they hold, with a strong emphasis on molecular genetic markers. In this work, we offer an integrated view of accessions and marker data within an information theory framework. The basis of this development is the mutual information between accessions and allele frequencies at molecular marker loci, which can be decomposed into allele specificities, as well as into the rarity and divergence of accessions. Formulas are provided to calculate the specificity of each marker allele with reference to its distribution across accessions; accession rarity, defined as the weighted average of the specificity of an accession's alleles; and accession divergence, defined by the Kullback-Leibler formula. Although rarity and divergence are different measures, we demonstrate that their averages are equal for any collection. These parameters can contribute to knowledge of the structure of a germplasm collection and support decisions about the preservation of rare variants. The concepts developed here served as the basis for a core subset selection strategy called HCore, implemented in a publicly available R script. As a proof of concept, the mathematical view and tools developed in this research were applied to a large collection of Mexican wheat accessions extensively characterized with SNP markers. The most specific alleles were found to be private to a single accession, and the distribution of specificity had its highest frequencies at low values. Accession rarity and divergence had largely symmetrical distributions and showed a positive, though not strictly linear, relationship. Compared with three state-of-the-art methods for core subset selection, HCore was superior in average divergence and rarity, mean genetic distance, and diversity. The proposed approach can be used for knowledge extraction and decision making in germplasm collections of diploid, inbred or outbred species.
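The relationship among specificity, rarity and divergence can be illustrated with a small numerical sketch. This is not the HCore script. It is a minimal Python example for a single locus, assuming equal weights for all accessions and assuming that allele specificity takes the form of the average of (p_ij / p_i) log2(p_ij / p_i) across accessions, where p_ij is the frequency of allele i in accession j and p_i is its mean frequency in the collection. The function name `info_measures` and the toy frequencies are hypothetical.

```python
import numpy as np

def info_measures(freqs):
    """Specificity, rarity and divergence for one locus (illustrative sketch).

    freqs: array of shape (accessions, alleles); each row sums to 1.
    Assumes equal accession weights and that every allele occurs in at
    least one accession (so no mean frequency is zero).
    """
    p = np.asarray(freqs, dtype=float)
    p_bar = p.mean(axis=0)            # mean frequency of each allele
    ratio = p / p_bar                 # p_ij / p_i
    # Use the convention 0 * log(0) = 0 for alleles absent from an accession.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(p > 0, np.log2(ratio), 0.0)
    specificity = (ratio * log_ratio).mean(axis=0)   # one value per allele
    divergence = (p * log_ratio).sum(axis=1)         # Kullback-Leibler, per accession
    rarity = p @ specificity                         # frequency-weighted specificity
    return specificity, rarity, divergence

# Toy collection: four accessions, one biallelic locus.
# Accession 3 (row index 2) is heterozygous or heterogeneous.
freqs = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
])
specificity, rarity, divergence = info_measures(freqs)

print("specificity per allele:  ", specificity)
print("rarity per accession:    ", rarity)
print("divergence per accession:", divergence)
print("mean rarity == mean divergence:",
      np.isclose(rarity.mean(), divergence.mean()))
```

Under these assumptions, rarity and divergence differ for individual accessions, but both averages equal the mutual information between accessions and alleles, in line with the equality stated in the abstract. An allele private to one of t accessions reaches the maximum specificity of log2(t) bits in this formulation.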