Schmitz Carley Cari A, Coombs Joseph J, Douches David S, Bethke Paul C, Palta Jiwan P, Novy Richard G, Endelman Jeffrey B
Department of Horticulture, University of Wisconsin, Madison, WI, 53706, USA.
Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, 48824, USA.
Theor Appl Genet. 2017 Apr;130(4):717-726. doi: 10.1007/s00122-016-2845-5. Epub 2017 Jan 9.
New software to make tetraploid genotype calls from SNP array data was developed, which uses hierarchical clustering and multiple F1 populations to calibrate the relationship between signal intensity and allele dosage. SNP arrays are transforming breeding and genetics research for autotetraploids. To fully utilize these arrays, the relationship between signal intensity and allele dosage must be calibrated for each marker. We developed an improved computational method to automate this process, which is provided as the R package ClusterCall. In the training phase of the algorithm, hierarchical clustering within an F1 population is used to group samples with similar intensity values, and allele dosages are assigned to clusters based on expected segregation ratios. In the prediction phase, multiple F1 populations and the prediction set are clustered together, and the genotype assigned to each cluster is the mode of its training set samples. A concordance metric, defined as the proportion of training set samples whose genotype equals the mode, can be used to eliminate unreliable markers and to compare different algorithms. Across three potato families genotyped with an 8K SNP array, ClusterCall scored 5729 markers with at least 0.95 concordance (94.6% of its total), compared to 5325 with the software fitTetra (82.5% of its total). The three families were used to predict genotypes for 5218 SNPs in the SolCAP diversity panel, compared with 3521 SNPs in a previous study in which genotypes were called manually. One of the additional markers produced a significant association for vine maturity near a well-known causal locus on chromosome 5. In conclusion, when multiple F1 populations are available, ClusterCall is an efficient method for accurate autotetraploid genotype calling that enables the use of SNP data for research and plant breeding.
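The prediction phase and the concordance metric described in the abstract can be illustrated with a minimal Python sketch. This is not the ClusterCall API (ClusterCall is an R package) and not the authors' implementation. The function name `call_clusters`, the dictionary inputs, and the toy data are hypothetical. The sketch assumes cluster labels have already been produced by a clustering step, and it makes two choices the abstract does not specify: a cluster with no training samples gets no call (`None`), and ties for the mode are broken by first occurrence.

```python
from collections import Counter, defaultdict


def call_clusters(cluster_ids, training_dosage):
    """Assign a dosage to every sample from the mode of its cluster.

    cluster_ids: dict mapping sample name -> cluster label, covering the
        training samples and the prediction samples clustered together.
    training_dosage: dict mapping training sample name -> allele dosage
        (0 to 4 for a tetraploid), as assigned in the training phase.
    Returns (calls, concordance).
    """
    if not training_dosage:
        raise ValueError("at least one training sample is required")

    # Collect the training dosages that fall in each cluster.
    dosages_by_cluster = defaultdict(list)
    for sample, dosage in training_dosage.items():
        dosages_by_cluster[cluster_ids[sample]].append(dosage)

    # Genotype of a cluster = mode of its training samples.
    # Assumption: ties go to the dosage seen first.
    cluster_call = {
        cluster: Counter(dosages).most_common(1)[0][0]
        for cluster, dosages in dosages_by_cluster.items()
    }

    # Assumption: clusters without training samples receive no call.
    calls = {s: cluster_call.get(c) for s, c in cluster_ids.items()}

    # Concordance = proportion of training samples equal to the mode.
    agree = sum(
        1 for s, d in training_dosage.items()
        if cluster_call[cluster_ids[s]] == d
    )
    concordance = agree / len(training_dosage)
    return calls, concordance


# Toy data (hypothetical): five training samples, three prediction samples.
cluster_ids = {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "t5": 1,
               "p1": 0, "p2": 1, "p3": 2}
training_dosage = {"t1": 1, "t2": 1, "t3": 2, "t4": 3, "t5": 3}

calls, concordance = call_clusters(cluster_ids, training_dosage)
print(calls["p1"], calls["p2"], calls["p3"], concordance)
```

In this toy case one of the five training samples disagrees with the mode of its cluster, so the marker's concordance is 0.8. Under the 0.95 threshold quoted in the abstract, such a marker would be filtered out.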