Biscarini Filippo, Schwarzenbacher Hermann, Pausch Hubert, Nicolazzi Ezequiel L, Pirola Yuri, Biffani Stefano
Department of Bioinformatics and Biostatistics, PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, 26900, Italy.
ZuchtData, Dresdner Straße 89/19, Wien, A-1200, Austria.
BMC Genomics. 2016 Nov 3;17(1):857. doi: 10.1186/s12864-016-3218-9.
Single nucleotide polymorphism (SNP) genotype data are increasingly available in cattle populations and can be used, among other things, to predict carriers of specific mutations. A practical statistical method for accurately classifying individuals as carriers or non-carriers is therefore useful. In this paper, we used cross-validation to compare five classification models for identifying carriers of the TUBD1 recessive mutation on BTA19 (Bos taurus autosome 19), which is known to be associated with high calf mortality: Lasso-penalized logistic regression (Lasso), Support Vector Machines with either a linear or a radial kernel (SVML and SVMR), k-nearest neighbors (KNN), and multi-allelic gene prediction (MAG). A population of 3116 Fleckvieh and 392 Brown Swiss animals genotyped with the 54K SNP-chip was available for the analysis.
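As a rough illustration of the cross-validation comparison described above, the following sketch estimates the held-out error rate of a k-nearest-neighbors classifier on SNP genotypes coded as 0/1/2 allele counts. It covers only one of the five models, and everything specific in it is an assumption made for illustration: the toy genotypes, the Manhattan distance, the choice of k = 3, the five folds, and the function names `knn_predict` and `cross_validated_error`. None of these details are given in the abstract.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training rows."""
    # Manhattan distance between genotype vectors coded as 0/1/2 allele counts
    # (the distance measure is an assumption, not taken from the paper).
    dists = sorted(
        (sum(abs(a - b) for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cross_validated_error(X, y, n_folds=5, k=3):
    """Return the misclassification rate averaged over n_folds held-out folds."""
    n = len(X)
    fold_errors = []
    for f in range(n_folds):
        # Animal i is held out in fold i % n_folds; all others form the training set.
        test_idx = [i for i in range(n) if i % n_folds == f]
        train_idx = [i for i in range(n) if i % n_folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        wrong = sum(
            knn_predict(train_X, train_y, X[i], k) != y[i] for i in test_idx
        )
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / n_folds


# Synthetic toy data (hypothetical): the first two SNPs tag the carrier
# haplotype, the third SNP is unrelated to carrier status.
X, y = [], []
for i in range(20):
    carrier = i % 2
    X.append([carrier, carrier, i % 3])
    y.append(carrier)

error = cross_validated_error(X, y)
print(f"cross-validated error rate: {error:.3f}")
```

Because the toy carriers are perfectly tagged by two SNPs, the estimated error here is zero; real genotype data would be noisier and far higher-dimensional.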
In general, SNP genotypes proved to be very effective for the identification of mutation carriers. The best predictive models were Lasso, SVML and MAG, with average error rates of 0.2 %, 0.4 % and 0.6 %, respectively, in Fleckvieh, and 1.2 %, 0.9 % and 1.7 % in Brown Swiss. For these three models, the false positive rate was, respectively, 0.1 %, 0.1 % and 0.2 % in Fleckvieh, and 3.0 %, 2.4 % and 1.6 % in Brown Swiss; the false negative rate was 4.4 %, 7.6 % and 1.0 % in Fleckvieh, and 0.0 %, 0.1 % and 0.8 % in Brown Swiss. MAG appeared to be more robust to sample size reduction: with 25 % of the data, its average error rate was 0.7 % in Fleckvieh and 2.2 % in Brown Swiss, compared to 2.1 % and 5.5 % with Lasso, and 2.6 % and 12.0 % with SVML.
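The three rates reported above can be related to one another with a small worked example. The sketch below assumes the standard definitions (error rate over all animals, false positive rate over true non-carriers, false negative rate over true carriers); the abstract does not spell these out, and the counts used are hypothetical, not the study's data. The function name `classification_rates` is likewise illustrative.

```python
def classification_rates(true_labels, predicted_labels):
    """Compute error, false positive and false negative rates (1 = carrier).

    Assumes both classes are present, so neither denominator is zero.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,         # share of all animals misclassified
        "false_positive_rate": fp / (fp + tn),   # non-carriers flagged as carriers
        "false_negative_rate": fn / (fn + tp),   # carriers that were missed
    }


# Hypothetical counts: 1000 animals, 50 of them carriers;
# 2 carriers are missed and 1 non-carrier is wrongly flagged.
true_labels = [1] * 50 + [0] * 950
predicted_labels = [1] * 48 + [0] * 2 + [1] * 1 + [0] * 949

rates = classification_rates(true_labels, predicted_labels)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

With carriers making up a small share of the population, a handful of missed carriers yields a false negative rate of several percent while the overall error rate stays well below one percent, which is the kind of pattern seen in the Fleckvieh results.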
SNP genotypes are an effective and efficient means of identifying mutation carriers in cattle populations. Very few misclassifications were observed, both overall and within the carrier and non-carrier classes, which indicates that the approach is reliable for potential applications in cattle breeding.