The effect of mislabeled phenotypic status on the identification of mutation-carriers from SNP genotypes in dairy cattle.

Authors

Biffani Stefano, Pausch Hubert, Schwarzenbacher Hermann, Biscarini Filippo

Affiliations

IBBA-CNR, Via Einstein-Loc. Cascina Codazza, 26900, Lodi, Italy.

AIA: Associazione Italiana Allevatori, Via Giuseppe Tomassetti 9, 00161, Rome, Italy.

Publication information

BMC Res Notes. 2017 Jun 26;10(1):230. doi: 10.1186/s13104-017-2540-x.

Abstract

BACKGROUND

Statistical and machine learning applications are increasingly popular in animal breeding and genetics, especially to compute genomic predictions for phenotypes of interest. Noise (errors) in the data may have a negative impact on the accuracy of predictions. The effects of noisy data have been investigated in genome-wide association studies for case-control experiments, and in genomic predictions for binary traits in plants. No studies have been published yet on the impact of noisy data in animal genomics. In this work, the susceptibility to noise of five classification models was tested: Lasso-penalised logistic regression (Lasso), K-nearest neighbours (KNN), random forest (RF), and support vector machines with a linear (SVML) or radial (SVMR) kernel. As an illustration, the identification of carriers of a recessive mutation in cattle (Bos taurus) was used. A population of 3116 Fleckvieh animals with SNP genotypes on the same chromosome as the mutation locus (BTA 19) was available. The carrier status (0/1 phenotype) was randomly sampled to generate noise. Increasing proportions of noise, up to 20%, were introduced into the data.
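The noise-generation step can be illustrated with a minimal sketch. It assumes that "noise" means flipping the 0/1 carrier label of a randomly chosen fraction of animals; the abstract does not spell out the exact sampling scheme, so this is one plausible reading rather than the authors' procedure. The function name `inject_label_noise`, the toy population size, and the 5% carrier frequency are all hypothetical.

```python
import random


def inject_label_noise(labels, proportion, seed=None):
    """Return a copy of `labels` with a random `proportion` of 0/1 entries flipped.

    Assumption: noise is modelled as label flipping; the abstract only says
    that carrier status was "randomly sampled to generate noise".
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(proportion * len(labels))
    # Choose distinct positions so exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical toy data: 1000 animals, 50 carriers (not the study's real frequency).
labels = [1] * 50 + [0] * 950
noisy = inject_label_noise(labels, 0.20, seed=42)
n_changed = sum(a != b for a, b in zip(labels, noisy))
print(f"{n_changed} of {len(labels)} labels flipped")
```

Calling the function repeatedly with increasing values of `proportion` (for example 0.025, 0.05, and so on up to 0.20) would mimic the increasing noise levels described above.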

RESULTS

SVMR and Lasso were relatively more robust to noise in the data, with total accuracy still above 0.975 and TPR (true positive rate; accuracy in the minority class) in the range 0.5-0.80 even with 17.5-20% mislabeled observations. The performance of SVML and RF decreased monotonically with increasing noise in the data, while KNN consistently failed to identify mutation carriers (observations in the minority class). The computation time increased with noise in the data, especially for the two support vector machine classifiers.
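Total accuracy and TPR can diverge sharply when carriers are a small minority, which is why both are reported. The sketch below computes the two metrics from their standard definitions on hypothetical toy data; the class proportions and the two "classifiers" are invented for illustration and are not the study's results.

```python
def accuracy_and_tpr(y_true, y_pred):
    """Return (total accuracy, true positive rate) for 0/1 labels.

    TPR is the fraction of true carriers (label 1) predicted as carriers.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    tpr = true_pos / positives if positives else float("nan")
    return accuracy, tpr


# Hypothetical toy data: 1000 animals, 20 of them carriers.
y_true = [1] * 20 + [0] * 980

# A classifier that never predicts "carrier" still scores high total accuracy.
all_negative = [0] * 1000
acc_neg, tpr_neg = accuracy_and_tpr(y_true, all_negative)

# A classifier that finds half of the carriers and makes no false alarms.
half_found = [1] * 10 + [0] * 990
acc_half, tpr_half = accuracy_and_tpr(y_true, half_found)

print(acc_neg, tpr_neg)
print(acc_half, tpr_half)
```

In this toy setting the always-negative classifier misses every carrier yet is wrong on only 20 of 1000 animals, so total accuracy alone would hide the kind of minority-class failure described for KNN.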

CONCLUSIONS

This work was the first to assess the impact of phenotyping errors on the accuracy of genomic predictions in animal genetics. The choice of the classification method can influence results in terms of higher or lower susceptibility to noise. In the presented problem, SVM with radial kernel performed relatively well even when the proportion of errors in the data reached 12.5%. Lasso was the second best method, while SVML, RF and KNN were very sensitive to noise. Taking into account both accuracy and computation time, Lasso provided the best combination.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4036/5485573/81bcfe148b14/13104_2017_2540_Fig1_HTML.jpg
