"Noisy beets": impact of phenotyping errors on genomic predictions for binary traits in Beta vulgaris.

Author information

Biscarini Filippo, Nazzicari Nelson, Broccanello Chiara, Stevanato Piergiorgio, Marini Simone

Affiliations

Department of Bioinformatics and Biostatistics, PTP Science Park, Via Einstein - Loc. Cascina Codazza, 26900 Lodi, Italy.

Council for Agricultural Research and Economics (CREA), Research Centre for Fodder Crops and Dairy Productions, Lodi, Italy.

Publication information

Plant Methods. 2016 Jul 18;12:36. doi: 10.1186/s13007-016-0136-4. eCollection 2016.

DOI:10.1186/s13007-016-0136-4

PMID:27437026

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4949885/

Abstract

BACKGROUND

Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, about the impact of noisy data on genomic predictions, a widely used statistical application in plant and animal breeding.

RESULTS

In this study, the sensitivity of five classification methods to noise in the data was investigated: K-nearest neighbours (KNN), random forest (RF), ridge logistic regression (LR), and support vector machines (SVM) with linear or radial basis function kernels. A sugar beet population of 123 plants, phenotyped for a binary trait and genotyped for 192 SNP (single nucleotide polymorphism) markers, was used. Labels (0/1 phenotype) were randomly sampled to generate noise. Starting from the base scenario without errors in the labels, increasing proportions of noisy labels, up to 50 %, were generated and introduced in the data.
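
The noise-generation step can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: a chosen proportion of individuals is drawn without replacement and their 0/1 label is flipped. The authors' exact sampling procedure may differ, and the function name `inject_label_noise` and the example phenotypes are hypothetical.

```python
import random


def inject_label_noise(labels, proportion, seed=None):
    """Return a copy of binary (0/1) labels with a given proportion flipped.

    Assumption: noise is modelled as flipping the label of randomly chosen
    individuals; this is an illustration, not the procedure from the paper.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    n_noisy = round(proportion * len(labels))
    # Pick the individuals that receive a wrong phenotype, without replacement.
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    noisy = list(labels)
    for i in noisy_idx:
        noisy[i] = 1 - noisy[i]
    return noisy


# Hypothetical phenotypes for 123 plants (the population size in the study).
rng = random.Random(0)
clean = [rng.randint(0, 1) for _ in range(123)]
for p in (0.0, 0.1, 0.2, 0.5):
    noisy = inject_label_noise(clean, p, seed=42)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"{p:.0%} noise: {changed} of {len(clean)} labels changed")
```

In a full experiment, each noisy label vector would be used to train the classifiers, while predictions would be scored against held-out data.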

CONCLUSIONS

Local classification methods (KNN and RF) showed higher tolerance to noisy labels than methods that leverage global data properties (LR and the two SVM models). In particular, KNN outperformed all other classifiers, with an AUC (area under the ROC curve) higher than 0.95 with up to 20 % noisy labels. The runner-up method, RF, had an AUC of 0.941 with 20 % noise.
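
To make the AUC figures concrete, the sketch below computes the area under the ROC curve through its rank interpretation: the fraction of (positive, negative) pairs in which the positive individual receives the higher predicted score, with ties counted as one half. The function name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1) is
    scored higher than a randomly chosen negative (label 0); ties count 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.95 and 0.941 should be read.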
