Biscarini Filippo, Nazzicari Nelson, Broccanello Chiara, Stevanato Piergiorgio, Marini Simone
Department of Bioinformatics and Biostatistics, PTP Science Park, Via Einstein - Loc. Cascina Codazza, 26900 Lodi, Italy.
Council for Agricultural Research and Economics (CREA), Research Centre for Fodder Crops and Dairy Productions, Lodi, Italy.
Plant Methods. 2016 Jul 18;12:36. doi: 10.1186/s13007-016-0136-4. eCollection 2016.
Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, about the impact of noisy data on genomic prediction, a statistical application widely used in plant and animal breeding.
In this study, the sensitivity to noisy data of five classification methods was investigated: K-nearest neighbours (KNN), random forest (RF), ridge logistic regression (LR), and support vector machines (SVM) with linear or radial basis function kernels. A sugar beet population of 123 plants, phenotyped for a binary trait and genotyped for 192 SNP (single nucleotide polymorphism) markers, was used. Labels (0/1 phenotype) were randomly sampled to generate noise. Starting from the base scenario without errors in the labels, increasing proportions of noisy labels, up to 50%, were generated and introduced into the data.
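The label-noise step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say whether noisy labels were flipped or redrawn, so the sketch assumes that a chosen share of labels is redrawn uniformly from 0/1. The function name `add_label_noise`, the `seed` parameter, and the example phenotype vector are hypothetical and are not taken from the study.

```python
import random


def add_label_noise(labels, proportion, seed=None):
    """Return a copy of `labels` with a share of entries redrawn at random.

    Assumption: "noise" means redrawing the selected 0/1 labels uniformly,
    so a redrawn label may happen to equal the original one.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = round(proportion * len(noisy))
    # Pick which plants receive a noisy label, without replacement.
    for i in rng.sample(range(len(noisy)), n_noisy):
        noisy[i] = rng.randint(0, 1)
    return noisy


# Hypothetical phenotype vector for 123 plants (not the study's data).
phenotypes = [i % 2 for i in range(123)]

for level in (0.0, 0.1, 0.2, 0.5):
    noisy = add_label_noise(phenotypes, level, seed=42)
    changed = sum(a != b for a, b in zip(phenotypes, noisy))
    print(f"noise level {level:.0%}: {changed} labels changed")
```

Because redrawn labels can coincide with the originals, the number of labels that actually change is at most the number selected. A variant that flips every selected label would instead guarantee the stated proportion of errors.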
Local classification methods (KNN and RF) showed higher tolerance to noisy labels than methods that leverage global data properties (LR and the two SVM models). In particular, KNN outperformed all other classifiers, with an AUC (area under the ROC curve) higher than 0.95 at up to 20% noisy labels. The runner-up method, RF, had an AUC of 0.941 with 20% noise.
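For readers unfamiliar with the metric, the AUC reported above equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal Python sketch of that pairwise definition follows. The function name `auc` and the toy labels and scores are illustrative assumptions, not the study's code or data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example scores higher; tied scores contribute one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): three of the four
# positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The quadratic pairwise loop is adequate for a data set of this size (123 plants); a rank-based computation would be preferable for large samples.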