Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China.
Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, 264003, Shandong, China.
BMC Bioinformatics. 2020 Aug 14;21(1):357. doi: 10.1186/s12859-020-03653-9.
Previous studies have reported that labeling errors are not uncommon in omics data. Such mislabeled samples act as potential outliers and may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on least trimmed squares (enetLTS), and Ensemble, an ensemble classification method based on distinct feature selection and modeling strategies. The accuracy of these methods in biomarker selection and outlier detection needs to be evaluated and compared so that the appropriate method can be chosen.
The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On the simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variable. Overall, enetLTS had the best outlier detection accuracy, with false positive rates <0.05 and high sensitivity, and it still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers it missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on the subset of data that remained after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
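The evaluation criteria named above can be illustrated with a short Python sketch. The definitions used here are the conventional ones and are an assumption, since the abstract does not spell them out: positive selection rate is taken as the share of truly informative variables that were selected, false discovery rate as the share of selected variables that are not informative, and sensitivity and false positive rate are computed over flagged samples. The function names and the toy indices are hypothetical, and the comprehensive index is not reproduced because its formula is not given here.

```python
import math


def selection_metrics(selected, truly_informative):
    """Variable selection accuracy on a simulated dataset.

    Assumed definitions (not stated in the abstract):
      positive selection rate = selected informative / all informative
      false discovery rate    = selected non-informative / all selected
    """
    selected, truth = set(selected), set(truly_informative)
    true_pos = len(selected & truth)
    psr = true_pos / len(truth) if truth else math.nan
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return {"positive_selection_rate": psr, "false_discovery_rate": fdr}


def outlier_metrics(flagged, truly_mislabeled, n_samples):
    """Outlier (mislabeled sample) detection accuracy.

    sensitivity         = flagged mislabeled / all mislabeled
    false positive rate = flagged clean / all clean
    """
    flagged, truth = set(flagged), set(truly_mislabeled)
    true_pos = len(flagged & truth)
    false_pos = len(flagged - truth)
    n_clean = n_samples - len(truth)
    sensitivity = true_pos / len(truth) if truth else math.nan
    fpr = false_pos / n_clean if n_clean else math.nan
    return {"sensitivity": sensitivity, "false_positive_rate": fpr}


# Toy example with made-up indices: 4 informative variables, 2 mislabeled samples.
print(selection_metrics(selected=[1, 2, 3, 9], truly_informative=[1, 2, 3, 4]))
print(outlier_metrics(flagged=[0, 5], truly_mislabeled=[0, 1], n_samples=12))
```

In a simulation the true informative variables and mislabeled samples are known by construction, which is what makes these quantities computable; on a real RNA-seq dataset they are generally unknown.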
When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is >5%, Ensemble can be used for variable selection on the subset that remains after removing the outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated from the inaccuracy of the diagnostic methods used.
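The decision rule above can be written down as a small Python sketch. It rests on one assumption that goes beyond the abstract: the expected proportion of mislabeled samples is approximated from the prevalence of the disease and the sensitivity and specificity of the diagnostic method, using the standard expected misclassification rate. The function names and the example figures are hypothetical and only illustrate the rule; they are not taken from the study.

```python
def expected_mislabel_proportion(prevalence, sensitivity, specificity):
    """Expected share of mislabeled samples under an imperfect diagnosis.

    Assumption: cases are missed at rate (1 - sensitivity) and non-cases
    are wrongly labeled as cases at rate (1 - specificity).
    """
    missed_cases = prevalence * (1.0 - sensitivity)
    false_cases = (1.0 - prevalence) * (1.0 - specificity)
    return missed_cases + false_cases


def recommend_workflow(outlier_proportion, threshold=0.05):
    """Apply the 5% rule for choosing a variable selection workflow."""
    if outlier_proportion <= threshold:
        return "Ensemble on the full dataset"
    return "enetLTS to flag outliers, then Ensemble on the remaining samples"


# Hypothetical figures: 30% prevalence, 90% sensitivity, 95% specificity.
p = expected_mislabel_proportion(prevalence=0.30, sensitivity=0.90, specificity=0.95)
print(f"estimated outlier proportion: {p:.3f}")
print(recommend_workflow(p))
```

The threshold is kept as a parameter so that the 5% cut-off, which comes from the simulation results summarized above, is easy to see and to change.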