A critical assessment of feature selection methods for biomarker discovery in clinical proteomics.

In this paper, we compare the performance of six feature selection methods for LC-MS-based proteomics and metabolomics biomarker discovery: the t test, the Mann-Whitney-Wilcoxon test (MWW test), nearest shrunken centroid (NSC), linear support vector machine-recursive feature elimination (SVM-RFE), principal component discriminant analysis (PCDA), and partial least squares discriminant analysis (PLSDA). The comparison uses human urine and porcine cerebrospinal fluid samples that were spiked with a range of peptides at different concentration levels. The ideal feature selection method should select the complete list of discriminating features that are related to the spiked peptides without selecting unrelated features. Whereas many studies have to rely on classification error to judge the reliability of the selected biomarker candidates, we assessed the accuracy of selection directly against the list of spiked peptides. The feature selection methods were applied to data sets with different sample sizes and different extents of sample class separation, the latter determined by the concentration level of the spiked compounds. For each feature selection method and data set, the performance in selecting a set of features related to the spiked compounds was assessed using the harmonic mean of the recall and the precision (f-score) and the geometric mean of the recall and the true negative rate (g-score). We conclude that the univariate t test and the MWW test with multiple testing corrections are not applicable to data sets with small sample sizes (n = 6), but their performance improves markedly with increasing sample size, up to a point (n > 12) at which they outperform the other methods. PCDA and PLSDA select small feature sets with high precision but miss many true positive features related to the spiked peptides. NSC strikes a reasonable compromise between recall and precision for all data sets, independent of spiking level and number of samples. Linear SVM-RFE performs poorly for selecting features related to the spiked compounds, even though the classification error is relatively low.
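
The two scores named in the abstract can be illustrated with a minimal Python sketch. It assumes that features are represented by simple identifiers and that a feature counts as a true positive when it appears in the set of spiked-peptide features; the function name `selection_scores` and the example sets are hypothetical and are not taken from the paper, which may match features to spiked peptides differently.

```python
from math import sqrt


def selection_scores(selected, spiked, all_features):
    """Return (f_score, g_score) for one selected feature set.

    selected     -- features chosen by a feature selection method
    spiked       -- features known to relate to spiked compounds (ground truth)
    all_features -- every feature in the data set
    All three arguments are hypothetical inputs for this illustration.
    """
    selected, spiked, all_features = set(selected), set(spiked), set(all_features)

    tp = len(selected & spiked)                 # spiked features that were selected
    fp = len(selected - spiked)                 # unrelated features that were selected
    fn = len(spiked - selected)                 # spiked features that were missed
    tn = len(all_features - selected - spiked)  # unrelated features correctly left out

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    true_negative_rate = tn / (tn + fp) if (tn + fp) else 0.0

    # f-score: harmonic mean of recall and precision
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    # g-score: geometric mean of recall and true negative rate
    g_score = sqrt(recall * true_negative_rate)
    return f_score, g_score


# Made-up example: 100 features, 10 of them related to spiked peptides.
all_features = range(100)
spiked = range(10)
broad_selection = {0, 1, 2, 3, 4, 5, 10, 11}  # 6 true positives, 2 false positives
narrow_selection = {0, 1}                     # precise, but misses 8 spiked features

f_broad, g_broad = selection_scores(broad_selection, spiked, all_features)
f_narrow, g_narrow = selection_scores(narrow_selection, spiked, all_features)
```

In this made-up example the narrow selection has perfect precision but a lower f-score than the broad one, because the harmonic mean penalizes its low recall. The g-score uses the true negative rate rather than precision, so it stays close to the square root of the recall whenever unrelated features greatly outnumber the selected ones.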


