Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions.

Authors

Somorjai R L, Dolenko B, Baumgartner R

Affiliation

Institute for Biodiagnostics, National Research Council Canada, Winnipeg, MB, Canada R3B 1Y6.

Publication information

Bioinformatics. 2003 Aug 12;19(12):1484-91. doi: 10.1093/bioinformatics/btg182.

Abstract

MOTIVATION

Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the 'curse of dimensionality': the number of features characterizing these data is in the thousands or tens of thousands. The other is the 'curse of dataset sparsity': the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease.

RESULTS

Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample per feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these 'optimal' feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
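The non-uniqueness described above can be illustrated with a small simulation. The sketch below is not the authors' procedure or data. It assumes pure Gaussian noise features, 5 samples per class, and a single-feature threshold rule as the "very simple classifier"; the function names are hypothetical. Under these assumptions each noise feature separates the 10 samples perfectly with probability 2/C(10,5) = 1/126, so among 2000 features roughly 16 "perfect" single-feature classifiers are expected on average, none of which carries any biological meaning.

```python
import random


def separates_perfectly(values, labels):
    """Return True if one threshold on `values` splits the two classes without error."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return max(class0) < min(class1) or max(class1) < min(class0)


def count_perfect_noise_features(n_per_class=5, n_features=2000, seed=0):
    """Generate label-independent noise features; return indices of perfect separators."""
    rng = random.Random(seed)
    labels = [0] * n_per_class + [1] * n_per_class
    perfect = []
    for j in range(n_features):
        # Each feature is drawn independently of the labels (assumption: pure noise).
        column = [rng.gauss(0.0, 1.0) for _ in labels]
        if separates_perfectly(column, labels):
            perfect.append(j)
    return perfect


if __name__ == "__main__":
    hits = count_perfect_noise_features()
    print(f"{len(hits)} of 2000 pure-noise features classify all 10 samples perfectly")
```

Raising `n_per_class` makes such chance separators vanish quickly, which is one way to see why the sparsity of the dataset, and not only the number of features, drives the effect.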
