Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions.

Authors

Somorjai R L, Dolenko B, Baumgartner R

Affiliation

Institute for Biodiagnostics, National Research Council Canada, Winnipeg, MB, Canada R3B 1Y6.

Publication

Bioinformatics. 2003 Aug 12;19(12):1484-91. doi: 10.1093/bioinformatics/btg182.

DOI: 10.1093/bioinformatics/btg182
PMID: 12912828

Abstract

MOTIVATION

Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the 'curse of dimensionality': the number of features characterizing these data is in the thousands or tens of thousands. The other is the 'curse of dataset sparsity': the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease.

RESULTS

Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample per feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these 'optimal' feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.
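
The non-uniqueness described above can be illustrated with a small simulation. The sketch below is not the authors' method and uses none of their datasets: it assumes pure Gaussian noise, arbitrary sample sizes (8 training and 4 validation samples), 5000 features, and a hypothetical helper `perfect_single_features` that applies a one-feature midpoint-threshold rule. It counts how many noise features classify the training set perfectly, and how many of those also classify the validation set perfectly.

```python
import numpy as np


def perfect_single_features(X_train, y_train, X_val, y_val):
    """Find features whose one-feature threshold rule makes no errors.

    Illustrative helper (not from the paper). Returns two index arrays:
    features that are perfect on the training set, and the subset that is
    also perfect on the validation set.
    """
    lo = X_train[y_train == 0]
    hi = X_train[y_train == 1]

    # A feature separates the training classes if one class lies entirely
    # below the other (either orientation).
    zero_below_one = lo.max(axis=0) < hi.min(axis=0)
    one_below_zero = hi.max(axis=0) < lo.min(axis=0)
    train_ok = zero_below_one | one_below_zero

    # Threshold at the midpoint of the gap between the two classes.
    thr = np.where(
        zero_below_one,
        (lo.max(axis=0) + hi.min(axis=0)) / 2.0,
        (hi.max(axis=0) + lo.min(axis=0)) / 2.0,
    )

    # Predict class 1 on the side of the threshold where class 1 sat in training.
    pred = np.where(zero_below_one, X_val > thr, X_val < thr).astype(int)
    val_ok = (pred == y_val[:, None]).all(axis=0)

    return np.flatnonzero(train_ok), np.flatnonzero(train_ok & val_ok)


rng = np.random.default_rng(0)
n_features = 5000                          # assumed, in the "thousands" range
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_val = np.array([0, 0, 1, 1])

# Pure noise: no feature carries any information about the labels.
X_train = rng.standard_normal((y_train.size, n_features))
X_val = rng.standard_normal((y_val.size, n_features))

train_idx, both_idx = perfect_single_features(X_train, y_train, X_val, y_val)

print("samples per feature:", y_train.size / n_features)
print("noise features perfect on training set:", train_idx.size)
print("... also perfect on validation set:", both_idx.size)
```

Under these assumptions, any feature reported as perfect is perfect by chance alone, which is the situation the abstract warns about: with so few samples, a perfect result on both training and validation sets does not by itself indicate biological relevance.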

Similar articles

1. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions.
   Bioinformatics. 2003 Aug 12;19(12):1484-91. doi: 10.1093/bioinformatics/btg182.
2. Feature selection and nearest centroid classification for protein mass spectrometry.
   BMC Bioinformatics. 2005 Mar 23;6:68. doi: 10.1186/1471-2105-6-68.
3. Simultaneous gene clustering and subset selection for sample classification via MDL.
   Bioinformatics. 2003 Jun 12;19(9):1100-9. doi: 10.1093/bioinformatics/btg039.
4. Classification algorithms for phenotype prediction in genomics and proteomics.
   Front Biosci. 2008 Jan 1;13:691-708. doi: 10.2741/2712.
5. Genetic algorithms applied to multi-class prediction for the analysis of gene expression data.
   Bioinformatics. 2003 Jan;19(1):37-44. doi: 10.1093/bioinformatics/19.1.37.
6. SamCluster: an integrated scheme for automatic discovery of sample classes using gene expression profile.
   Bioinformatics. 2003 May 1;19(7):811-7. doi: 10.1093/bioinformatics/btg095.
7. Induction of comprehensible models for gene expression datasets by subgroup discovery methodology.
   J Biomed Inform. 2004 Aug;37(4):269-84. doi: 10.1016/j.jbi.2004.07.007.
8. Effective dimension reduction methods for tumor classification using gene expression data.
   Bioinformatics. 2003 Mar 22;19(5):563-70. doi: 10.1093/bioinformatics/btg062.
9. Is cross-validation better than resubstitution for ranking genes?
   Bioinformatics. 2004 Jan 22;20(2):253-8. doi: 10.1093/bioinformatics/btg399.
10. Bayesian automatic relevance determination algorithms for classifying gene expression data.
    Bioinformatics. 2002 Oct;18(10):1332-9. doi: 10.1093/bioinformatics/18.10.1332.

Cited by

1. Multi-transcriptomics predicts clinical outcome in systemically untreated breast cancer patients with extensive follow-up.
   Breast Cancer Res. 2025 Jul 15;27(1):133. doi: 10.1186/s13058-025-02061-2.
2. Classification of Breast Cancer Microarray Data and Identification of Responsible Genes Using Rough Set Theory.
   Methods Mol Biol. 2025;2952:315-334. doi: 10.1007/978-1-0716-4690-8_19.
3. Auxiliary diagnosis of primary bone tumors based on Machine learning model.
   J Bone Oncol. 2024 Nov 9;49:100648. doi: 10.1016/j.jbo.2024.100648. eCollection 2024 Dec.
4. Ensemble-based classification using microRNA expression identifies a breast cancer patient subgroup with an ultralow long-term risk of metastases.
   Cancer Med. 2024 May;13(9):e7089. doi: 10.1002/cam4.7089.
5. MV-CVIB: microbiome-based multi-view convolutional variational information bottleneck for predicting metastatic colorectal cancer.
   Front Microbiol. 2023 Aug 22;14:1238199. doi: 10.3389/fmicb.2023.1238199. eCollection 2023.
6. Clustering of serum biomarkers involved in post-aneurysmal subarachnoid hemorrhage (aSAH) complications.
   Neurosurg Rev. 2023 Mar 3;46(1):63. doi: 10.1007/s10143-023-01967-9.
7. Role of Artificial Intelligence and Machine Learning in Prediction, Diagnosis, and Prognosis of Cancer.
   Cureus. 2022 Nov 2;14(11):e31008. doi: 10.7759/cureus.31008. eCollection 2022 Nov.
8. Comparison of machine learning models for predicting the risk of breast cancer-related lymphedema in Chinese women.
   Asia Pac J Oncol Nurs. 2022 Jun 9;9(12):100101. doi: 10.1016/j.apjon.2022.100101. eCollection 2022 Dec.
9. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression.
   J Am Med Inform Assoc. 2022 Aug 16;29(9):1525-1534. doi: 10.1093/jamia/ocac093.
10. Comparison of the Metastasis Predictive Potential of mRNA and Long Non-Coding RNA Profiling in Systemically Untreated Breast Cancer.
    Cancers (Basel). 2021 Sep 29;13(19):4907. doi: 10.3390/cancers13194907.