Department of Statistics, University of California, 8125 Math Sciences Building, Box 951554, Los Angeles, CA 90095-1554, USA.
Proc Natl Acad Sci U S A. 2010 Apr 13;107(15):6737-42. doi: 10.1073/pnas.0910140107. Epub 2010 Mar 25.
To many biomedical researchers, effective tumor classification methods such as the support vector machine often appear to be a black box, not only because the procedures are complex but also because the required specifications, such as the choice of a kernel function, lack clear guidance, either mathematical or biological. As commonly observed, samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. But can this widely held observation lead to a useful procedure for classification and prediction? To address this question, we first conceived a statistical framework and derived general conditions that serve as the theoretical foundation for the aforementioned empirical observation. We then constructed a classification procedure that fully utilizes the information obtained by comparing the distributions of within-class correlations with those of between-class correlations via Kullback-Leibler divergence. We compared our approach with many machine-learning techniques by applying them to 22 binary- and multiclass gene-expression datasets involving human cancers. The results showed that our method performed as efficiently as the support vector machine and naïve Bayes and outperformed other learning methods (decision trees, linear discriminant analysis, and k-nearest neighbors). In addition, we conducted a simulation study showing that our method is more effective when newly arriving samples are subject to the often-encountered problems of baseline shift or increased noise. Our method can be extended to general classification problems in which only similarity scores between samples are available.
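The core quantity the abstract describes — the contrast between the distribution of within-class correlations and that of between-class correlations, measured by Kullback-Leibler divergence — can be illustrated with a minimal sketch. This is not the authors' actual procedure; it only assumes pairwise Pearson correlation as the similarity score and a simple histogram-based KL estimate, with toy data in which samples of a class share a class-specific expression signal:

```python
import numpy as np

def pairwise_correlations(X):
    """All pairwise Pearson correlations between rows (samples) of X."""
    C = np.corrcoef(X)                  # sample-by-sample correlation matrix
    iu = np.triu_indices_from(C, k=1)   # upper triangle, excluding the diagonal
    return C[iu]

def kl_divergence_hist(p_samples, q_samples, bins=20):
    """Estimate D(P || Q) from shared-bin histograms of two empirical samples,
    with a small smoothing constant to avoid log(0)."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    p = (p + 1e-6) / (p + 1e-6).sum()
    q = (q + 1e-6) / (q + 1e-6).sum()
    return float(np.sum(p * np.log(p / q)))

# Toy data: two classes of 10 samples over 50 genes; samples within a class
# share a class-specific signal, so their mutual correlations tend to be high.
rng = np.random.default_rng(0)
signal_a, signal_b = rng.normal(size=50), rng.normal(size=50)
class_a = signal_a + 0.5 * rng.normal(size=(10, 50))
class_b = signal_b + 0.5 * rng.normal(size=(10, 50))

within = np.concatenate([pairwise_correlations(class_a),
                         pairwise_correlations(class_b)])
# Cross-class correlations: upper-right block of the stacked correlation matrix.
between = np.corrcoef(class_a, class_b)[:10, 10:].ravel()

print(within.mean(), between.mean())          # within-class correlations are higher
print(kl_divergence_hist(within, between))    # positive: the two distributions differ
```

A classifier built on this idea would assign a new sample to whichever class makes the resulting correlation distributions most consistent with the within/between pattern — which is why, as the abstract notes, the approach needs only similarity scores between samples, not the raw feature vectors.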