Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G
Machine Learning Group, Department of Computer Science, Institut Jules Bordet, Université Libre de Bruxelles, Brussels, Belgium.
Bioinformatics. 2008 Oct 1;24(19):2200-8. doi: 10.1093/bioinformatics/btn374. Epub 2008 Jul 17.
Survival prediction of breast cancer (BC) patients independently of treatment, also known as prognostication, is a complex task because clinically similar breast tumors, in addition to being molecularly heterogeneous, may exhibit different clinical outcomes. In recent years, the analysis of gene expression profiles by means of sophisticated data mining tools has emerged as a promising technology to bring additional insights into BC biology and to improve the quality of prognostication. The aim of this work is to quantitatively assess the prediction accuracy obtained with state-of-the-art data analysis techniques for BC microarray data through an independent and thorough framework.
Due to the large number of variables, the small number of samples and the high degree of noise, complex prediction methods are highly exposed to performance degradation despite the use of cross-validation techniques. Our analysis shows that the most complex methods are not significantly better than the simplest one, a univariate model relying on a single proliferation gene. This result suggests that proliferation might be the most relevant biological process for BC prognostication and that the loss of interpretability deriving from the use of overcomplex methods may not be sufficiently counterbalanced by an improvement in the quality of prediction.
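To make the idea of scoring a single-gene model concrete, the following is a minimal sketch, not the authors' implementation. It assumes that prediction accuracy is measured with a concordance index (Harrell's C), which the abstract does not specify, and it uses a small made-up dataset in which the expression of one hypothetical proliferation gene serves directly as the risk score. The function name `concordance_index` and all values are illustrative assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable patient pairs that the risk
    score orders correctly (higher risk should mean shorter survival).

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter follow-up time than patient j. Tied risk scores
    count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy cohort (assumption, not data from the study):
# follow-up time in years, event indicator (1 = event, 0 = censored),
# and expression of a single proliferation gene used as the risk score.
times = [2, 4, 5, 7, 9]
events = [1, 1, 0, 1, 0]
gene_expression = [9.1, 7.4, 8.0, 5.2, 4.8]

c_index = concordance_index(times, events, gene_expression)
print(c_index)  # 7 of 8 comparable pairs are concordant -> 0.875
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a more complex model would have to exceed the single-gene score by a significant margin on independent data to justify its loss of interpretability.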
The comparison study is implemented in an R package called survcomp and is available from http://www.ulb.ac.be/di/map/bhaibeka/software/survcomp/.