Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
Bioinformatics. 2011 Nov 1;27(21):3017-23. doi: 10.1093/bioinformatics/btr502. Epub 2011 Sep 4.
Gene expression profiling has shown great potential in outcome prediction for different types of cancers. Nevertheless, small sample size remains a bottleneck in obtaining robust and accurate classifiers. Traditional supervised learning techniques can only work with labeled data. Consequently, a large amount of microarray data that lacks sufficient follow-up information is disregarded. To fully leverage all of the precious data in public databases, we turned to a semi-supervised learning technique, low density separation (LDS).
Using the clinically important question of predicting recurrence risk in colorectal cancer patients, we demonstrated that (i) semi-supervised classification improved prediction accuracy compared with the state-of-the-art supervised method SVM, (ii) the performance gain increased with the number of unlabeled samples, (iii) unlabeled data from different institutes could be employed after appropriate processing and (iv) the LDS method is robust with regard to the number of input features. To test the general applicability of this semi-supervised method, we further applied LDS to human breast cancer datasets and also observed superior performance. Our results demonstrate the great potential of semi-supervised learning in gene expression-based outcome prediction for cancer patients.
Supplementary data are available at Bioinformatics online.
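The intuition behind using unlabeled samples can be illustrated with a toy sketch. The code below is not the LDS algorithm used in the paper, whose details the abstract does not give. It only shows the underlying low-density idea in one dimension: among the thresholds that separate the labeled points, prefer one that lies in a region with few unlabeled points. The function name `low_density_threshold`, the `margin` and `steps` parameters, and all data values are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple


def low_density_threshold(
    labeled: List[Tuple[float, int]],
    unlabeled: List[float],
    margin: float = 0.5,
    steps: int = 200,
) -> float:
    """Toy 1-D illustration (hypothetical helper, not the paper's method).

    Scans thresholds between the two labeled classes and returns the first
    one with the fewest unlabeled points closer than `margin`.
    """
    neg_max = max(x for x, y in labeled if y == 0)
    pos_min = min(x for x, y in labeled if y == 1)
    if neg_max >= pos_min:
        raise ValueError("toy example assumes separable labeled data")

    best_t = neg_max
    best_count = len(unlabeled) + 1
    for i in range(steps + 1):
        # Candidate threshold that still separates the labeled points.
        t = neg_max + (pos_min - neg_max) * i / steps
        # Local density: unlabeled points strictly within `margin` of t.
        count = sum(1 for u in unlabeled if abs(u - t) < margin)
        if count < best_count:
            best_t, best_count = t, count
    return best_t


# Made-up data: one labeled sample per class, plus unlabeled samples
# forming two clusters with a gap between 3.5 and 4.8.
labeled = [(1.0, 0), (5.0, 1)]
unlabeled = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.8, 5.2, 5.6, 6.0, 6.5]

# A purely supervised midpoint ignores the unlabeled data entirely.
midpoint = (labeled[0][0] + labeled[1][0]) / 2
threshold = low_density_threshold(labeled, unlabeled)
print(f"midpoint={midpoint}, low-density threshold={threshold}")
```

In this made-up example the midpoint between the two labeled samples falls inside the left cluster, while the low-density threshold lands in the gap between the clusters. This is the kind of benefit unlabeled samples are meant to provide, under the assumption that samples of the same class tend to cluster together.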