Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27705, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Sep-Oct;9(5):1422-31. doi: 10.1109/TCBB.2012.63.
Although many feature selection methods have been developed for classification, there remains a need for methods that identify genes in high-dimensional data with censored survival outcomes. Traditional gene selection methods for classification problems have several drawbacks. First, the majority of gene selection approaches for classification are single-gene based. Second, many gene selection procedures are not embedded within the learning algorithm itself. Random forests have been found to perform well in high-dimensional data settings with survival outcomes, and they include an embedded mechanism for identifying important variables. They are therefore a natural candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis. Additionally, we show the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes, allowing better use of the information available from microarray data with survival outcomes.
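The abstract refers to node split criteria for survival trees without naming them. As a minimal sketch only, and not the authors' method, the following Python code illustrates one commonly used criterion, the two-sample log-rank statistic, applied to a candidate split on a single gene. The function names `logrank_statistic` and `best_split`, the midpoint thresholds, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def logrank_statistic(times, events, in_left):
    """Two-sample log-rank Z statistic for one candidate split.

    times   : observed follow-up times
    events  : 1 if the event was observed, 0 if censored
    in_left : True if the sample falls in the left child node
    (Hypothetical helper, written for illustration only.)
    """
    # Only times at which at least one event occurred contribute.
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still at risk just before time t, overall and in the left node.
        n = sum(1 for ti in times if ti >= t)
        n_left = sum(1 for ti, g in zip(times, in_left) if ti >= t and g)
        # Events occurring exactly at time t, overall and in the left node.
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d_left = sum(
            1 for ti, e, g in zip(times, events, in_left) if ti == t and e and g
        )
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            p = n_left / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return observed_minus_expected / sqrt(variance)


def best_split(expression, times, events):
    """Scan midpoints between distinct expression values of one gene and
    return (threshold, |Z|) for the split with the largest |Z|."""
    best = (None, 0.0)
    values = sorted(set(expression))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        in_left = [x <= threshold for x in expression]
        z = abs(logrank_statistic(times, events, in_left))
        if z > best[1]:
            best = (threshold, z)
    return best


# Toy data (invented): high expression coincides with early events.
toy_expression = [5, 5, 5, 1, 1, 1]
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 1, 1]
toy_threshold, toy_z = best_split(toy_expression, toy_times, toy_events)
```

In a survival tree, a scan of this kind would typically be repeated over a random subset of genes at each node, with the split that best separates the survival experience of the two child nodes being kept. Because a forest grows many such trees, a gene's contribution is assessed in the presence of other genes rather than one gene at a time.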