Wu Xiao-yan, Wu Zhen-yu, Li Kang
Department of Biostatistics, College of Public Health, Harbin Medical University, Harbin, Heilongjiang, China.
Chin Med J (Engl). 2008 Dec 20;121(24):2492-6.
The major difficulty in analyzing DNA microarray data is that the number of genes is large relative to the small number of samples, and the data structure is complex. Random forest has received much attention recently; its primary characteristic is that it can build a classification model from high-dimensional data. However, it cannot deliver optimal results for gene selection, because it is still affected by genes that do not discriminate between classes. We proposed recursive random forest analysis and applied it to gene selection.
Recursive random forest, an improvement on random forest, arrives at an optimal set of discriminating genes by dropping, step by step, the genes that a given algorithm identifies as having no effect on classification. The method retains the advantages of random forest and also provides a gene importance scale. The area under the receiver operating characteristic (ROC) curve (AUC), which combines information on sensitivity and specificity, is adopted as the key criterion for evaluating the performance of the method. The focus of the paper is to validate the effectiveness of gene selection with recursive random forest through the analysis of five microarray datasets: colon, prostate, leukemia, breast, and skin data.
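The elimination loop described above can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the abstract does not state the importance measure, the number of genes dropped per step, or the stopping rule, so the Gini-gain importance, the `drop_frac` and `min_genes` parameters, the out-of-bag AUC estimate, and the use of single-split trees (stumps) in place of full trees are all assumptions made for illustration, as is the synthetic data.

```python
import random

def auc(labels, scores):
    """Rank-based AUC: chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump(X, y, rows, genes):
    """Best single-gene threshold split on `rows`: (gain, gene, threshold, p_left, p_right)."""
    parent = gini([y[r] for r in rows])
    best = (0.0, None, None, 0.5, 0.5)
    for g in genes:
        values = sorted({X[r][g] for r in rows})
        for t in values[:-1]:  # splitting at the largest value would leave one side empty
            left = [y[r] for r in rows if X[r][g] <= t]
            right = [y[r] for r in rows if X[r][g] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if parent - child > best[0]:
                best = (parent - child, g, t, sum(left) / len(left), sum(right) / len(right))
    return best

def forest(X, y, genes, n_trees, rng):
    """Bagged stumps with random gene subsets: (importance per gene, out-of-bag AUC)."""
    n = len(y)
    importance = {g: 0.0 for g in genes}
    oob_sum, oob_cnt = [0.0] * n, [0] * n
    mtry = max(1, int(len(genes) ** 0.5))
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        gain, g, t, p_left, p_right = best_stump(X, y, boot, rng.sample(genes, mtry))
        if g is None:  # no useful split found in this bootstrap sample
            continue
        importance[g] += gain  # assumed importance measure: summed Gini gain
        for r in set(range(n)) - set(boot):  # score only out-of-bag rows
            oob_sum[r] += p_left if X[r][g] <= t else p_right
            oob_cnt[r] += 1
    seen = [r for r in range(n) if oob_cnt[r]]
    score = auc([y[r] for r in seen], [oob_sum[r] / oob_cnt[r] for r in seen])
    return importance, score

def recursive_forest(X, y, drop_frac=0.3, min_genes=2, n_trees=60, seed=0):
    """Refit, drop the least important genes, repeat; return (best AUC, gene indices)."""
    rng = random.Random(seed)
    genes = list(range(len(X[0])))
    history = []
    while True:
        importance, score = forest(X, y, genes, n_trees, rng)
        history.append((score, list(genes)))
        if len(genes) <= min_genes:
            break
        ranked = sorted(genes, key=lambda g: importance[g], reverse=True)
        keep = max(min_genes, int(len(ranked) * (1 - drop_frac)))
        genes = sorted(ranked[:keep])
    # Highest AUC wins; ties go to the smaller gene set.
    return max(history, key=lambda h: (h[0], -len(h[1])))

# Synthetic data (hypothetical): 40 samples, 30 genes, only genes 0 and 1 carry signal.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(3.0 * label if g < 2 else 0.0, 1.0) for g in range(30)] for label in y]
best_auc, selected = recursive_forest(X, y)
print(best_auc, selected)
```

The loop records the AUC at every step and returns the gene subset with the highest value, which mirrors the idea of using AUC as the key criterion. On real microarray data, a full random forest implementation and an evaluation set that is independent of the selection step would be the more defensible choices.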
Five microarray datasets were analyzed, and better classification results were attained using only a few genes after gene selection. The biological information of the genes selected from the breast and skin data was confirmed using information from the National Center for Biotechnology Information (NCBI). The results show that genes associated with diseases can be effectively retained by recursive random forest.
Recursive random forest can be effectively applied to microarray data analysis and gene selection. The genes retained in the optimal model provide important information for clinical diagnosis and for research into the biological mechanisms of diseases.