Identification of differential gene expression for microarray data using recursive random forest

Wu Xiao-yan, Wu Zhen-yu, Li Kang

Department of Biostatistics, College of Public Health, Harbin Medical University, Harbin, Heilongjiang, China.

Chin Med J (Engl). 2008 Dec 20;121(24):2492-6.

BACKGROUND

The major difficulty in analyzing DNA microarray data is that the number of genes is large relative to the small number of samples, and the data structure is complex. Random forest has received much attention recently; its primary characteristic is that it can build a classification model from high-dimensional data. However, it does not give optimal results for gene selection, because it is still affected by genes that are not differentially expressed. We propose recursive random forest analysis and apply it to gene selection.

METHODS

Recursive random forest, an improvement on random forest, obtains the optimal set of differentially expressed genes by dropping, step by step, the genes that a certain algorithm judges to have no effect on classification. The method retains the advantages of random forest and also provides a gene importance scale. The area under the receiver operating characteristic (ROC) curve (AUC), which combines information on sensitivity and specificity, is adopted as the key criterion for evaluating the method's performance. The focus of the paper is to validate the effectiveness of gene selection with recursive random forest through the analysis of five microarray datasets: colon, prostate, leukemia, breast and skin.
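The elimination loop and the AUC criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the abstract does not specify how genes are ranked or how many are dropped per step, so the names `class_mean_gap`, `recursive_selection` and `drop_frac` are hypothetical. To stay dependency-free, a signed class-mean gap stands in for the random forest importance score, and the AUC is computed on the training samples, which is optimistic; a real analysis would use out-of-bag or held-out samples.

```python
from statistics import mean


def auc(scores, labels):
    """AUC via the rank-sum identity: the fraction of (positive, negative)
    pairs in which the positive sample scores higher; ties count as 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def class_mean_gap(X, y, genes):
    """ASSUMPTION: stand-in for a random forest importance score.
    Returns, per gene, mean(class 1) minus mean(class 0)."""
    return {
        g: mean(row[g] for row, lab in zip(X, y) if lab == 1)
        - mean(row[g] for row, lab in zip(X, y) if lab == 0)
        for g in genes
    }


def recursive_selection(X, y, weight_fn=class_mean_gap, drop_frac=0.5):
    """Score the current gene set, drop the least important fraction,
    and repeat until one gene is left. Returns the best set seen."""
    genes = list(range(len(X[0])))
    history = []  # (number of genes, AUC) at each step
    best_score, best_genes = -1.0, genes
    while True:
        w = weight_fn(X, y, genes)
        # Toy classifier: weighted sum of expression values.
        scores = [sum(w[g] * row[g] for g in genes) for row in X]
        score = auc(scores, y)
        history.append((len(genes), score))
        if score >= best_score:  # on ties, prefer the smaller gene set
            best_score, best_genes = score, list(genes)
        if len(genes) == 1:
            break
        keep = max(1, int(len(genes) * (1 - drop_frac)))
        genes = sorted(genes, key=lambda g: abs(w[g]), reverse=True)[:keep]
    return best_genes, best_score, history


# Hypothetical toy data: 6 samples x 4 genes; only gene 0 separates the classes.
X = [
    [1, 10, 2, 7],
    [1, 0, 6, 8],
    [1, 2, 1, 6],
    [0, 0, 5, 6],
    [0, 1, 2, 7],
    [0, 10, 4, 8],
]
y = [1, 1, 1, 0, 0, 0]

best_genes, best_score, history = recursive_selection(X, y)
print(best_genes, best_score, history)
```

On this toy data the full gene set scores below a perfect AUC because the noisy genes contribute to the sample scores, and the loop ends with the single informative gene. Swapping `class_mean_gap` for an importance measure taken from a fitted random forest would bring the sketch closer to the method the abstract describes.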

RESULTS

Five microarray datasets were analyzed, and better classification results were attained using only a few genes after gene selection. The biological information on the genes selected from the breast and skin data was confirmed against the National Center for Biotechnology Information (NCBI). The results show that genes associated with disease can be effectively retained by recursive random forest.

CONCLUSIONS

Recursive random forest can be applied effectively to microarray data analysis and gene selection. The genes retained in the optimal model provide important information for clinical diagnosis and for research into the biological mechanisms of disease.

Similar articles

Identification of differential gene expression for microarray data using recursive random forest.

Chin Med J (Engl). 2008 Dec 20;121(24):2492-6.

Tumor classification ranking from microarray data.

BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S21. doi: 10.1186/1471-2164-9-S2-S21.

Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE.

BMC Bioinformatics. 2006 Dec 25;7:543. doi: 10.1186/1471-2105-7-543.

Reliable gene signatures for microarray classification: assessment of stability and performance.

Bioinformatics. 2006 Oct 1;22(19):2356-63. doi: 10.1093/bioinformatics/btl400. Epub 2006 Jul 31.

An integrated algorithm for gene selection and classification applied to microarray data of ovarian cancer.

Artif Intell Med. 2008 Jan;42(1):81-93. doi: 10.1016/j.artmed.2007.09.004. Epub 2007 Nov 19.

Ensemble gene selection by grouping for microarray data classification.

J Biomed Inform. 2010 Feb;43(1):81-7. doi: 10.1016/j.jbi.2009.08.010. Epub 2009 Aug 20.

Filter versus wrapper gene selection approaches in DNA microarray domains.

Artif Intell Med. 2004 Jun;31(2):91-103. doi: 10.1016/j.artmed.2004.01.007.

Differential gene expression detection and sample classification using penalized linear regression models.

Bioinformatics. 2006 Feb 15;22(4):472-6. doi: 10.1093/bioinformatics/bti827. Epub 2005 Dec 13.

Limits of predictive models using microarray data for breast cancer clinical treatment outcome.

J Natl Cancer Inst. 2005 Jun 15;97(12):927-30. doi: 10.1093/jnci/dji153.

Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers.

Biosystems. 2007 Jul-Aug;90(1):78-86. doi: 10.1016/j.biosystems.2006.07.002. Epub 2006 Jul 10.

Cited by

Machine learning-based classification and diagnosis of clinical cardiomyopathies.

Physiol Genomics. 2020 Sep 1;52(9):391-400. doi: 10.1152/physiolgenomics.00063.2020. Epub 2020 Aug 3.

Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection.

Evid Based Complement Alternat Med. 2013;2013:298183. doi: 10.1155/2013/298183. Epub 2013 Feb 2.

Predicting sulfotyrosine sites using the random forest algorithm with significantly improved prediction accuracy.

BMC Bioinformatics. 2009 Oct 29;10:361. doi: 10.1186/1471-2105-10-361.
