IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):124-130. doi: 10.1109/TCBB.2018.2829519. Epub 2018 Apr 23.
To screen differentially expressed genes in breast cancer quickly and efficiently, two breast cancer gene microarray datasets, GSE15852 and GSE45255, were downloaded from GEO. By combining the Logistic Regression and Random Forest algorithms, this paper proposes a novel method named LR-RF that selects differentially expressed genes of breast cancer from microarray data, using the Bonferroni test to control the family-wise error rate (FWER). Compared with Logistic Regression and Random Forest used alone, our study shows that LR-RF is more effective at selecting differentially expressed genes. The average prediction accuracy of LR-RF over 10 replicated random tests reaches 93.11 percent, with a variance as low as 0.00045. The prediction accuracy peaks at 95.57 percent when the threshold value α = 0.2 is used in the random forest step that ranks genes by importance score, and the number of differentially expressed genes selected is relatively small. In addition, analysis of the gene interaction networks shows that most of the top 20 selected genes are involved in the development of breast cancer. Together, these results demonstrate the reliability and efficiency of LR-RF. It is anticipated that LR-RF will provide new knowledge and methods for biologists, medical scientists, and cognitive computing researchers to identify disease-related genes of breast cancer.
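The two filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that per-gene p-values (for example, from per-gene logistic regression fits) and random forest importance scores have already been computed, and it interprets the threshold α as a cutoff on importance relative to the top-ranked gene, which the abstract does not specify. The function names, gene labels, and numeric values are all hypothetical.

```python
from typing import Dict, List


def bonferroni_select(p_values: Dict[str, float], fwer: float = 0.05) -> List[str]:
    """Keep genes whose p-value passes the Bonferroni cutoff fwer / m,
    where m is the number of genes tested."""
    m = len(p_values)
    if m == 0:
        return []
    cutoff = fwer / m
    return sorted(g for g, p in p_values.items() if p <= cutoff)


def filter_by_importance(importance: Dict[str, float], alpha: float = 0.2) -> List[str]:
    """Rank genes by importance score (highest first) and keep those whose
    score is at least alpha times the top score.

    Assumption: alpha is treated as a relative-importance cutoff; the
    abstract does not define how the threshold is applied."""
    if not importance:
        return []
    top = max(importance.values())
    if top <= 0:
        return []
    ranked = sorted(importance, key=lambda g: importance[g], reverse=True)
    return [g for g in ranked if importance[g] / top >= alpha]


# Hypothetical toy inputs, not data from the paper.
toy_p_values = {"GENE_A": 0.0001, "GENE_B": 0.004, "GENE_C": 0.03, "GENE_D": 0.6}
toy_importance = {"GENE_A": 0.7, "GENE_B": 0.1}

significant = bonferroni_select(toy_p_values, fwer=0.05)  # cutoff = 0.05 / 4
selected = filter_by_importance(
    {g: toy_importance[g] for g in significant}, alpha=0.2
)
```

With these toy values, the Bonferroni step keeps `GENE_A` and `GENE_B`, and the importance step then keeps only `GENE_A`. Dividing the FWER level by the number of tests is the standard Bonferroni correction; the relative-importance rule is only one plausible reading of the α threshold.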