BMC Bioinformatics. 2013 Mar 19;14:100. doi: 10.1186/1471-2105-14-100.
Microarray technology can acquire information about thousands of genes simultaneously. We analyzed published breast cancer microarray databases to predict five-year recurrence and compared the performance of three data mining algorithms, artificial neural networks (ANN), decision trees (DT) and logistic regression (LR), and of two composite models, DT-ANN and DT-LR. Four breast cancer microarray datasets collected from the Gene Expression Omnibus were pooled to predict five-year breast cancer relapse. The pooled data comprised 757 subjects, 5 clinical variables and 13,452 genetic variables. The bootstrap method, the Mann-Whitney U test and 20-fold cross-validation were used to identify the candidate genes with the 100 most significant p-values. The predictive power of the DT, LR and ANN models was assessed using accuracy and the area under the ROC curve. The associated genes were evaluated using Cox regression.
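The gene screening and the ROC-based evaluation described above rest on the same rank statistic: the area under the ROC curve equals the Mann-Whitney U statistic divided by the number of relapse/non-relapse pairs. The following is a minimal sketch of that relationship, not the authors' pipeline. The function names, the data layout (a dict from gene name to per-subject values) and the toy values are assumptions made for illustration. Genes are ranked by |AUC - 0.5| as a stand-in for the p-value ranking; for fixed group sizes and no ties this gives the same order as the normal-approximation p-value. The bootstrap and cross-validation steps are omitted.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(pos, neg):
    """U statistic for `pos`: pairs with pos > neg, ties counted as 0.5."""
    ranks = average_ranks(list(pos) + list(neg))
    rank_sum_pos = sum(ranks[:len(pos)])
    return rank_sum_pos - len(pos) * (len(pos) + 1) / 2


def auc(pos, neg):
    """Area under the ROC curve of one variable, computed as U / (n_pos * n_neg)."""
    return mann_whitney_u(pos, neg) / (len(pos) * len(neg))


def rank_genes(expression, relapsed, top_k):
    """Return the top_k genes by |AUC - 0.5| (hypothetical screening helper).

    expression: dict mapping gene name -> list of values, one per subject
    relapsed:   list of booleans, one per subject
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, r in zip(values, relapsed) if r]
        neg = [v for v, r in zip(values, relapsed) if not r]
        scores[gene] = abs(auc(pos, neg) - 0.5)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (invented): four subjects, three hypothetical genes.
toy_relapsed = [True, True, False, False]
toy_expression = {
    "gene_a": [5.0, 6.0, 1.0, 2.0],  # higher in relapsed subjects
    "gene_b": [1.0, 3.0, 2.0, 4.0],  # weak separation
    "gene_c": [1.0, 2.0, 3.0, 4.0],  # lower in relapsed subjects
}
toy_top = rank_genes(toy_expression, toy_relapsed, top_k=2)
```

In practice the p-values would come from a statistics library and the ranking would be repeated inside each resampling fold, so that gene selection does not see the held-out subjects.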
The DT models exhibited the lowest predictive power and the poorest extrapolation when applied to the test samples. The ANN models displayed the best predictive power and the best extrapolation. In Cox regression, the 21 most-associated genes, identified by integrating the results of each model, were associated with a 3.53-fold (95% CI: 2.24-5.58) increased risk of five-year breast cancer recurrence.
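The reported hazard ratio and its confidence interval can be checked for internal consistency on the log scale. The sketch below assumes the interval is a standard Wald interval, exp(beta ± 1.96 × SE); the abstract does not state how the interval was computed. Under that assumption the point estimate should be close to the geometric mean of the two bounds. The function name is hypothetical.

```python
import math


def log_scale_summary(hr, ci_low, ci_high, z=1.96):
    """Return (log hazard ratio, implied standard error, geometric midpoint of the CI).

    Assumes a Wald interval exp(log_hr +/- z * se); this is an assumption,
    not something stated in the source.
    """
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    midpoint_hr = math.exp((math.log(ci_low) + math.log(ci_high)) / 2)
    return log_hr, se, midpoint_hr


# Values taken from the abstract: HR 3.53, 95% CI 2.24-5.58.
log_hr, se, midpoint_hr = log_scale_summary(3.53, 2.24, 5.58)
```

With the abstract's values, the geometric midpoint of the bounds is approximately 3.54, which agrees with the reported 3.53 up to rounding, and the implied standard error of the log hazard ratio is approximately 0.23.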
The 21 selected genes can predict breast cancer recurrence. Among these genes, CCNB1, PLK1 and TOP2A are in the cell cycle G2/M DNA damage checkpoint pathway. An understanding of the gene expression profiles associated with breast cancer recurrence allows oncologists to offer this genetic information to patients.