Hira Zena M, Gillies Duncan F
Department of Computing, Imperial College London, London, UK.
Cancer Inform. 2016 Sep 20;15:189-98. doi: 10.4137/CIN.S39859. eCollection 2016.
To provide the most effective therapy for cancer, it is important to be able to determine whether a patient's cancer will respond to a proposed treatment. Methylation profiles may contain information from which such predictions can be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that act together within sets of cancer data and to associate their behavior with cancer progression. These methods have the advantage of being multivariate and unbiased, but they also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of using prior knowledge to segment microarray datasets so that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train a classifier on it quickly, test its classification accuracy, and determine whether it carries valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
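The pathway-wise procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the pathway names and probe indices in `PATHWAYS` are hypothetical, the weak learners are restricted to depth-one decision trees (stumps), and a single train/test split stands in for whatever evaluation protocol the study used. All function names are illustrative.

```python
import math
import random

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best depth-one tree."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] > thr else -sign

def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost with stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Up-weight misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in model)
    return 1 if score >= 0 else -1

def make_data(n=120, seed=0):
    """Synthetic 'methylation' matrix: probe 0 is informative, probes 1-5 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        informative = (i * 37 % n) / n  # deterministic spread over [0, 1)
        X.append([informative] + [rng.random() for _ in range(5)])
        y.append(1 if informative > 0.5 else -1)
    return X, y

# Hypothetical pathway-to-probe mapping (column indices into the data matrix).
PATHWAYS = {"pathway_A": [0, 1], "pathway_B": [2, 3], "pathway_C": [4, 5]}

def score_pathways(X, y, pathways, n_train=80):
    """Train one boosted classifier per pathway subset; return held-out accuracy."""
    scores = {}
    for name, cols in pathways.items():
        sub = [[row[c] for c in cols] for row in X]
        model = adaboost_fit(sub[:n_train], y[:n_train])
        hits = sum(adaboost_predict(model, r) == t
                   for r, t in zip(sub[n_train:], y[n_train:]))
        scores[name] = hits / (len(X) - n_train)
    return scores

X, y = make_data()
scores = score_pathways(X, y, PATHWAYS)
print(scores)
```

Because each subset holds only a few probes, every per-pathway classifier is cheap to train. In a fuller version, the probes from pathways whose held-out accuracy exceeds some chosen cutoff would be pooled and a final classifier trained on the combined set, mirroring the last step of the abstract.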