Wilhelm Thomas
Theoretical Systems Biology, Institute of Food Research, Norwich Research Park, Norwich NR4 7UA, UK.
BMC Bioinformatics. 2014 Jun 17;15:193. doi: 10.1186/1471-2105-15-193.
DNA methylation (DNAm) has important regulatory roles in many biological processes and diseases. It is the only epigenetic mark with a clear mechanism of mitotic inheritance and the only one easily available on a genome scale. Aberrant cytosine-phosphate-guanine (CpG) methylation has been discussed in the context of disease aetiology, especially cancer. CpG hypermethylation of promoter regions is often associated with silencing of tumour suppressor genes and hypomethylation with activation of oncogenes. Supervised principal component analysis (SPCA) is a popular machine learning method. However, in a recent application to phenotype prediction from DNAm data SPCA was inferior to the specific method EVORA.
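For readers unfamiliar with SPCA, the general idea can be sketched as follows: score each feature by its univariate association with the outcome, keep the top-scoring features, and project the samples onto the first principal component of that reduced matrix. This is a minimal illustration of generic SPCA, not the implementation used in the paper. The function name `supervised_pc_scores`, the mean-difference score, the power-iteration routine and the toy data are all assumptions made for the example.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def supervised_pc_scores(X, y, n_keep, iters=200):
    """Generic SPCA sketch (hypothetical helper, not the paper's code).

    X: list of samples, each a list of feature values (e.g. CpG beta values).
    y: list of 0/1 labels (0 = control, 1 = case).
    Returns (indices of kept features, first-PC score per sample).
    """
    n_feat = len(X[0])

    # Step 1: univariate score per feature. Assumed here: absolute
    # difference between the case mean and the control mean.
    scores = []
    for j in range(n_feat):
        cases = [row[j] for row, lab in zip(X, y) if lab == 1]
        ctrls = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(mean(cases) - mean(ctrls)))

    # Step 2: keep the n_keep best-scoring features.
    kept = sorted(range(n_feat), key=lambda j: scores[j], reverse=True)[:n_keep]
    kept.sort()

    # Step 3: centre the reduced matrix column-wise.
    mus = [mean([row[j] for row in X]) for j in kept]
    Z = [[row[j] - mu for j, mu in zip(kept, mus)] for row in X]

    # Step 4: leading principal direction via power iteration on Z^T Z.
    # The all-ones start vector is an assumption; it fails only if it is
    # exactly orthogonal to the leading direction.
    v = [1.0] * len(kept)
    for _ in range(iters):
        proj = [sum(z_k * v_k for z_k, v_k in zip(z, v)) for z in Z]
        w = [sum(z[k] * p for z, p in zip(Z, proj)) for k in range(len(kept))]
        norm = math.sqrt(sum(w_k * w_k for w_k in w))
        if norm == 0.0:
            break
        v = [w_k / norm for w_k in w]

    # Step 5: sample scores on the first principal component.
    pc = [sum(z_k * v_k for z_k, v_k in zip(z, v)) for z in Z]
    return kept, pc


# Toy data: features 0 and 1 differ between groups, feature 2 does not.
X_toy = [
    [0.10, 0.20, 0.50],
    [0.15, 0.25, 0.40],
    [0.12, 0.18, 0.60],
    [0.80, 0.70, 0.50],
    [0.85, 0.75, 0.45],
    [0.78, 0.72, 0.55],
]
y_toy = [0, 0, 0, 1, 1, 1]
kept_toy, pc_toy = supervised_pc_scores(X_toy, y_toy, n_keep=2)
```

In a full pipeline the first-PC scores would then feed a classifier or regression model, and the number of retained features would be tuned on training data.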
We present Model-Selection-SPCA (MS-SPCA), an enhanced version of SPCA. MS-SPCA applies several models that perform well in the training data to the test data and selects the very best models for final prediction based on parameters of the test data. We have applied MS-SPCA to phenotype prediction from genome-wide DNAm data. CpGs used for prediction are selected based on the quantification of three features of their methylation (average methylation difference, methylation variation difference and methylation-age correlation). We analysed four independent case-control datasets that correspond to different stages of cervical cancer: (i) cases that are currently cytologically normal but will later develop neoplastic transformations, (ii, iii) cases showing neoplastic transformations and (iv) cases with confirmed cancer. The first dataset was split into several smaller case-control datasets (samples either human papillomavirus (HPV) positive or negative). We demonstrate that cytologically normal HPV+ and HPV- samples contain DNAm patterns which are associated with later neoplastic transformations. We present evidence that DNAm patterns exist in cytologically normal HPV- samples that (i) predispose to neoplastic transformations after HPV infection and (ii) predispose to HPV infection itself. MS-SPCA performs significantly better than EVORA.
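The three CpG features named above could be quantified for a single CpG roughly as in the sketch below. The abstract does not give the exact definitions, so the choices here are assumptions: a plain difference of group means, a difference of population variances, and a Pearson correlation between methylation and age. The function names `pearson` and `cpg_features` and the example values are likewise hypothetical.

```python
import math
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cpg_features(beta, is_case, age):
    """Three illustrative features for one CpG (assumed definitions).

    beta:    methylation value of this CpG in each sample.
    is_case: 1 for cases, 0 for controls, one entry per sample.
    age:     age of each sample's donor.
    """
    cases = [b for b, c in zip(beta, is_case) if c == 1]
    ctrls = [b for b, c in zip(beta, is_case) if c == 0]
    return {
        # average methylation difference between cases and controls
        "mean_diff": mean(cases) - mean(ctrls),
        # methylation variation difference between cases and controls
        "var_diff": pvariance(cases) - pvariance(ctrls),
        # correlation of methylation with age across all samples
        "age_corr": pearson(beta, age),
    }


# Made-up values for a single CpG across six samples.
example = cpg_features(
    beta=[0.1, 0.2, 0.3, 0.6, 0.8, 1.0],
    is_case=[0, 0, 0, 1, 1, 1],
    age=[30, 40, 50, 60, 70, 80],
)
```

CpGs could then be ranked on one or more of these values before being passed to the prediction model; how the paper combines them is not stated in the abstract.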
MS-SPCA can be applied to many classification problems. Additional improvements could include the use of more than one principal component (PC), with automatic selection of the optimal number of PCs. We expect that MS-SPCA will be useful for analysing recent, larger DNAm datasets to predict future neoplastic transformations.