Moon Myungjin, Nakai Kenta
* Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-Shi, Chiba-Ken 277-8562, Japan.
† Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-Ku, Tokyo 108-8639, Japan.
J Bioinform Comput Biol. 2018 Apr;16(2):1850006. doi: 10.1142/S0219720018500063. Epub 2018 Feb 22.
Cancer biomarker discovery is an important research topic worldwide. In particular, detecting genes that are significantly related to cancer is an important task for early diagnosis and treatment. Conventional studies mostly focus on genes that are differentially expressed between different states of cancer; however, noise in gene expression datasets and the insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation, based on normalization and unsupervised feature extraction, to identify candidate cancer biomarkers using renal cell carcinoma RNA-seq datasets. The gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated, by unsupervised feature extraction methods, into a one-dimensional dataset that retains the major characteristics of the original datasets; differentially expressed genes are then selected from the integrated dataset. The integrated dataset gave improved performance compared with conventional approaches that use the gene expression or DNA methylation dataset alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be obtained with the proposed analysis method. Furthermore, we expect that the proposed method can be extended to applications involving various types of multi-omics datasets.
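The pipeline summarized in the abstract (Box-Cox normalization, integration of two data types into one dimension, ranking of genes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the unsupervised feature extraction method or the per-gene input values, so the sketch assumes principal component analysis (first component only) and assumes that each gene is summarized by one positive expression value and one positive methylation value (for example a tumor-to-normal ratio). The function names `boxcox_standardize`, `integrate_first_component`, and `top_genes` are hypothetical, and ranking by absolute score is likewise an assumption.

```python
import numpy as np
from scipy import stats


def boxcox_standardize(values):
    """Box-Cox transform strictly positive values, then scale to zero mean, unit variance."""
    values = np.asarray(values, dtype=float)
    # With lmbda left unset, scipy estimates lambda by maximum likelihood.
    transformed, _fitted_lambda = stats.boxcox(values)
    return (transformed - transformed.mean()) / transformed.std()


def integrate_first_component(expr_norm, meth_norm):
    """Collapse per-gene (expression, methylation) pairs onto their first principal axis.

    ASSUMPTION: PCA stands in for the unspecified unsupervised feature extraction.
    """
    features = np.column_stack([expr_norm, meth_norm])
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def top_genes(integrated, gene_names, k):
    """Return the k gene names with the largest absolute integrated score.

    ASSUMPTION: absolute score is used as the ranking criterion; the sign of a
    principal component is arbitrary, so only the magnitude is meaningful here.
    """
    order = np.argsort(-np.abs(integrated))[:k]
    return [gene_names[i] for i in order]


if __name__ == "__main__":
    # Synthetic positive values stand in for real per-gene summaries.
    rng = np.random.default_rng(0)
    names = [f"gene_{i}" for i in range(200)]
    expr = boxcox_standardize(rng.lognormal(0.0, 1.0, size=200))
    meth = boxcox_standardize(rng.lognormal(0.0, 0.5, size=200))
    scores = integrate_first_component(expr, meth)
    print(top_genes(scores, names, 10))
```

Standardizing both inputs before the projection keeps either data type from dominating the first component purely because of its scale; whether the original study weights the two data types equally is not stated in the abstract.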