Kawamura Takashi, Mutoh Hironori, Tomita Yasuyuki, Kato Ryuji, Honda Hiroyuki
Department of Biotechnology, School of Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan.
J Biosci Bioeng. 2008 Nov;106(5):442-8. doi: 10.1263/jbb.106.442.
It is well known that various genes related to the cell cycle, cell-cell adhesion, and transcriptional regulation cause the onset of cancer. Environmental factors, including age, sex, and lifestyle, can also contribute to the onset of cancer. It is therefore difficult to ascertain which factors influence the onset, and patients suffering from the same disease can be divided into several distinct groups. In the present study, we applied graph-based clustering to several DNA microarray datasets before the classification analysis. The clusters formed by the graph-based clustering were used to construct a multi-class classification model with the k-nearest neighbor method and to find genes specific to a certain cluster by One vs. Others classification. Using this approach, classification models were constructed for four microarray datasets (leukemia, breast cancer, prostate cancer, and colon cancer), and the classification accuracies with k-nearest neighbor were all above 80%. In the breast cancer dataset, we succeeded in finding genes that are specific to a cluster consisting of 38 control group samples. These results indicate the importance of sample clustering before classification model construction.
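The workflow described in the abstract (cluster the samples first, then classify into the clusters, then look for cluster-specific genes) can be illustrated with a minimal sketch. The abstract does not specify which graph-based clustering algorithm, distance measure, or gene-ranking statistic was used, so everything below is an assumption made for illustration: clusters are taken as connected components of a distance-threshold graph, k-nearest neighbor accuracy is estimated by leave-one-out, and the One vs. Others step uses a simple signal-to-noise score. The data are synthetic, and all function names and the threshold value are hypothetical.

```python
import math
import random
from collections import Counter

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def graph_clusters(samples, threshold):
    """Assumed clustering rule: link samples closer than `threshold` and
    return one cluster id per sample (connected components of that graph)."""
    labels = [-1] * len(samples)
    current = 0
    for start in range(len(samples)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first walk over the implicit graph
            i = stack.pop()
            for j in range(len(samples)):
                if labels[j] == -1 and distance(samples[i], samples[j]) < threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def loo_accuracy(samples, labels, k=3):
    """Leave-one-out accuracy of k-NN when predicting cluster membership."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += knn_predict(train_x, train_y, samples[i], k) == labels[i]
    return hits / len(samples)

def one_vs_others_scores(samples, labels, target):
    """Assumed gene score: signal-to-noise of the target cluster vs. all others."""
    inside = [s for s, lab in zip(samples, labels) if lab == target]
    outside = [s for s, lab in zip(samples, labels) if lab != target]
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s in inside]
        b = [s[g] for s in outside]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / len(a))
        sd_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / len(b))
        scores.append((mean_a - mean_b) / (sd_a + sd_b + 1e-9))
    return scores

# Synthetic data: 20 samples x 5 genes; gene 0 is shifted in the last 10 samples.
rng = random.Random(0)
n_genes = 5
samples = [[rng.gauss(0.0, 0.3) for _ in range(n_genes)] for _ in range(10)]
samples += [[rng.gauss(5.0 if g == 0 else 0.0, 0.3) for g in range(n_genes)]
            for _ in range(10)]

labels = graph_clusters(samples, threshold=3.0)      # step 1: cluster samples
accuracy = loo_accuracy(samples, labels, k=3)        # step 2: multi-class k-NN
scores = one_vs_others_scores(samples, labels, 1)    # step 3: cluster-specific genes
top_gene = max(range(n_genes), key=lambda g: abs(scores[g]))

print("clusters found:", len(set(labels)))
print("leave-one-out k-NN accuracy:", accuracy)
print("top gene for cluster 1:", top_gene)
```

On real microarray data the threshold, the value of k, and the scoring statistic would all need to be chosen and validated for the dataset at hand; the sketch only shows how the three steps fit together.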