Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes.

Author information

Maulik Ujjwal, Mukhopadhyay Anirban, Bandyopadhyay Sanghamitra

Affiliation

Department of Computer Science and Engineering, Jadavpur University, Kolkata, India.

Publication information

BMC Bioinformatics. 2009 Jan 20;10:27. doi: 10.1186/1471-2105-10-27.

Abstract

BACKGROUND

The landscape of biological and biomedical research is changing rapidly with the invention of microarrays, which enable a simultaneous view of the transcription levels of a huge number of genes across different experimental conditions or time points. Clustering algorithms have been actively applied to microarray data sets to identify groups of co-expressed genes. This article poses the problem of fuzzy clustering in microarray data as a multiobjective optimization problem that simultaneously optimizes two internal fuzzy cluster validity indices to yield a set of Pareto-optimal clustering solutions. Each of these solutions carries some information about the clustering structure of the input data. Motivated by this fact, a novel fuzzy majority voting approach is proposed to combine the clustering information from all the solutions in the resultant Pareto-optimal set. This approach first identifies the genes that most of the Pareto-optimal solutions assign to a particular cluster with a high membership degree. Using this set of genes as the training set, the remaining genes are classified by a supervised learning algorithm. In this work, a Support Vector Machine (SVM) classifier is used for this purpose.
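The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuzzy_majority_vote`, the thresholds `alpha` (minimum membership degree) and `beta` (minimum fraction of agreeing solutions), and the toy membership values are all assumptions made for illustration. The sketch also assumes that cluster labels have already been aligned across the Pareto-optimal solutions, which the abstract does not discuss.

```python
from collections import Counter


def fuzzy_majority_vote(memberships, alpha=0.5, beta=0.5):
    """Select genes that most solutions assign to one cluster with high membership.

    memberships[s][g][k] is the membership degree of gene g in cluster k under
    solution s. Cluster indices are assumed to be aligned across solutions.
    alpha and beta are hypothetical thresholds, not values from the paper.
    """
    n_solutions = len(memberships)
    n_genes = len(memberships[0])
    train_labels = {}   # gene index -> cluster agreed on by the majority
    unassigned = []     # genes left for the supervised classifier
    for g in range(n_genes):
        votes = Counter()
        for solution in memberships:
            row = solution[g]
            best = max(range(len(row)), key=row.__getitem__)
            # A solution votes only if its best membership is high enough.
            if row[best] >= alpha:
                votes[best] += 1
        if votes:
            cluster, count = votes.most_common(1)[0]
            if count / n_solutions > beta:
                train_labels[g] = cluster
                continue
        unassigned.append(g)
    return train_labels, unassigned


# Toy input: 3 solutions, 4 genes, 2 clusters (values invented for illustration).
solutions = [
    [[0.90, 0.10], [0.20, 0.80], [0.55, 0.45], [0.60, 0.40]],
    [[0.80, 0.20], [0.10, 0.90], [0.45, 0.55], [0.75, 0.25]],
    [[0.95, 0.05], [0.25, 0.75], [0.50, 0.50], [0.40, 0.60]],
]
train_labels, unassigned = fuzzy_majority_vote(solutions, alpha=0.7, beta=0.5)
print(train_labels)  # genes with a confident majority label
print(unassigned)    # genes to be classified in the supervised step
```

In a full pipeline, the expression profiles of the genes in `train_labels` would serve as the training set for a classifier (for example, scikit-learn's `sklearn.svm.SVC`, chosen here only as one readily available SVM), which would then assign clusters to the genes in `unassigned`.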

RESULTS

The performance of the proposed clustering technique has been demonstrated on five publicly available benchmark microarray data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis thaliana, Human Fibroblasts Serum and Rat Central Nervous System. Comparative studies of different SVM kernels and of several widely used microarray clustering techniques are reported. Moreover, statistical significance tests have been carried out to establish the statistical superiority of the proposed clustering approach. Finally, biological significance tests have been carried out using a web-based gene annotation tool to show that the proposed method is able to produce biologically relevant clusters of co-expressed genes.

CONCLUSION

The proposed clustering method has been shown to perform better than other well-known clustering algorithms in finding clusters of co-expressed genes. The clusters of genes produced by the proposed technique are also found to be biologically significant, i.e., they consist of genes that belong to the same functional groups. This indicates that the proposed clustering method can be used efficiently to identify co-expressed genes in microarray gene expression data.

Supplementary Website: The pre-processed and normalized data sets, the MATLAB code and other related materials are available at http://anirbanmukhopadhyay.50webs.com/mogasvm.html.
