School of Statistics, Southwestern University of Finance and Economics, Chengdu, Sichuan 611130, China; Department of Statistics, Iowa State University, Ames, IA 50011, USA; Institute of Tropical Biosciences and Biotechnology (ITBB), Chinese Academy of Tropical Agriculture Sciences (CATAS), Haikou, Hainan 571101, China; and Enterprise Institute for Renewable Fuels, Donald Danforth Plant Science Center, St. Louis, MO 63132, USA.
Bioinformatics. 2014 Jan 15;30(2):197-205. doi: 10.1093/bioinformatics/btt632. Epub 2013 Nov 4.
RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis.
In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. We describe an expectation-maximization (EM) algorithm and two stochastic variants of it. In addition, we propose a likelihood-based initialization strategy to improve the clustering algorithms. We also present a model-based hybrid-hierarchical clustering method that generates a tree structure, which allows the relationships among clusters to be visualized and gives flexibility in choosing the number of clusters. Results from both simulation studies and the analysis of a maize RNA-seq dataset show that the proposed methods provide better clustering results than alternative methods that are not based on probability models, such as the K-means algorithm and hierarchical clustering.
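As a rough illustration of model-based clustering of count data with an EM algorithm, the sketch below fits a mixture of multinomial expression profiles to a small gene-by-treatment count matrix. It is not the model or code of MBCluster.Seq. It assumes that, conditional on a gene's total count, its counts across treatments follow a multinomial distribution whose proportions depend only on the cluster. The function name `em_cluster`, the farthest-first starting profiles (used here in place of the likelihood-based initialization described above), and the pseudocount values are all hypothetical choices made for this example.

```python
import math


def em_cluster(counts, k, n_iter=50):
    """EM for a mixture of multinomial profiles (illustrative assumption only)."""
    n_genes = len(counts)
    n_treat = len(counts[0])

    # Normalize each gene to proportions, with a pseudocount to avoid zeros.
    def normalize(y, pseudo=0.5):
        total = sum(y) + pseudo * len(y)
        return [(v + pseudo) / total for v in y]

    props = [normalize(y) for y in counts]

    # Farthest-first initialization (hypothetical, deterministic choice):
    # start from gene 0, then repeatedly add the most distant profile.
    profiles = [props[0]]
    while len(profiles) < k:
        far = max(
            props,
            key=lambda p: min(
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in profiles
            ),
        )
        profiles.append(far)
    weights = [1.0 / k] * k

    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene,
        # computed on the log scale for numerical stability.
        resp = []
        for y in counts:
            logp = [
                math.log(weights[c])
                + sum(yj * math.log(pj) for yj, pj in zip(y, profiles[c]))
                for c in range(k)
            ]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            s = sum(w)
            resp.append([v / s for v in w])

        # M-step: update mixing weights and cluster profiles.
        for c in range(k):
            rc = sum(r[c] for r in resp)
            weights[c] = max(rc / n_genes, 1e-12)  # floor keeps log() finite
            num = [
                sum(r[c] * y[j] for r, y in zip(resp, counts)) + 1e-6
                for j in range(n_treat)
            ]
            tot = sum(num)
            profiles[c] = [v / tot for v in num]

    # Hard assignment: the cluster with the largest posterior probability.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, profiles, weights
```

Calling `em_cluster(counts, 2)` on a list of per-gene count rows returns a hard cluster label per gene together with the fitted cluster profiles and mixing weights. Conditioning on each gene's total count is one simple way to make clusters reflect the shape of expression across treatments rather than overall abundance.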
An R package, MBCluster.Seq, has been developed to implement the proposed algorithms. The package provides fast computation and is publicly available at http://www.r-project.org.