Schliep Alexander, Costa Ivan G, Steinhoff Christine, Schönhuth Alexander
Max Planck Institute for Molecular Genetics, Berlin, Germany.
IEEE/ACM Trans Comput Biol Bioinform. 2005 Jul-Sep;2(3):179-93. doi: 10.1109/TCBB.2005.31.
Measuring gene expression over time can provide important insights into basic cellular processes. Identifying groups of genes with similar expression time-courses is a crucial first step in the analysis. As biologically relevant groups frequently overlap, due to genes having several distinct roles in those cellular processes, this is a difficult problem for classical clustering methods. We use a mixture model to circumvent this principal problem, with hidden Markov models (HMMs) as effective and flexible components. We show that the ensuing estimation problem can be addressed with additional labeled data (partially supervised learning of mixtures) through a modification of the Expectation-Maximization (EM) algorithm. Good starting points for the mixture estimation are obtained through a modification to Bayesian model merging, which allows us to learn a collection of initial HMMs. We infer groups from mixtures with a simple information-theoretic decoding heuristic, which quantifies the level of ambiguity in group assignment. The effectiveness is shown with high-quality annotation data. As the HMMs we propose capture asynchronous behavior by design, the groups we find are also asynchronous. Synchronous subgroups are obtained from a novel algorithm based on Viterbi paths. We show the suitability of our HMM mixture approach on biological and simulated data and through favorable comparison with previous approaches. Software implementing the method is freely available under the GPL from http://ghmm.org/gql.
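The information-theoretic decoding step can be illustrated with a minimal sketch. It assumes that "level of ambiguity" means the Shannon entropy of each gene's posterior distribution over mixture components, and that a gene is assigned to its most probable component only when that entropy is below a threshold. The function names, the threshold value, and the posterior values are hypothetical illustrations; this is not the GQL implementation.

```python
import math


def posterior_entropy(posterior):
    """Shannon entropy (in bits) of a discrete posterior over components."""
    return -sum(p * math.log2(p) for p in posterior if p > 0.0)


def decode(posteriors, max_entropy=0.5):
    """Assign each gene to its most probable component.

    A gene whose posterior entropy exceeds max_entropy (an assumed,
    illustrative threshold) is treated as ambiguous and left unassigned (None).
    """
    assignments = []
    for post in posteriors:
        if posterior_entropy(post) <= max_entropy:
            # Index of the component with the highest posterior probability.
            assignments.append(max(range(len(post)), key=post.__getitem__))
        else:
            assignments.append(None)
    return assignments


# Made-up posteriors for three genes over three mixture components.
posteriors = [
    [0.98, 0.01, 0.01],  # clearly component 0
    [0.50, 0.45, 0.05],  # split between components 0 and 1
    [0.02, 0.03, 0.95],  # clearly component 2
]
print(decode(posteriors))  # -> [0, None, 2]
```

In this sketch the second gene stays unassigned because its posterior mass is split between two components, which is the kind of overlap a hard clustering would hide.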