Dai Hongying, Charnigo Richard
Research Development and Clinical Investigation, Children's Mercy Hospital, Kansas City, MO 64108, USA and Department of Biomedical & Health Informatics, University of Missouri-Kansas City, Kansas City, MO 64110, USA
Department of Statistics, University of Kentucky, Lexington, KY 40506, USA.
Biostatistics. 2015 Oct;16(4):641-54. doi: 10.1093/biostatistics/kxv016. Epub 2015 May 11.
Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, [Formula: see text], is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM with [Formula: see text] brings more flexibility in explaining correlation variations among genetic variables. The Expectation-Maximization (EM) algorithm and the Stochastic Expectation-Maximization (SEM) algorithm are used to estimate the parameters of the CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that the CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor-DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820). The results reveal distinct clusters of transcription factors when binding to promoter regions of genes in the JAK-STAT, MAPK and two other pathways.
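As an illustration of how the number of mixture components might be chosen with the criteria named in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes each candidate model has already been fitted (for example by EM or SEM) and that its maximized log-likelihood, its number of free parameters and its posterior membership probabilities are available. The function names and the `fits` dictionary layout are hypothetical. ICL-BIC is written in the commonly used form of BIC plus twice the classification entropy, with smaller values preferred for all three criteria.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion (smaller is better)."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def classification_entropy(posteriors):
    """Entropy of the soft assignments: -sum_i sum_k z_ik * log(z_ik).

    `posteriors` is a list of rows, one per observation, each holding the
    posterior membership probabilities over the mixture components.
    Zero probabilities contribute nothing (the limit of z * log z is 0).
    """
    total = 0.0
    for row in posteriors:
        for z in row:
            if z > 0.0:
                total -= z * math.log(z)
    return total


def icl_bic(log_lik, n_params, n_obs, posteriors):
    """ICL-BIC: BIC plus twice the classification entropy (assumed form)."""
    return bic(log_lik, n_params, n_obs) + 2.0 * classification_entropy(posteriors)


def select_n_components(fits, n_obs):
    """Return the component count with the smallest ICL-BIC, plus all scores.

    `fits` (hypothetical layout) maps a number of components to a tuple
    (log_lik, n_params, posteriors) taken from an already fitted model.
    """
    scores = {
        g: icl_bic(log_lik, n_params, n_obs, post)
        for g, (log_lik, n_params, post) in fits.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

Because the entropy term grows when observations are assigned ambiguously, ICL-BIC tends to favor models with well-separated components, whereas AIC and BIC penalize only the parameter count.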