Compound hierarchical correlated beta mixture with an application to cluster mouse transcription factor DNA binding data.

Authors

Dai Hongying, Charnigo Richard

Affiliations

Research Development and Clinical Investigation, Children's Mercy Hospital, Kansas City, MO 64108, USA and Department of Biomedical & Health Informatics, University of Missouri-Kansas City, Kansas City, MO 64110, USA

Department of Statistics, University of Kentucky, Lexington, KY 40506, USA.

Publication information

Biostatistics. 2015 Oct;16(4):641-54. doi: 10.1093/biostatistics/kxv016. Epub 2015 May 11.

DOI:10.1093/biostatistics/kxv016

PMID:25964663

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4701176/

Abstract

Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, [Formula: see text], is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM with [Formula: see text] brings more flexibility in explaining correlation variations among genetic variables. The Expectation-Maximization (EM) and Stochastic Expectation-Maximization (SEM) algorithms are used to estimate the parameters of the CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor-DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3(3), e1820). The results reveal distinct clusters of transcription factors when binding to promoter regions of genes in the JAK-STAT, MAPK and two other pathways.
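
The abstract names AIC, BIC and ICL-BIC as criteria for choosing the number of mixture components. As a rough illustration only, the sketch below computes the three criteria in their commonly used forms from a fitted mixture's log-likelihood and posterior membership probabilities. The function name, argument names and numbers are hypothetical, and the abstract does not say how the paper counts free parameters or implements these criteria, so this should not be read as the authors' code.

```python
import math


def model_selection_criteria(log_lik, n_params, n_obs, responsibilities):
    """Return AIC, BIC and ICL-BIC (smaller is better) for a fitted mixture.

    responsibilities[i][k] is the posterior probability that observation i
    belongs to component k, so each row sums to 1.
    """
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    # Classification entropy: zero when every observation is assigned
    # to a single component with certainty, larger when clusters overlap.
    entropy = -sum(
        t * math.log(t) for row in responsibilities for t in row if t > 0.0
    )
    icl_bic = bic + 2.0 * entropy
    return {"AIC": aic, "BIC": bic, "ICL-BIC": icl_bic}


# Hypothetical inputs, for illustration only (not results from the paper).
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5] for _ in range(4)]
crisp_scores = model_selection_criteria(-10.0, 5, 4, crisp)
fuzzy_scores = model_selection_criteria(-10.0, 5, 4, fuzzy)
```

In practice one would fit the model for several candidate numbers of components and keep the fit with the smallest value of the chosen criterion. With identical likelihoods, ICL-BIC penalizes the fit with uncertain memberships, whereas AIC and BIC cannot distinguish the two.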

Similar articles

A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data.

BMC Bioinformatics. 2009 May 29;10:165. doi: 10.1186/1471-2105-10-165.

Epitope profiling via mixture modeling of ranked data.

Stat Med. 2014 Sep 20;33(21):3738-58. doi: 10.1002/sim.6224. Epub 2014 Jun 5.

Genetic-based EM algorithm for learning Gaussian mixture models.

IEEE Trans Pattern Anal Mach Intell. 2005 Aug;27(8):1344-8. doi: 10.1109/TPAMI.2005.162.

Learning Gaussian mixture models with entropy-based criteria.

IEEE Trans Neural Netw. 2009 Nov;20(11):1756-71. doi: 10.1109/TNN.2009.2030190. Epub 2009 Sep 18.

Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model.

BMC Bioinformatics. 2007 Aug 2;8:282. doi: 10.1186/1471-2105-8-282.

Applications of beta-mixture models in bioinformatics.

Bioinformatics. 2005 May 1;21(9):2118-22. doi: 10.1093/bioinformatics/bti318. Epub 2005 Feb 15.

Part 1. Statistical Learning Methods for the Effects of Multiple Air Pollution Constituents.

Res Rep Health Eff Inst. 2015 Jun(183 Pt 1-2):5-50.

A mixture model with random-effects components for clustering correlated gene-expression profiles.

Bioinformatics. 2006 Jul 15;22(14):1745-52. doi: 10.1093/bioinformatics/btl165. Epub 2006 May 3.

CHull as an alternative to AIC and BIC in the context of mixtures of factor analyzers.

Behav Res Methods. 2013 Sep;45(3):782-91. doi: 10.3758/s13428-012-0293-y.

References cited in this article

IntPath--an integrated pathway gene relationship database for model organisms and important pathogens.

BMC Syst Biol. 2012;6 Suppl 2(Suppl 2):S2. doi: 10.1186/1752-0509-6-S2-S2. Epub 2012 Dec 12.

A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data.

Bioinformatics. 2013 Jan 15;29(2):189-96. doi: 10.1093/bioinformatics/bts680. Epub 2012 Nov 21.

Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements.

BMC Biol. 2011 Nov 24;9:80. doi: 10.1186/1741-7007-9-80.

A beta-mixture model for dimensionality reduction, sample classification and analysis.

BMC Bioinformatics. 2011 May 27;12:215. doi: 10.1186/1471-2105-12-215.

Bayesian estimation of beta mixture models with variational inference.

IEEE Trans Pattern Anal Mach Intell. 2011 Nov;33(11):2160-73. doi: 10.1109/TPAMI.2011.63.

A Beta-mixture model for assessing genetic population structure.

Biometrics. 2011 Sep;67(3):1073-82. doi: 10.1111/j.1541-0420.2010.01506.x. Epub 2010 Nov 29.

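A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data.

BMC Bioinformatics. 2009 May 29;10:165. doi: 10.1186/1471-2105-10-165.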

Probabilistic inference of transcription factor binding from multiple data sources.

PLoS One. 2008 Mar 26;3(3):e1820. doi: 10.1371/journal.pone.0001820.

A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome.

Nucleic Acids Res. 2006 May 2;34(8):2238-46. doi: 10.1093/nar/gkl248. Print 2006.

ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation.

Bioinformatics. 2006 Mar 1;22(5):637-40. doi: 10.1093/bioinformatics/btk027. Epub 2006 Jan 5.
