Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements.

Affiliation

Systems Biology Centre, University of Warwick, Coventry, UK.

Publication information

BMC Bioinformatics. 2011 Oct 13;12:399. doi: 10.1186/1471-2105-12-399.

PMID:21995452

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3228548/

Abstract

BACKGROUND

Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.

RESULTS

We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and it adopts an empirical Bayes approach that uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher-quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We also demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles.
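To make the core idea concrete, the following is a minimal Python sketch, not the authors' implementation (which is the R package 'BHC'). It assumes a squared-exponential covariance, fixed hyperparameters, and a pooled replicate variance plugged in directly as the noise variance rather than used to build a prior. It also omits the outlier mixture component and the Dirichlet process weighting that a full Bayesian hierarchical clustering merge test would include. All function names (`replicate_noise_variance`, `gp_log_marginal`, `merge_score`) are hypothetical.

```python
import numpy as np

def replicate_noise_variance(reps):
    """Pooled noise variance from replicates.

    reps : array of shape (genes, replicates, time points).
    Simplification: the result is used as a point estimate of the noise
    variance, whereas the abstract describes using it to inform a prior.
    """
    return float(np.var(reps, axis=1, ddof=1).mean())

def sq_exp_kernel(t, length_scale, signal_var):
    """Squared-exponential covariance between all pairs of time points."""
    d = t[:, None] - t[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_log_marginal(t, Y, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Log marginal likelihood of a cluster sharing one latent GP curve.

    t : (T,) time points, which may be non-uniformly spaced.
    Y : (G, T) expression values, one row per gene in the cluster.
    """
    G, T = Y.shape
    t_all = np.tile(t, G)   # time stamp of every stacked observation
    y_all = Y.ravel()       # gene 0's series, then gene 1's, and so on
    K = sq_exp_kernel(t_all, length_scale, signal_var)
    K += noise_var * np.eye(G * T)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_all))
    return float(-0.5 * y_all @ alpha
                 - np.log(np.diag(L)).sum()
                 - 0.5 * G * T * np.log(2.0 * np.pi))

def merge_score(t, Y_a, Y_b, **hyper):
    """Log evidence for one shared curve versus two independent curves."""
    merged = np.vstack([Y_a, Y_b])
    return (gp_log_marginal(t, merged, **hyper)
            - gp_log_marginal(t, Y_a, **hyper)
            - gp_log_marginal(t, Y_b, **hyper))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.array([0.0, 0.5, 1.0, 2.0, 3.0, 4.5, 6.0])  # non-uniform sampling
    # Two replicates per gene, used only to estimate the noise level.
    reps = np.sin(t)[None, None, :] + 0.2 * rng.standard_normal((3, 2, t.size))
    noise = replicate_noise_variance(reps)
    gene_a = (np.sin(t) + 0.2 * rng.standard_normal(t.size))[None, :]
    gene_b = (np.sin(t) + 0.2 * rng.standard_normal(t.size))[None, :]
    gene_c = (-np.sin(t) + 0.2 * rng.standard_normal(t.size))[None, :]
    print("similar profiles:   ", merge_score(t, gene_a, gene_b, noise_var=noise))
    print("dissimilar profiles:", merge_score(t, gene_a, gene_c, noise_var=noise))
```

In an agglomerative scheme of this kind, the pair of clusters with the highest merge score would be joined at each step. Genes with similar profiles should score higher than genes with opposing profiles, because a single latent curve explains the former well and the latter poorly.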

CONCLUSIONS

By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies. Timeseries BHC is part of the R package 'BHC' (version 1.5), which can be downloaded from Bioconductor (version 2.9 and above) at http://www.bioconductor.org/packages/release/bioc/html/BHC.html?pagewanted=all.
