A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge.

Authors

Du Jiang, Rozowsky Joel S, Korbel Jan O, Zhang Zhengdong D, Royce Thomas E, Schultz Martin H, Snyder Michael, Gerstein Mark

Affiliation

Department of Computer Science, Yale University, New Haven, CT 06520, USA.

Publication

Bioinformatics. 2006 Dec 15;22(24):3016-24. doi: 10.1093/bioinformatics/btl515. Epub 2006 Oct 12.

Abstract

MOTIVATION

Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into 'active regions' (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion mainly based on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing.

METHODOLOGY

In particular, we use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of state duration density. We introduce a formal definition of the tiling-array analysis problem, and explain how we can use this to describe sampling small genomic regions for experimental validation to build up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively).
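The supervised idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-state layout (0 = background, 1 = active), the Gaussian emission model, the add-one smoothing, and all function names are assumptions made for illustration, and the state-duration modeling of the generalized HMM is left out. Parameters are estimated by counting over probes with validated labels, and a new probe series is then segmented with Viterbi decoding.

```python
import math

def train_supervised_hmm(signals, labels, n_states=2):
    """Estimate HMM parameters by counting over labelled probes.

    signals: list of probe-intensity sequences (hypothetical gold-standard regions)
    labels:  matching state sequences (assumed: 0 = background, 1 = active)
    """
    start = [1.0] * n_states                              # add-one smoothing
    trans = [[1.0] * n_states for _ in range(n_states)]   # add-one smoothing
    values = [[] for _ in range(n_states)]
    for sig, lab in zip(signals, labels):
        start[lab[0]] += 1
        for i, (x, s) in enumerate(zip(sig, lab)):
            values[s].append(x)
            if i > 0:
                trans[lab[i - 1]][s] += 1
    start = [c / sum(start) for c in start]
    trans = [[c / sum(row) for c in row] for row in trans]
    emit = []
    for vals in values:
        # Assumes every state appears at least once in the training labels.
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        emit.append((mean, max(var, 1e-3)))               # variance floor
    return start, trans, emit

def log_gauss(x, mean, var):
    """Log density of a Gaussian emission."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(signal, model):
    """Most likely state path for one probe series, computed in log space."""
    start, trans, emit = model
    n = len(start)
    score = [math.log(start[s]) + log_gauss(signal[0], *emit[s]) for s in range(n)]
    back = []
    for x in signal[1:]:
        prev, ptr, score = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + log_gauss(x, *emit[s]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):                            # trace back pointers
        state = ptr[state]
        path.append(state)
    return path[::-1]

def active_segments(path, active=1):
    """Collapse a state path into half-open (start, end) probe intervals."""
    segs, begin = [], None
    for i, s in enumerate(path + [None]):                 # None closes a trailing run
        if s == active and begin is None:
            begin = i
        elif s != active and begin is not None:
            segs.append((begin, i))
            begin = None
    return segs

# Toy, made-up training data standing in for a validated gold-standard set.
train_signals = [[0.1, 0.2, 0.0, 2.1, 1.9, 2.0, 2.2, 0.1, -0.1, 0.2]]
train_labels = [[0, 0, 0, 1, 1, 1, 1, 0, 0, 0]]
model = train_supervised_hmm(train_signals, train_labels)
path = viterbi([0.0, 0.1, 2.0, 2.1, 1.8, 0.2, 0.1], model)
print(path, active_segments(path))
```

Because the transition probabilities are learned from labelled neighbors, the decoder takes the dependency between adjacent probes into account instead of thresholding each probe independently.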

RESULTS

For the practical sampling and training strategies, we show how the size of and the noise in the validated training data affect the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework processes tiling array data efficiently and performs as well as or better than previous approaches. For the idealized sampling strategies, we show how their performance can be assessed in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the best-performing gold standard. This latter result has strong implications for the optimal way to carry out medium-scale validation experiments that verify the results of genome-scale tiling array experiments.
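One way to picture the maximum entropy sampling idea is the sketch below. It is only an illustration under stated assumptions: the abstract does not give the exact entropy definition, so the fixed-width binning over the global signal range, the window length, and the function names are all hypothetical. The code scores each candidate window by the Shannon entropy of its binned intensities and returns the window whose probes are spread most evenly across intensity levels.

```python
import math
from collections import Counter

def window_entropy(values, lo, hi, n_bins=4):
    """Shannon entropy (bits) of values binned into n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins or 1.0          # guard against a constant signal
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def max_entropy_window(signal, window=4, n_bins=4):
    """Return (start index, entropy) of the highest-entropy window of probes."""
    lo, hi = min(signal), max(signal)
    best_start, best_h = 0, -1.0
    for start in range(len(signal) - window + 1):
        h = window_entropy(signal[start:start + window], lo, hi, n_bins)
        if h > best_h:                          # ties keep the earliest window
            best_start, best_h = start, h
    return best_start, best_h

# Made-up probe intensities: flat low signal, a ramp, then flat high signal.
signal = [0.0, 0.1, 0.0, 0.1, 0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0]
print(max_entropy_window(signal))
```

In this toy series the selected window covers the ramp, where intensities differ most, rather than either flat stretch, which matches the intuition that validating regions with very different signal levels is most informative for training.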
