Diaz Aaron, Park Kiyoub, Lim Daniel A, Song Jun S
University of California, San Francisco, USA.
Stat Appl Genet Mol Biol. 2012 Mar 31;11(3):Article 9. doi: 10.1515/1544-6115.1750.
Next-generation sequencing is rapidly transforming our ability to profile the transcriptional, genetic, and epigenetic states of a cell. In particular, sequencing DNA from the immunoprecipitation of protein-DNA complexes (ChIP-seq) and methylated DNA (MeDIP-seq) can reveal the locations of protein binding sites and epigenetic modifications. These approaches are subject to numerous biases that can significantly influence the interpretation of the resulting data. Rigorous computational methods for detecting and removing such biases are still lacking, and multi-sample normalization remains an important open problem. This theoretical paper systematically characterizes the biases and properties of ChIP-seq data by comparing 62 separate publicly available datasets, using rigorous statistical models and signal processing techniques. Statistical methods for separating ChIP-seq signal from background noise, as well as correcting enrichment test statistics for sequence-dependent and sonication biases, are presented. Our method effectively separates reads into signal and background components prior to normalization, improving the signal-to-noise ratio. Moreover, most peak callers currently use a generic null model, which suffers from low specificity at the sensitivity level required to detect subtle, but true, ChIP enrichment. The proposed method of determining a cell type-specific null model, which accounts for cell type-specific biases, is shown to achieve a lower false discovery rate at a given significance threshold than current methods.
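The abstract does not spell out how reads are split into signal and background before normalization. As an illustration only, the following is a minimal sketch of one plausible heuristic, assuming reads have already been counted in fixed-width genomic bins for both the ChIP sample and an input control. Bins are ordered by ChIP count, cumulative read fractions are compared between the two samples, and the point of largest divergence is taken as the background/signal boundary. The function name, the binning, and the cutoff rule are assumptions made for this sketch and should not be read as the authors' published procedure.

```python
import numpy as np


def estimate_background_scaling(chip_counts, input_counts):
    """Hypothetical helper: estimate an input-to-ChIP scaling factor
    from bins that look like background.

    Both arguments are 1-D sequences of per-bin read counts over the
    same genomic bins (an assumption of this sketch).
    Returns (scale, background_mask).
    """
    chip = np.asarray(chip_counts, dtype=float)
    ctrl = np.asarray(input_counts, dtype=float)
    if chip.ndim != 1 or chip.shape != ctrl.shape or chip.size == 0:
        raise ValueError("expected two non-empty 1-D arrays of equal length")
    if chip.sum() == 0 or ctrl.sum() == 0:
        raise ValueError("both samples must contain at least one read")

    # Order bins from lowest to highest ChIP coverage.
    order = np.argsort(chip, kind="stable")
    chip_cum = np.cumsum(chip[order])
    ctrl_cum = np.cumsum(ctrl[order])

    # Difference between cumulative read fractions. In low-coverage bins
    # the input accumulates its reads faster than the ChIP sample, whose
    # reads are concentrated in a few enriched bins.
    gap = ctrl_cum / ctrl_cum[-1] - chip_cum / chip_cum[-1]
    k = int(np.argmax(gap))
    if ctrl_cum[k] == 0:
        raise ValueError("no background bins with input reads were found")

    # Scale the input so it matches the ChIP sample on background bins only.
    scale = chip_cum[k] / ctrl_cum[k]

    background_mask = np.zeros(chip.size, dtype=bool)
    background_mask[order[: k + 1]] = True
    return scale, background_mask


if __name__ == "__main__":
    # Toy data: eight background bins and two enriched bins.
    chip = [10, 10, 100, 10, 10, 10, 10, 100, 10, 10]
    ctrl = [10] * 10
    scale, mask = estimate_background_scaling(chip, ctrl)
    print("background scale:", scale)
    print("naive total-read ratio:", sum(chip) / sum(ctrl))
    print("background bins:", mask.tolist())
```

On the toy data, the background-based factor differs from the ratio of total read counts because the two enriched bins inflate the ChIP total. That is the kind of distortion that separating signal from background before normalization is meant to avoid.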