Genetics and Molecular Biology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Translational and Functional Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Bioinformatics. 2017 Sep 1;33(17):2615-2621. doi: 10.1093/bioinformatics/btx276.
Epigenetic data are invaluable for determining the regulatory programs governing a cell. Building on the use of next-generation sequencing data to characterize epigenetic marks and transcription factor binding, numerous peak-calling approaches have been developed to determine sites of genomic significance in these data. Such analyses can produce a large number of false positive predictions, suggesting that sites supported by multiple algorithms provide a stronger foundation for inferring and characterizing regulatory programs associated with the epigenetic data. However, few methodologies integrate the epigenetics-based predictions of multiple approaches when combining profiles generated by different tools.
The SigSeeker peak-calling ensemble uses multiple tools to identify peaks and, with user-defined thresholds for peak overlap and signal strength, retains only those peaks that are concordant across multiple tools. Peaks that are predicted to be co-localized by only a very small number of tools, that overlap only marginally, or that represent significant outliers to the approximation model are removed from the results, providing concise and high-quality epigenetic datasets. SigSeeker has been validated using established benchmarks for transcription factor binding and histone modification ChIP-Seq data. These comparisons indicate that the quality of our ensemble technique exceeds that of single-tool approaches, enhances existing peak-calling ensembles, and results in epigenetic profiles of higher confidence.
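The concordance filter described above can be illustrated with a minimal Python sketch. This is not SigSeeker's implementation: the `Peak` type, the `overlap_fraction` and `concordant_peaks` functions, the parameter names, and the choice to measure overlap relative to the shorter peak are all assumptions made for illustration. The outlier removal against the approximation model is omitted, and signal values from different tools are treated as directly comparable, which real tools do not guarantee.

```python
from typing import Dict, List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    signal: float


def overlap_fraction(a: Peak, b: Peak) -> float:
    """Shared length divided by the length of the shorter peak (assumed metric)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return shared / min(a.end - a.start, b.end - b.start)


def concordant_peaks(
    calls: Dict[str, List[Peak]],
    min_tools: int = 2,
    min_overlap: float = 0.5,
    min_signal: float = 0.0,
) -> List[Peak]:
    """Keep peaks that enough tools agree on, given overlap and signal thresholds."""
    # Drop weak peaks first so they cannot lend support to other peaks.
    strong = {
        tool: [p for p in peaks if p.signal >= min_signal]
        for tool, peaks in calls.items()
    }
    kept = []
    for tool, peaks in strong.items():
        for peak in peaks:
            support = 1  # the tool that called this peak
            for other_tool, other_peaks in strong.items():
                if other_tool == tool:
                    continue
                if any(overlap_fraction(peak, q) >= min_overlap for q in other_peaks):
                    support += 1
            if support >= min_tools:
                kept.append(peak)
    return kept


# Hypothetical calls from three hypothetical tools.
calls = {
    "tool_a": [Peak("chr1", 100, 200, 10.0), Peak("chr1", 1000, 1100, 8.0)],
    "tool_b": [Peak("chr1", 120, 220, 9.0), Peak("chr1", 1090, 1200, 7.0)],
    "tool_c": [Peak("chr1", 110, 190, 12.0)],
}
kept = concordant_peaks(calls, min_tools=2, min_overlap=0.5)
```

In this toy input, the peaks near position 100 are supported by all three tools and are retained, while the two peaks near position 1000 overlap by only 10 bases and are discarded as marginally overlapping. The sketch compares every pair of peaks, which is quadratic; an interval index would be the natural replacement for genome-scale data.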
Supplementary data are available at Bioinformatics online.