Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
Bioinformatics. 2011 Aug 1;27(15):2144-6. doi: 10.1093/bioinformatics/btr354. Epub 2011 Jun 19.
Sequencing-based assays such as ChIP-seq, DNase-seq and MNase-seq have become important tools for genome annotation. In these assays, short sequence reads enriched for loci of interest are mapped to a reference genome to determine their origin. Here, we consider whether false positive peak calls can be caused by a particular type of error in the reference genome: multicopy sequences that have been incorrectly assembled and collapsed into a single copy.
Using sequencing data from the 1000 Genomes Project, we systematically scanned the human genome for regions of high sequencing depth. These regions are highly enriched for erroneously inferred transcription factor binding sites, positions of nucleosomes and regions of open chromatin. We suggest a simple masking procedure to remove these regions and reduce false positive calls.
Files for masking out these regions are available at eqtl.uchicago.edu.
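The two steps described in the abstract, flagging regions of unusually high sequencing depth and then discarding peak calls that fall inside them, can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the fixed-size window representation of depth, the `threshold` value, the half-open `(chromosome, start, end)` interval convention, and the function names `high_depth_regions` and `mask_peaks`. The actual depth criterion and the format of the distributed mask files may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical interval convention: (chromosome, start, end), half-open.
Interval = Tuple[str, int, int]


def high_depth_regions(
    depths: Dict[str, List[float]], window: int, threshold: float
) -> List[Interval]:
    """Merge consecutive windows whose depth exceeds `threshold`.

    `depths` maps each chromosome to the mean read depth of consecutive
    fixed-size windows of length `window` (an assumed input layout).
    """
    regions: List[Interval] = []
    for chrom, values in depths.items():
        start = None  # start coordinate of the currently open region
        for i, depth in enumerate(values):
            if depth > threshold:
                if start is None:
                    start = i * window
            elif start is not None:
                regions.append((chrom, start, i * window))
                start = None
        if start is not None:  # region runs to the end of the chromosome
            regions.append((chrom, start, len(values) * window))
    return regions


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def mask_peaks(peaks: List[Interval], mask: List[Interval]) -> List[Interval]:
    """Keep only the peaks that do not overlap any masked region."""
    return [p for p in peaks if not any(overlaps(p, m) for m in mask)]


# Toy data: most windows have depth near 30, a few are far higher.
toy_depths = {"chr1": [30, 31, 400, 420, 29, 500]}
toy_mask = high_depth_regions(toy_depths, window=1000, threshold=100)

toy_peaks = [
    ("chr1", 100, 300),    # normal-depth region, kept
    ("chr1", 3500, 3700),  # inside a high-depth region, removed
    ("chr1", 4900, 5001),  # touches a high-depth region by 1 bp, removed
    ("chr2", 2500, 2600),  # different chromosome, kept
]
kept_peaks = mask_peaks(toy_peaks, toy_mask)
```

The quadratic overlap check is adequate for a toy example; for genome-scale peak sets, sorting both lists or using an interval tree would be the more practical choice.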