Healy John, Thomas Elizabeth E, Schwartz Jacob T, Wigler Michael
Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
Genome Res. 2003 Oct;13(10):2306-15. doi: 10.1101/gr.1350803. Epub 2003 Sep 15.
We have developed a tool for rapidly determining the number of exact matches of any word within large, internally repetitive genomes or sets of genomes. Thus we can readily annotate any sequence, including the entire human genome, with the counts of its constituent words. We create a Burrows-Wheeler transform of the genome, which, together with auxiliary data structures that facilitate counting, can reside in about one gigabyte of RAM. Our original interest was motivated by oligonucleotide probe design, and we describe a general protocol for defining unique hybridization probes. Our method also has applications for the analysis of genome structure and assembly. We demonstrate the identification of chromosome-specific repeats, and outline a general procedure for finding undiscovered repeats. We also illustrate the changing contents of the human genome assemblies by comparing the annotations built from different genome freezes.
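The abstract does not spell out the auxiliary data structures, so the following is only a minimal sketch of how exact word counts can be read off a Burrows-Wheeler transform, using the standard backward-search technique (as in an FM-index). It is not the authors' implementation: the function names `build_bwt`, `build_index`, `count_matches`, and `annotate` are hypothetical, the suffix sort is naive, and the uncompressed occurrence tables would not fit a human genome in one gigabyte.

```python
from collections import Counter


def build_bwt(text):
    """Return the Burrows-Wheeler transform of text + '$' (naive suffix sort)."""
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G, T
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT is the character preceding each suffix in sorted order.
    return "".join(s[i - 1] for i in order)


def build_index(bwt):
    """Build the two auxiliary tables used for counting (illustrative layout).

    first[c]  : number of characters in the text strictly smaller than c
    occ[c][i] : number of occurrences of c in bwt[:i]
    """
    counts = Counter(bwt)
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_matches(word, first, occ, n):
    """Count exact occurrences of word by backward search; n is len(bwt)."""
    if not word:
        return 0
    lo, hi = 0, n  # half-open range of sorted suffixes matching the suffix of word seen so far
    for ch in reversed(word):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def annotate(sequence, k, first, occ, n):
    """Annotate a sequence with the genome-wide count of each of its k-mers."""
    return [count_matches(sequence[i:i + k], first, occ, n)
            for i in range(len(sequence) - k + 1)]


# Toy usage: the "genome" here is an invented 12-base string.
genome = "ACGTACGTTACG"
bwt = build_bwt(genome)
first, occ = build_index(bwt)
word_counts = annotate("ACGTT", 3, first, occ, len(bwt))
```

Each query costs time proportional to the word length, regardless of genome size. In this sketch, a position whose word count is 1 marks a word that is unique in the indexed sequence, the kind of candidate a probe-design protocol would look for.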