Bouck J, Miller W, Gorrell J H, Muzny D, Gibbs R A
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030 USA.
Genome Res. 1998 Oct;8(10):1074-84. doi: 10.1101/gr.8.10.1074.
The currently favored approach for sequencing the human genome involves selecting representative large-insert clones (100-200 kb), randomly shearing this DNA to construct shotgun libraries, and then sequencing many different isolates from the library. This method, entitled directed random shotgun sequencing, requires highly redundant sequencing to obtain a complete and accurate finished consensus sequence. Recently it has been suggested that a rapidly generated lower redundancy sequence might be of use to the scientific community. Low-redundancy sequencing has been examined previously using simulated data sets. Here we utilize trace data from a number of projects submitted to GenBank to perform reconstruction experiments that mimic low-redundancy sequencing. These low-redundancy sequences have been examined for the completeness and quality of the consensus product, information content, and usefulness for interspecies comparisons. The data presented here suggest three different sequencing strategies, each with different utilities. (1) Nearly complete sequence data can be obtained by sequencing a random shotgun library at sixfold redundancy. This may therefore represent a good point to switch from a random to directed approach. (2) Sequencing can be performed with as little as twofold redundancy to find most of the information about exons, EST hits, and putative exon similarity matches. (3) To obtain contiguity of coding regions, sequencing at three- to fourfold redundancy would be appropriate. From these results, we suggest that a useful intermediate product for genome sequencing might be obtained by three- to fourfold redundancy. Such a product would allow a large amount of biologically useful data to be extracted while postponing the majority of work involved in producing a high quality consensus sequence.
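The redundancy levels discussed in the abstract (twofold, three- to fourfold, sixfold) can be put in rough perspective with the idealized Lander-Waterman model, under which reads land uniformly at random and the expected fraction of a clone covered at redundancy c is 1 - e^(-c). The sketch below is only an illustration of that textbook approximation, not the reconstruction method used in the paper, which relied on real trace data. The function names, the 150 kb clone length (picked from within the 100-200 kb range in the abstract), and the 500 bp read length are assumptions made for the example.

```python
import math


def expected_fraction_covered(redundancy: float) -> float:
    """Idealized Lander-Waterman estimate of the fraction of bases covered.

    Assumes reads start uniformly at random along the clone and ignores
    end effects, cloning bias, and read quality.
    """
    return 1.0 - math.exp(-redundancy)


def reads_needed(clone_length_bp: int, read_length_bp: int, redundancy: float) -> int:
    """Number of reads required to reach a target redundancy.

    Redundancy is total read bases divided by clone length.
    """
    return math.ceil(redundancy * clone_length_bp / read_length_bp)


if __name__ == "__main__":
    clone_length_bp = 150_000  # assumed: within the 100-200 kb range in the abstract
    read_length_bp = 500       # assumed average usable read length

    for c in (2, 3, 4, 6):
        n_reads = reads_needed(clone_length_bp, read_length_bp, c)
        covered = expected_fraction_covered(c)
        print(f"{c}x redundancy: about {n_reads} reads, "
              f"expected coverage {covered:.1%}")
```

Under these assumptions the model predicts roughly 86% of bases covered at twofold redundancy and more than 99% at sixfold. That is broadly in line with the abstract's qualitative picture of most information at twofold and nearly complete sequence at sixfold, although real libraries deviate from the uniform-sampling assumption.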