Kuehl P M, Weisemann J M, Touchman J W, Green E D, Boguski M S
University of Maryland, Department of Molecular and Cellular Biology, Baltimore, Maryland 21201, USA.
Genome Res. 1999 Feb;9(2):189-94.
Ongoing efforts to sequence the human genome are already generating large amounts of data, with substantial increases anticipated over the next few years. In most cases, a shotgun sequencing strategy is being used, which rapidly yields most of the primary sequence as incompletely assembled sequence contigs ("prefinished" sequence) and more slowly produces the final, completely assembled sequence ("finished" sequence). Thus, in general, prefinished sequence is produced in excess of finished sequence, and this trend is expected to continue and even accelerate over the next few years. Even at a prefinished stage, genomic sequence is a rich source of biological information that is of great interest to many investigators. However, analyzing such data is a daunting task, both because of its sheer volume and because it can change from day to day. To facilitate the discovery and characterization of genes and other important elements within prefinished sequence, we have developed an analytical strategy and system that uses readily available software tools in new combinations. Implementing this strategy for the analysis of prefinished sequence data from human chromosome 7 has demonstrated that it is a convenient, inexpensive, and extensible solution to the problem of analyzing the large amounts of preliminary data produced by large-scale sequencing efforts. Our approach is accessible to any investigator who wishes to gather additional information about particular sequence data en route to developing richer annotations of a finished sequence.