Smith T M, Abajian C, Hood L
Department of Molecular Biotechnology, University of Washington, Seattle 98195, USA. T.M.Smith,
Comput Appl Biosci. 1997 Apr;13(2):175-82. doi: 10.1093/bioinformatics/13.2.175.
Genome-scale DNA sequencing is a multistep process in which large numbers of small template clones are propagated, purified, sequenced and analyzed on acrylamide gels. A significant challenge for these projects is the scale at which the data handling must be done. Hence, large-scale sequencing facilities will benefit from tracking template DNA information (purification methods, reaction and electrophoresis conditions) in a systematic fashion. The lack of software tools that support automated sample entry and automatic data storage, retrieval and analysis is a major hindrance to recording and using laboratory workflow information to monitor the overall quality of data production.
The UNIX file system has been used to prototype automation of the flow of data from the ABI sequencer to a data repository. Data are automatically processed by a central Perl program, Hopper, which runs a series of programs that analyze data quality (read length estimate, fraction of indeterminate bases, and number of contaminating and repetitive sequences), assemble shotgun sequence data, and generate simple reports describing the results.
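To make two of the quality measures named above concrete, the following is a minimal sketch of how a read length estimate and the fraction of indeterminate bases might be computed and written to a simple report. It is not Hopper's code: Hopper is a Perl program whose internals are not given here, and every name (`ReadQuality`, `usable_length`, `assess_read`, `report`) and threshold (a 20-base window, at most 2 `N` calls) below is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadQuality:
    name: str
    length: int          # total number of base calls
    n_fraction: float    # fraction of indeterminate (N) base calls
    usable_length: int   # crude read length estimate, see usable_length()


def usable_length(seq: str, window: int = 20, max_n: int = 2) -> int:
    """Return the start position of the first window holding more than
    max_n N calls, or the full length if no such window exists.

    The window size and threshold are illustrative assumptions.
    """
    seq = seq.upper()
    if len(seq) < window:
        # Read is shorter than one window: accept or reject it whole.
        return len(seq) if seq.count("N") <= max_n else 0
    for start in range(len(seq) - window + 1):
        if seq[start:start + window].count("N") > max_n:
            return start
    return len(seq)


def assess_read(name: str, seq: str) -> ReadQuality:
    """Compute the per-read quality measures for one base-called sequence."""
    seq = seq.upper()
    n_fraction = seq.count("N") / len(seq) if seq else 0.0
    return ReadQuality(name, len(seq), n_fraction, usable_length(seq))


def report(reads: dict) -> str:
    """Build a tab-separated report with one line per read."""
    lines = ["read\tlength\tfrac_N\tusable"]
    for name, seq in reads.items():
        q = assess_read(name, seq)
        lines.append(
            f"{q.name}\t{q.length}\t{q.n_fraction:.3f}\t{q.usable_length}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "clean_read": "ACGT" * 10,
        "noisy_tail": "ACGT" * 5 + "N" * 20,
    }
    print(report(example))
```

A pipeline of the kind described would run checks like these on each new file as it arrives from the sequencer, alongside separate screens for contaminating and repetitive sequences, which this sketch does not attempt.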