Liang Chun, Wang Gang, Liu Lin, Ji Guoli, Fang Lin, Liu Yuansheng, Carter Kikia, Webb Jason S, Dean Jeffrey F D
Department of Botany, Miami University, Oxford, Ohio 45056, USA.
BMC Genomics. 2007 May 29;8:134. doi: 10.1186/1471-2164-8-134.
With the advent of low-cost, high-throughput sequencing, the amount of public domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, their inherent deficiencies also present a serious challenge for data quality control and validation, particularly for species without genome sequences.
ConiferEST is an integrated system for data reprocessing, visualization and mining of conifer ESTs. In its current release, Build 1.0, it houses 172,229 loblolly pine EST sequence reads, which were obtained by reprocessing raw DNA sequencer traces with our software, WebTraceMiner. The trace files were downloaded from the NCBI Trace Archive. ConiferEST provides biologists with unique, easy-to-use data visualization and mining tools for a variety of putative sequence features, including cloning vector segments, adapter sequences, restriction endonuclease recognition sites, polyA and polyT runs, and their corresponding Phred quality values. Based on these putative features, verified sequence features such as the 3' and/or 5' termini of cDNA inserts in either the sense or the non-sense strand have been identified in silico. Interestingly, only 30.03% of the designated 3' ESTs were found to have an authenticated 5' terminus in the non-sense strand (i.e., polyT tails), while fewer than 5.34% of the designated 5' ESTs had a verified 5' terminus in the sense strand. Such previously ignored features provide valuable insight for data quality control and validation of error-prone ESTs, as well as the ability to identify novel functional motifs embedded in large EST datasets. We found that "double-termini adapters" were effective indicators of potential EST chimeras. For all sequences with at least one terminus verified in silico, we used InterProScan to assign protein domain signatures, the results of which are available for in-depth exploration through our biologist-friendly web interfaces.
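To make the idea of a putative sequence feature paired with its Phred quality concrete, the following minimal Python sketch locates polyA and polyT runs in a base-called read and reports the mean Phred score over each run. It is an illustration only and not WebTraceMiner's algorithm: the function names `find_runs` and `annotate_runs`, the 10-base minimum run length, and the dictionary layout are all assumptions made for this example.

```python
import re
from statistics import mean


def find_runs(seq, base, min_len=10):
    """Return (start, end) spans of runs of `base` at least `min_len` long.

    `min_len=10` is an arbitrary threshold chosen for illustration.
    """
    pattern = re.compile(re.escape(base) + "{%d,}" % min_len, re.IGNORECASE)
    return [m.span() for m in pattern.finditer(seq)]


def annotate_runs(seq, quals, min_len=10):
    """List putative polyA/polyT features with the mean Phred score of each run.

    `seq` is the base-called read and `quals` holds one Phred score per base.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    features = []
    for base, label in (("A", "polyA"), ("T", "polyT")):
        for start, end in find_runs(seq, base, min_len):
            features.append({
                "type": label,
                "start": start,  # 0-based, inclusive
                "end": end,      # 0-based, exclusive
                "mean_phred": mean(quals[start:end]),
            })
    # Report features in read order rather than grouped by base.
    return sorted(features, key=lambda f: f["start"])


if __name__ == "__main__":
    # Hypothetical read: a polyT stretch near the start, a polyA stretch near the end.
    read = "GATC" + "T" * 12 + "ACGTACGT" + "A" * 11 + "CC"
    phred = [20] * 4 + [30] * 12 + [40] * 8 + [10] * 11 + [40] * 2
    for feature in annotate_runs(read, phred):
        print(feature)
```

Reporting the quality alongside each run matters because a homopolymer stretch called from low-quality trace signal is weaker evidence of a true insert terminus than one called with high confidence; any cutoff applied to `mean_phred` would be a further assumption.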
ConiferEST represents a unique and complementary public resource for EST data integration and mining in conifers by reprocessing raw DNA traces, identifying putative sequence features and determining and annotating in-silico verified features. Seamlessly integrated with other public resources, ConiferEST provides biologists powerful tools to verify data, visualize abnormalities, including EST chimeras, and explore large EST datasets.