Miller R T, Christoffels A G, Gopalakrishnan C, Burke J, Ptitsyn A A, Broveak T R, Hide W A
South African National Bioinformatics Institute, Private Bag X17, Bellville 7535, University of the Western Cape, South Africa.
Genome Res. 1999 Nov;9(11):1143-55. doi: 10.1101/gr.9.11.1143.
The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. Most of the coding portion identified so far is in the form of expressed sequence tags (ESTs). The need to discover the exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable-quality nature of these data. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensus sequences reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information, and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue-level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by constituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs, by a factor of 1.86, than the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole-body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered-release whole-body index build of STACK v.2.0 are presented.
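The clone-ID linking step mentioned above can be illustrated with a minimal sketch. It is not the STACK_PACK implementation. It assumes only that two clusters are linked whenever they contain reads from the same clone, and that linking is transitive. The function name `link_clusters_by_clone`, the cluster labels, and the clone IDs are all hypothetical, and the grouping uses a generic union-find structure chosen for illustration.

```python
from collections import defaultdict


def link_clusters_by_clone(cluster_clones):
    """Group clusters that share at least one clone ID (transitively).

    cluster_clones: dict mapping a cluster label to the set of clone IDs
    annotated on its member ESTs (hypothetical input layout).
    Returns a sorted list of sorted lists of cluster labels.
    """
    # Union-find: every cluster starts as its own group.
    parent = {cluster_id: cluster_id for cluster_id in cluster_clones}

    def find(c):
        # Follow parent pointers to the group root, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller label as the root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Remember the first cluster seen for each clone ID; any later
    # cluster carrying the same clone ID is merged with it.
    first_seen = {}
    for cluster_id, clones in cluster_clones.items():
        for clone in clones:
            if clone in first_seen:
                union(first_seen[clone], cluster_id)
            else:
                first_seen[clone] = cluster_id

    groups = defaultdict(list)
    for cluster_id in cluster_clones:
        groups[find(cluster_id)].append(cluster_id)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy input: five clusters, two of which pair up via shared clones.
example = {
    "cl1": {"cloneA", "cloneB"},
    "cl2": {"cloneB"},
    "cl3": {"cloneC"},
    "cl4": {"cloneC", "cloneD"},
    "cl5": {"cloneE"},
}
linked = link_clusters_by_clone(example)
print(linked)
```

In this toy input, five clusters collapse into three linked groups. This mirrors, in miniature, how clone-ID linking reduces the number of assemblies relative to the number of multiple sequence clusters.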