A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base.

Authors

Miller R T, Christoffels A G, Gopalakrishnan C, Burke J, Ptitsyn A A, Broveak T R, Hide W A

Affiliation

South African National Bioinformatics Institute, Private Bag X17, Bellville 7535, University of the Western Cape, South Africa.

Publication

Genome Res. 1999 Nov;9(11):1143-55. doi: 10.1101/gr.9.11.1143.

Abstract

The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented.
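
The abstract mentions that clusters are linked through clone-ID annotations but does not describe how. As an illustration only, the sketch below shows one generic way such linking could work: clusters that share a clone ID are merged with a union-find structure. The input format, the function name `link_clusters_by_clone`, and the example identifiers are all hypothetical; this is not the STACK_PACK implementation.

```python
from collections import defaultdict


def link_clusters_by_clone(cluster_clones):
    """Group clusters that share at least one clone ID.

    cluster_clones maps a cluster name to the set of clone IDs annotated
    on its member ESTs (a hypothetical input format, assumed here).
    Returns a list of sets of cluster names, one set per linked group.
    """
    parent = {name: name for name in cluster_clones}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Record the first cluster seen for each clone ID; any later cluster
    # carrying the same clone ID is merged into that cluster's group.
    first_seen = {}
    for name, clones in cluster_clones.items():
        for clone in clones:
            if clone in first_seen:
                parent[find(name)] = find(first_seen[clone])
            else:
                first_seen[clone] = name

    groups = defaultdict(set)
    for name in cluster_clones:
        groups[find(name)].add(name)
    return list(groups.values())


# Made-up example: two clusters share clone_2, a third stands alone.
example = {
    "cluster_A": {"clone_1", "clone_2"},
    "cluster_B": {"clone_2"},
    "cluster_C": {"clone_3"},
}
linked = link_clusters_by_clone(example)
```

In this toy input, `cluster_A` and `cluster_B` end up in one linked group because both carry `clone_2`, while `cluster_C` remains on its own.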
