A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base.

Authors

Miller R T, Christoffels A G, Gopalakrishnan C, Burke J, Ptitsyn A A, Broveak T R, Hide W A

Affiliation

South African National Bioinformatics Institute, Private Bag X17, Bellville 7535, University of the Western Cape, South Africa.

Publication

Genome Res. 1999 Nov;9(11):1143-55. doi: 10.1101/gr.9.11.1143.

DOI:10.1101/gr.9.11.1143

PMID:10568754

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC310831/

Abstract

The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented.
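The abstract reports its coverage figures as rounded percentages of the 1,313,103 input ESTs. As a rough illustration only, the sketch below converts those percentages into approximate EST counts. The function name `implied_count` and the variable names are hypothetical, and because the published percentages are rounded, the resulting counts are estimates rather than figures reported by the authors.

```python
# Approximate EST counts implied by the rounded percentages in the abstract.
# All names here are illustrative; the outputs are estimates, not published values.

TOTAL_ESTS = 1_313_103          # Homo sapiens ESTs in the input set (from the abstract)
PCT_IN_MULTI_SEQ_CLUSTERS = 64  # percent condensed into tissue-level multiple sequence clusters
PCT_IN_ANY_STACK_CLUSTER = 81   # percent captured in a multiple sequence or linked cluster


def implied_count(total: int, percent: int) -> int:
    """Return the number of items implied by a whole-number percentage of total."""
    return round(total * percent / 100)


in_multi_seq = implied_count(TOTAL_ESTS, PCT_IN_MULTI_SEQ_CLUSTERS)
in_any_cluster = implied_count(TOTAL_ESTS, PCT_IN_ANY_STACK_CLUSTER)
not_captured = TOTAL_ESTS - in_any_cluster

print(f"ESTs in multiple sequence clusters (approx.): {in_multi_seq:,}")
print(f"ESTs in any STACK cluster (approx.):          {in_any_cluster:,}")
print(f"ESTs not captured (approx.):                  {not_captured:,}")
```

Under these assumptions, roughly 840,000 ESTs fall into multiple sequence clusters and roughly 250,000 remain outside any multiple sequence or linked cluster.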
