Computational Analysis of Gene Identification with SAGE.

Author information

Clark Terry, Lee Sanggyu, Ridgway Scott L, Wang San Ming

Affiliation

Department of Computer Science, The University of Chicago, Chicago, IL 60637, USA.

Publication information

J Comput Biol. 2002;9(3):513-26. doi: 10.1089/106652702760138600.

PMID:12162890

Abstract

SAGE is one of the few techniques capable of uniformly probing gene expression at a genome level irrespective of mRNA abundance and without a priori knowledge of the transcripts present. However, an individual SAGE tag can match many sequences in the reference database, which complicates gene identification. We perform a baseline evaluation of gene identification with SAGE, using UniGene Human as the reference database, by analyzing 1) the distributions of tags in tag sets of various lengths formed from UniGene Human and 2) the tag-to-sequence mapping for a SAGE tag set of 37,522 tags derived from human myeloid cells. The extensive multiplicity of the dbEST component of UniGene significantly detracts from the gains that might be expected from extending tags within the scope of the SAGE protocol. To achieve reasonable sequence specificity for gene identification with the content of the commonly used UniGene sequence collection, tags on the order of hundreds of bases in length are required. One way to produce tags of such lengths is GLGI, which extends SAGE tags to the 3' end of the cDNA. We show that the longer sequences produced by GLGI significantly relieve the multiple-match condition. In the myeloid sample, we also found a correlation between multiple-match severity and high copy number. We extrapolate from these findings to provide insights into the use of UniGene Human as a reference for gene identification.
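The multiple-match condition described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the usual SAGE convention of taking the bases immediately 3' of the last CATG (NlaIII) site in a transcript, a default tag length of 10 bases, and three made-up toy sequences; the function names `extract_tag`, `tag_multiplicity`, and `fraction_unique` are hypothetical.

```python
from collections import defaultdict

# Assumption: NlaIII (recognition site CATG) is the anchoring enzyme.
ANCHOR = "CATG"


def extract_tag(seq, tag_len=10):
    """Return the tag_len bases after the 3'-most anchor site, or None."""
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None  # no anchor site, so no tag can be formed
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_len]
    # Discard tags that run off the end of the sequence.
    return tag if len(tag) == tag_len else None


def tag_multiplicity(sequences, tag_len=10):
    """Map each tag to the set of sequence ids that produce it."""
    hits = defaultdict(set)
    for seq_id, seq in sequences.items():
        tag = extract_tag(seq, tag_len)
        if tag is not None:
            hits[tag].add(seq_id)
    return dict(hits)


def fraction_unique(sequences, tag_len):
    """Fraction of distinct tags that map to exactly one sequence."""
    hits = tag_multiplicity(sequences, tag_len)
    if not hits:
        return 0.0
    unique = sum(1 for ids in hits.values() if len(ids) == 1)
    return unique / len(hits)


# Toy reference set (invented for illustration, not UniGene data).
# seqA and seqB share the first 10 bases after CATG and diverge later.
toy = {
    "seqA": "GGACCATGAAACCCGGGTTTACGTAC",
    "seqB": "TTTTCATGAAACCCGGGTAAACGTAC",
    "seqC": "CCCCCATGTTTGGGAAACCCTTTAGG",
}

for length in (10, 14):
    print(length, fraction_unique(toy, length))
```

In this toy set the 10-base tag `AAACCCGGGT` maps to both `seqA` and `seqB`, while 14-base tags separate all three sequences. That is only a property of the invented data: the abstract reports that, against the real UniGene collection, extensions within the scope of the SAGE protocol gain little and tags on the order of hundreds of bases are needed.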