Evaluation of EST-data using the genome assembly.

Author information

Murray Christian G, Larsson Thomas P, Hill Tobias, Björklind Rikard, Fredriksson Robert, Schiöth Helgi B

Affiliation

Department of Neuroscience, Uppsala University, BMC Box 539, 751 24 Uppsala, Sweden.

Publication information

Biochem Biophys Res Commun. 2005 Jun 17;331(4):1566-76. doi: 10.1016/j.bbrc.2005.04.070.

DOI:10.1016/j.bbrc.2005.04.070

PMID:15883052

Abstract

Using expressed sequence tag (EST) data for genome-wide studies requires a thorough understanding of the problems involved in handling these sequences. We investigated how EST clustering performs when the genome is used as guidance, compared with pairwise sequence alignment methods. We show that clustering with the genome as a template outperforms the sequence similarity methods used to create other EST clusters, such as the UniGene set, with respect to the extent to which ESTs originating from the same transcriptional unit are separated into disjoint clusters. Using our approach, approximately 80% of the RefSeq genes were represented by a single EST cluster and 20% by two or more EST clusters. In contrast, approximately 25% of all RefSeq genes were found to be represented by a single cluster with the UniGene clustering method. The approach minimizes the risk of overestimation caused by multiple disjoint clusters originating from the same transcript. We also investigated the quality of EST data by aligning ESTs to the genome. The results show how many ESTs are not adequately trimmed with respect to vector sequences and low-quality regions. Moreover, we identified important problems related to ESTs aligned to the genome using BLAT, such as inferring splice junctions, and explained this aspect by simulations with synthetic data. EST clusters created with the method are available upon request from the authors.
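
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea of genome-guided clustering, not the authors' implementation. It assumes each EST has already been aligned to the genome (for example with BLAT) and reduced to a single genomic span. ESTs whose spans overlap on the same chromosome and strand are placed in the same cluster. The `Alignment` record, the function name `cluster_by_genome_overlap`, and the example coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """One EST-to-genome alignment reduced to its genomic span (hypothetical record)."""
    est_id: str
    chrom: str
    strand: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def cluster_by_genome_overlap(alignments):
    """Group ESTs whose genomic spans overlap on the same chromosome and strand.

    Assumption: overlap of genomic spans is used as the clustering criterion.
    The paper may use a different or stricter rule.
    """
    # Bucket alignments by (chromosome, strand) so only co-located ESTs are compared.
    by_locus = defaultdict(list)
    for aln in alignments:
        by_locus[(aln.chrom, aln.strand)].append(aln)

    clusters = []
    for key in sorted(by_locus):
        current, current_end = [], None
        # Sweep left to right; extend the open cluster while spans keep overlapping.
        for aln in sorted(by_locus[key], key=lambda a: (a.start, a.end)):
            if current and aln.start < current_end:
                current.append(aln.est_id)
                current_end = max(current_end, aln.end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [aln.est_id], aln.end
        if current:
            clusters.append(current)
    return clusters


# Made-up coordinates for illustration only.
example = [
    Alignment("est1", "chr1", "+", 100, 500),
    Alignment("est2", "chr1", "+", 450, 900),
    Alignment("est3", "chr1", "+", 2000, 2400),
    Alignment("est4", "chr1", "-", 120, 480),
]
clusters = cluster_by_genome_overlap(example)
print(clusters)
```

In this toy input, `est1` and `est2` overlap on the plus strand and are merged, while `est3` (no overlap) and `est4` (opposite strand) each form their own cluster. Under this simple rule, ESTs from the same gene whose genomic spans do not overlap would still end up in separate clusters.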
