Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada.
BMC Mol Biol. 2010 Dec 10;11:96. doi: 10.1186/1471-2199-11-96.
Despite extensive efforts to predict protein-coding genes in genome sequences, in all sequenced eukaryote genomes many bona fide genes remain undiscovered and many existing gene models are inaccurate. This is partly because gene prediction programs have been built on an incomplete understanding of gene features such as splicing and promoter characteristics. In addition, full-length cDNAs of many genes and their isoforms are hard to obtain because they are expressed at low levels or only rarely. Alternative approaches are therefore required to obtain full-length sequences of all protein-coding genes.
In this project, we developed sequence tag-based amplification of cDNA ends (STACE), a method for reconstructing full-length cDNA sequences from short expressed sequence tags. Expressed tags serve as anchors for retrieving full-length transcripts in two rounds of PCR amplification. We demonstrated STACE by reconstructing full-length cDNA sequences from expressed tags mined from an array of C. elegans serial analysis of gene expression (SAGE) libraries. STACE recovered sequence information for 12 genes, including isoforms for two of them, and yielded full-length cDNA sequences for seven of the 12.
The STACE method can effectively reconstruct full-length cDNA sequences of genes that are under-represented in cDNA sequencing projects and missed by existing gene prediction methods, but whose existence is suggested by short sequence tags such as SAGE tags.