Rivals Eric, Boureux Anthony, Lejeune Mireille, Ottones Florence, Pérez Oscar Pecharromàn, Tarhio Jorma, Pierrat Fabien, Ruffle Florence, Commes Thérèse, Marti Jacques
Laboratoire d'Informatique, de Robotique et de Microélectronique, UMR 5506 CNRS-Université de Montpellier II, 161 rue Ada, 34392 Montpellier 05, France.
Nucleic Acids Res. 2007;35(17):e108. doi: 10.1093/nar/gkm495. Epub 2007 Aug 20.
Analysis of several million expressed gene signatures (tags) has revealed a growing number of distinct sequences, far exceeding the number of annotated genes in mammalian genomes. Serial analysis of gene expression (SAGE) can reveal new poly(A) RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to unambiguously identify unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restriction sites of cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between these tagged sites. Having developed a new algorithm to locate these tag-delimited genomic sequences (TDGS), we first validated its capacity to recognize known genes and its ability to reveal new transcripts with two SAGE libraries built in parallel from a single RNA sample. Our algorithm proves fast enough to test this strategy on a large scale. We then collected and processed the complete sets of human SAGE tags to predict as yet unknown transcripts. A cross-validation with tiling array data shows that 47% of these TDGS overlap transcriptionally active regions. Our method provides a new and complementary approach for complex transcriptome annotation.
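The idea of a tag-delimited genomic sequence can be illustrated with a minimal sketch. It is not the algorithm described in the paper, whose details the abstract does not give. It assumes exact matching on a single strand, tags that already include their anchoring restriction site, and a hypothetical `max_span` limit on the distance between the two tags. The names `find_all` and `locate_tdgs` are illustrative only.

```python
from typing import Iterator, List, Tuple


def find_all(text: str, pattern: str) -> Iterator[int]:
    """Yield every start index of pattern in text, overlapping matches included."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def locate_tdgs(
    genome: str,
    upstream_tag: str,
    downstream_tag: str,
    max_span: int = 5000,  # hypothetical cutoff, not a value from the paper
) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for each candidate region that begins
    with upstream_tag and ends with downstream_tag on the given strand.

    Naive exact search: a genome-scale tool would need an index, and would
    also have to handle the reverse strand and ambiguous tag locations.
    """
    downstream_hits = list(find_all(genome, downstream_tag))
    regions = []
    for start in find_all(genome, upstream_tag):
        for hit in downstream_hits:
            end = hit + len(downstream_tag)
            # The second tag must lie entirely downstream of the first,
            # and the whole region must fit within max_span.
            if hit >= start + len(upstream_tag) and end - start <= max_span:
                regions.append((start, end, genome[start:end]))
    return regions


if __name__ == "__main__":
    # Toy sequence with made-up tags, for illustration only.
    tag_a = "CATG" + "T" * 10
    tag_b = "GATC" + "G" * 10
    toy_genome = "AAA" + tag_a + "CCCC" + tag_b + "AAA"
    for start, end, seq in locate_tdgs(toy_genome, tag_a, tag_b):
        print(start, end, seq)
```

Requiring both tags to match in tandem is what narrows the candidate locations: a single short tag may occur at many sites in a large genome, whereas a pair of tags within a bounded distance is far less likely to occur by chance.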