Unexpected observations after mapping LongSAGE tags to the human genome.

Author information

Keime Céline, Sémon Marie, Mouchiroud Dominique, Duret Laurent, Gandrillon Olivier

Affiliations

Université de Lyon, Lyon, France.

Publication information

BMC Bioinformatics. 2007 May 15;8:154. doi: 10.1186/1471-2105-8-154.

Abstract

BACKGROUND

SAGE has been used widely to study the expression of known transcripts, but much less to annotate new transcribed regions. LongSAGE produces tags that are sufficiently long to be reliably mapped to a whole-genome sequence. Here we used this property to study the position of human LongSAGE tags obtained from all public libraries. We focused mainly on tags that do not map to known transcripts.
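The claim that LongSAGE tags are long enough to be mapped reliably can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not from the paper: it assumes a 21-bp LongSAGE tag versus a 14-bp conventional SAGE tag, a genome of about 3.1 billion bp searched on both strands, and uniformly random base composition. The function name `expected_random_matches` is hypothetical.

```python
def expected_random_matches(tag_length: int, genome_size: float = 3.1e9) -> float:
    """Expected number of chance occurrences of one tag in a random genome.

    Assumptions (illustrative only): bases are independent and equally
    frequent, and both strands are searched, giving about 2 * genome_size
    candidate start positions.
    """
    return 2 * genome_size / 4 ** tag_length


if __name__ == "__main__":
    for name, length in [("SAGE, 14 bp", 14), ("LongSAGE, 21 bp", 21)]:
        print(f"{name}: {expected_random_matches(length):.3g} chance matches expected")
```

Under these simplifying assumptions a 14-bp tag is expected to occur by chance many times in the genome, whereas a 21-bp tag is expected to occur far less than once, which is why a unique genomic match for a LongSAGE tag is informative. Real genomes contain repeats and biased composition, so actual uniqueness is lower than this estimate suggests.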

RESULTS

Using a published error rate for SAGE libraries, we first removed the tags likely to result from sequencing errors. We then observed that an unexpectedly large number of the remaining tags still did not match the genome sequence. Some of these correspond to parts of human mRNAs, such as polyA tails, junctions between two exons, and polymorphic regions of transcripts. Another non-negligible proportion can be attributed to contamination by murine transcripts and to residual sequencing errors. After filtering our data with these screens to ensure that the dataset was highly reliable, we studied the tags that map once to the genome. Of these tags, 31% correspond to unannotated transcripts. The others map to known transcribed regions, but many of them (nearly half) are located either in antisense orientation to these known transcripts or in new variants of them.
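The abstract does not spell out how tags likely to result from sequencing errors were identified. One common heuristic, sketched here purely as an assumption and not as the authors' procedure, flags a rare tag when it differs by a single substitution from a much more abundant tag whose count is large enough to explain it. The function name `likely_error_tags` and the default `per_base_error` value are hypothetical placeholders; the published error rate used in the study is not given in this abstract.

```python
def likely_error_tags(counts: dict[str, int], per_base_error: float = 0.01) -> set[str]:
    """Flag tags that could plausibly be sequencing errors of an abundant neighbour.

    counts: tag sequence -> observed count (all libraries pooled).
    per_base_error: assumed per-base substitution rate (placeholder value).
    """
    flagged = set()
    for tag, observed in counts.items():
        for position, base in enumerate(tag):
            for alternative in "ACGT":
                if alternative == base:
                    continue
                neighbour = tag[:position] + alternative + tag[position + 1:]
                parent_count = counts.get(neighbour, 0)
                # Expected copies of `tag` created by this one specific
                # substitution in the neighbour (3 possible wrong bases).
                expected_from_error = parent_count * per_base_error / 3
                if parent_count > observed and expected_from_error >= observed:
                    flagged.add(tag)
    return flagged


if __name__ == "__main__":
    abundant = "CATG" + "A" * 17
    example = {
        abundant: 1000,
        abundant[:-1] + "C": 1,   # one substitution away from the abundant tag
        "CATG" + "T" * 17: 5,     # no abundant neighbour
    }
    print(likely_error_tags(example))
```

A filter of this kind only removes tags that sit next to an abundant tag; it cannot catch errors in tags from weakly expressed transcripts, which is consistent with the abstract's remark that residual sequencing errors remained after the first screen.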

CONCLUSION

We performed a comprehensive study of all publicly available human LongSAGE tags, and carefully verified the reliability of these data. We found the potential origin of many tags that did not match the human genome sequence. The properties of the remaining tags imply that the level of sequencing error may have been underestimated. The frequency of tags that match the genome sequence once, but not in an annotated exon, suggests that the human transcriptome is much more complex than shown by the current human genome annotations, with many new splicing variants and antisense transcripts. SAGE data are appropriate for mapping new transcripts to the genome, as demonstrated by the high rate of cross-validation of the corresponding tags by other methods.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5691/1884178/788fa7943fa2/1471-2105-8-154-1.jpg
