SPA: a probabilistic algorithm for spliced alignment.

Authors

van Nimwegen Erik, Paul Nicodeme, Sheridan Robert, Zavolan Mihaela

Affiliation

Biozentrum, University of Basel, Basel, Switzerland.

Publication information

PLoS Genet. 2006 Apr;2(4):e24. doi: 10.1371/journal.pgen.0020024. Epub 2006 Apr 28.

Abstract

Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5' and 3' ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. 
Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personal/nimwegen/cgi-bin/spa.cgi.
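
The trade-off the abstract describes, between a prior over gene structures and a likelihood for sequencing errors, can be illustrated with a toy scoring function. The sketch below is not SPA's implementation: the function names, the exponential intron-length prior, the geometric indel-length model, and every numeric value are assumptions chosen only to show how a log prior and a log likelihood combine into one score that ranks candidate alignments.

```python
import math

# Every value below is an illustrative assumption, not a parameter estimated by SPA.
SPLICE_PRIOR = {("GT", "AG"): 0.98, ("GC", "AG"): 0.01, ("AT", "AC"): 0.005}
OTHER_SPLICE_PRIOR = 1e-4     # prior for any other donor/acceptor pair
MEAN_INTRON_LENGTH = 5000.0   # mean of a hypothetical exponential length prior
MISMATCH_RATE = 0.005         # per-base misincorporation probability
INDEL_OPEN_RATE = 0.001       # probability that an indel starts at a base
INDEL_EXTEND = 0.3            # geometric parameter for indel length


def log_prior(introns):
    """Log prior of a gene structure given as (length, donor, acceptor) tuples."""
    total = 0.0
    for length, donor, acceptor in introns:
        total += math.log(SPLICE_PRIOR.get((donor, acceptor), OTHER_SPLICE_PRIOR))
        # Exponential density over intron length (stand-in for an empirical one).
        total += -math.log(MEAN_INTRON_LENGTH) - length / MEAN_INTRON_LENGTH
    return total


def log_likelihood(matches, mismatches, indel_lengths):
    """Log probability of the observed cDNA bases under the toy error model."""
    total = matches * math.log(1.0 - MISMATCH_RATE - INDEL_OPEN_RATE)
    total += mismatches * math.log(MISMATCH_RATE)
    for k in indel_lengths:
        # P(indel of length k) = open * (1 - extend) * extend**(k - 1)
        total += math.log(INDEL_OPEN_RATE) + math.log(1.0 - INDEL_EXTEND)
        total += (k - 1) * math.log(INDEL_EXTEND)
    return total


def log_posterior(introns, matches, mismatches, indel_lengths):
    """Unnormalized log posterior: log prior plus log likelihood."""
    return log_prior(introns) + log_likelihood(matches, mismatches, indel_lengths)


# Two hypothetical placements of the same splice boundary:
# A keeps a canonical GT/AG intron and explains one base as a sequencing error;
# B matches every base but needs a non-canonical CT/AC intron.
score_a = log_posterior([(1200, "GT", "AG")], matches=499, mismatches=1, indel_lengths=[])
score_b = log_posterior([(1201, "CT", "AC")], matches=500, mismatches=0, indel_lengths=[])
print("prefer A" if score_a > score_b else "prefer B")
```

Under these assumed parameters, one mismatch costs less than a rare splice boundary, so candidate A scores higher. An ad hoc score that only counts mismatches would prefer B, which is the kind of error near splice boundaries that the abstract attributes to existing aligners.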

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/199c/1449883/482094a5fbd3/pgen.0020024.g001.jpg
