Incorporation of splice site probability models for non-canonical introns improves gene structure prediction in plants.

Author information

Sparks Michael E, Brendel Volker

Affiliation

Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA.

Publication information

Bioinformatics. 2005 Nov 1;21 Suppl 3:iii20-30. doi: 10.1093/bioinformatics/bti1205.

Abstract

MOTIVATION

The vast majority of introns in protein-coding genes of higher eukaryotes have a GT dinucleotide at their 5'-terminus and an AG dinucleotide at their 3' end. About 1-2% of introns are non-canonical, with the most abundant subtype of non-canonical introns being characterized by GC and AG dinucleotides at their 5'- and 3'-termini, respectively. Most current gene prediction software, whether based on ab initio or spliced alignment approaches, does not include explicit models for non-canonical introns or may exclude their prediction altogether. With present amounts of genome and transcript data, it is now possible to apply statistical methodology to non-canonical splice site prediction. We pursued one such approach and describe the training and implementation of GC-donor splice site models for Arabidopsis and rice, with the goal of exploring whether specific modeling of non-canonical introns can enhance gene structure prediction accuracy.
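The intron classes named above are defined purely by their terminal dinucleotides, which can be illustrated with a minimal Python sketch. This is only an illustration of the GT-AG versus GC-AG distinction, not the Bayesian splice site models the authors trained; the function name `classify_intron` and the example sequences are hypothetical.

```python
def classify_intron(intron_seq: str) -> str:
    """Label an intron by its 5'- and 3'-terminal dinucleotides.

    Hypothetical helper for illustration only; it inspects the two
    terminal dinucleotides and does not score splice site strength.
    """
    # Normalize case and accept RNA input by mapping U to T.
    seq = intron_seq.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to have two terminal dinucleotides")

    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        # The most abundant non-canonical subtype described in the abstract.
        return "non-canonical GC-AG"
    return f"other non-canonical {donor}-{acceptor}"


if __name__ == "__main__":
    # Made-up sequences, not taken from Arabidopsis or rice.
    for example in ("GTAAGTTTTCAG", "GCAAGTTTTCAG", "ATATCCTTTTAC"):
        print(example, "->", classify_intron(example))
```

A predictor restricted to the first branch can never report a GC-AG intron, which is the gap that explicit GC-donor models are meant to close.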

RESULTS

Our results indicate that the incorporation of non-canonical splice site models yields dramatic improvements in annotating genes containing GC-AG and AT-AC non-canonical introns. Comparison of models shows differences between monocot and dicot species, but also suggests GC intron-specific biases independent of taxonomic clade. We also present evidence that GC-AG introns occur preferentially in genes with atypically high exon counts.

AVAILABILITY

Source code for the updated versions of GeneSeqer and SplicePredictor (distributed with the GeneSeqer code) is available at http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html. Web servers for Arabidopsis, rice and other plant species are accessible at http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/AtGDBgs.cgi, http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/OsGDBgs.cgi and http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/PlantGDBgs.cgi, respectively. A SplicePredictor web server is available at http://bioinformatics.iastate.edu/cgi-bin/sp.cgi. Software to generate training data and parameterizations for Bayesian splice site models is available at http://gremlin1.gdcb.iastate.edu/~volker/SB05B/BSSM4GSQ/.

