Krause Lutz, Diaz Naryttza N, Bartels Daniela, Edwards Robert A, Pühler Alfred, Rohwer Forest, Meyer Folker, Stoye Jens
Bielefeld University, Center for Biotechnology (CeBiTec), D-33594 Bielefeld, Germany.
Bioinformatics. 2006 Jul 15;22(14):e281-9. doi: 10.1093/bioinformatics/btl247.
Novel sequencing techniques can give access to organisms that are difficult to cultivate using conventional methods. When applied to environmental samples, however, the resulting data have several drawbacks, e.g. short assembled contigs, in-frame stop codons and frame shifts. Current gene finders cannot cope with these difficulties. At the same time, automated gene prediction is a prerequisite for keeping pace with the growing volume of genomic sequence data, and therefore for progress in metagenomics.
We introduce a novel gene finding algorithm designed to cope with the short length of contigs assembled from environmental data, as well as with in-frame stop codons and frame shifts in bacterial sequences. The results show that, by searching for sequence similarities within an environmental sample, our algorithm can detect a high fraction of the sample's gene content, with the fraction depending on the species composition and the overall size of the sample. The method is valuable for finding novel, unknown genes that may be specific to the habitat from which the sample was taken. Finally, we show that our algorithm can even exploit the limited information contained in the short reads generated by 454 technology to predict protein coding genes.
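As an illustration only, and not the algorithm described above, similarity-based gene finding on short or error-containing sequences is commonly built on a six-frame translation, because the correct reading frame is not known in advance and a frame shift can move a gene from one frame to another. The following minimal sketch assumes the standard codon assignments (shared by the bacterial genetic code); the function names `reverse_complement`, `translate`, `six_frame_translations` and `stop_free_segments` are hypothetical and chosen for this example.

```python
# Illustrative sketch: six-frame translation of a short DNA read.
# Assumption: standard codon assignments; all names here are hypothetical.

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Map every codon (first, second, third base in TCAG order) to its amino acid.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def translate(seq):
    """Translate codon by codon; '*' marks a stop, 'X' an ambiguous codon."""
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rev[offset:])
    return frames


def stop_free_segments(protein, min_len=20):
    """Return (start, segment) pairs for stop-free stretches >= min_len."""
    segments = []
    start = 0
    for piece in protein.split("*"):
        if len(piece) >= min_len:
            segments.append((start, piece))
        start += len(piece) + 1  # skip the piece and the stop symbol
    return segments


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGAAATAGC").items()):
        print(frame, protein)
```

In such a sketch, the translated frames (or their stop-free segments) would be the input to a protein-level similarity search. Tolerating in-frame stop codons would then amount to not discarding a frame merely because it contains `*`; how the published method handles this is not specified in the abstract.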
The program is freely available upon request.