Département d'Informatique et de Recherche Opérationnelle (DIRO), Université de Montréal, Montréal, QC CP 6128, Canada.
School of Computer Science, McGill University, McConnell Engineering Bldg., Montréal, QC H3A 0E9, Canada.
Bioinformatics. 2017 Nov 1;33(21):3331-3339. doi: 10.1093/bioinformatics/btx421.
Codon reassignments have been reported across all domains of life. With the increasing number of sequenced genomes, systematic approaches for genetic code detection are essential for accurate downstream analyses. Three automated prediction tools exist so far: FACIL, GenDecoder and Bagheera; the latter two are restricted, respectively, to metazoan mitochondrial genomes and to CUG reassignments in yeast nuclear genomes. These tools can only analyze a single genome at a time, and their predictions are often not followed by a validation procedure, resulting in a high rate of false positives.
We present CoreTracker, a new algorithm for the inference of sense-to-sense codon reassignments. CoreTracker identifies potential codon reassignments in a set of related genomes, then uses statistical evaluations and a random forest classifier to predict those that are most likely to be correct. Predicted reassignments are then validated through a phylogeny-aware step that evaluates the impact of the new genetic code on the protein alignment. Handling a set of genomes simultaneously in a phylogenetic framework makes it possible to trace back the evolution of each reassignment, which provides information on its underlying mechanism. Applied to metazoan and yeast genomes, CoreTracker significantly outperforms existing methods on both precision and sensitivity.
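The general idea of screening a set of aligned genomes for candidate reassignments, and then checking whether a proposed code change improves the protein alignment, can be illustrated with a minimal Python sketch. This is not CoreTracker's implementation: the function names, the thresholds, the majority-vote consensus and the toy data are all assumptions made for illustration, and the statistical tests, the random forest classifier and the phylogeny-aware validation are not reproduced here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def candidate_reassignments(codon_columns, min_count=3, min_fraction=0.8):
    """Flag (genome, codon) pairs whose aligned positions mostly carry a
    consensus amino acid that differs from the standard translation.

    codon_columns maps a genome name to its codons, one per alignment
    column (hypothetical input layout; gaps such as '---' are skipped).
    """
    genomes = list(codon_columns)
    n_cols = len(codon_columns[genomes[0]])
    tallies = defaultdict(Counter)  # (genome, codon) -> consensus residue counts
    for col in range(n_cols):
        for genome in genomes:
            codon = codon_columns[genome][col]
            if codon not in STANDARD_CODE:
                continue
            # Consensus residue of the *other* genomes in this column.
            others = [
                STANDARD_CODE[codon_columns[other][col]]
                for other in genomes
                if other != genome and codon_columns[other][col] in STANDARD_CODE
            ]
            if others:
                consensus = Counter(others).most_common(1)[0][0]
                tallies[(genome, codon)][consensus] += 1
    candidates = []
    for (genome, codon), counts in tallies.items():
        residue, n = counts.most_common(1)[0]
        total = sum(counts.values())
        if (residue != STANDARD_CODE[codon]
                and n >= min_count and n / total >= min_fraction):
            candidates.append((genome, codon, STANDARD_CODE[codon], residue, n, total))
    return sorted(candidates)


def identical_column_fraction(codon_columns, overrides=None):
    """Fraction of columns in which all genomes translate to the same residue.

    overrides maps (genome, codon) to a reassigned amino acid.
    """
    overrides = overrides or {}
    genomes = list(codon_columns)
    n_cols = len(codon_columns[genomes[0]])
    identical = 0
    for col in range(n_cols):
        residues = {
            overrides.get((g, codon_columns[g][col]),
                          STANDARD_CODE.get(codon_columns[g][col], "-"))
            for g in genomes
        }
        identical += len(residues) == 1
    return identical / n_cols


# Toy data: genome_C uses CTG where the other genomes have serine codons.
toy_columns = {
    "genome_A": ["ATG", "TCT", "AAA", "TCA", "GGT", "AGC", "CTG"],
    "genome_B": ["ATG", "TCC", "AAA", "TCA", "GGT", "TCT", "CTG"],
    "genome_C": ["ATG", "CTG", "AAG", "CTG", "GGT", "CTG", "TTA"],
}
result = candidate_reassignments(toy_columns)
before = identical_column_fraction(toy_columns)
after = identical_column_fraction(toy_columns, {("genome_C", "CTG"): "S"})
print(result, before, after)
```

In this toy example the only candidate is CTG in `genome_C` (standard Leu, consensus Ser), and applying that reassignment raises the fraction of identical columns from 4/7 to 1. Real data would require proper alignments, statistical testing and the classifier described above before any prediction is accepted.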
CoreTracker is written in Python and available at https://github.com/UdeM-LBIT/CoreTracker.
Supplementary data are available at Bioinformatics online.