Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA.
Tuberculosis (Edinb). 2013 Jan;93(1):18-25. doi: 10.1016/j.tube.2012.11.012. Epub 2012 Dec 26.
Identification and correction of incorrect open reading frame (ORF) start sites is important for a variety of experimental and analytical purposes, ranging from cloning to inference of operon structure. The genome of the H37Rv reference strain of Mycobacterium tuberculosis (Mtb) was originally annotated when it was first sequenced nearly 15 years ago. While this annotation has served the TB research community well as a standard of reference for over a decade, it has been demonstrated experimentally that the actual start sites for an estimated 5-10% of ORFs differ from the annotation. In this paper, we present a comprehensive bioinformatic analysis of all 3989 ORFs in the M. tuberculosis H37Rv genome. Our method combines information from comparative analysis (alignment to start sites of orthologs in other Actinobacteria), sequence conservation, "protein likeness", putative ribosome binding sites, and other data to identify translational start sites. The features are combined in a linear model that is trained on a dataset of known start sites verified by mass spectrometry, with a cross-validated accuracy of 94%. The method can be viewed as an augmentation of Hidden Markov Model-based tools such as Glimmer and GeneMark: it incorporates more information than just the raw genomic sequence to decide which position is the legitimate translational start site for each ORF. Using this analysis, we identify 269 genes that most likely need to be re-annotated, and identify the best alternative translational start site for each. These revised ORF definitions could be used in the reannotation of the H37Rv genome, as well as to prioritize genes for experimental start-site validation.
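As a rough illustration of the approach described above, the following Python sketch enumerates in-frame candidate start codons for one ORF and ranks them with a linear combination of per-candidate features. It is not the authors' implementation. The start-codon set (ATG, GTG, TTG), the feature names, the feature values, and the weights are all assumptions chosen for illustration; in the paper the weights are learned from start sites verified by mass spectrometry.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, annotated_start, stop):
    """List in-frame start-codon positions that share the ORF's stop codon.

    seq is the coding-strand DNA; annotated_start and stop are 0-based
    indices of the first base of the annotated start codon and of the
    stop codon, respectively.
    """
    seq = seq.upper()
    # Walk upstream in frame until a stop codon (or the sequence edge) is hit.
    first = annotated_start
    while first - 3 >= 0 and seq[first - 3:first] not in STOP_CODONS:
        first -= 3
    # Every start codon between that boundary and the stop is a candidate.
    return [i for i in range(first, stop, 3) if seq[i:i + 3] in START_CODONS]


def linear_score(features, weights, bias=0.0):
    """Weighted sum of feature values (the linear model's decision score)."""
    return bias + sum(weights[name] * value for name, value in features.items())


def best_start(candidates, weights):
    """Return the candidate position with the highest linear score."""
    return max(candidates, key=lambda pos: linear_score(candidates[pos], weights))


# Toy ORF: upstream stop, then three in-frame candidate starts, then a stop.
seq = "TAA" + "ATG" + "AAA" + "GTG" + "CCC" + "TTG" + "GGG" + "TGA"
positions = candidate_starts(seq, annotated_start=9, stop=21)

# Hypothetical hand-picked weights and feature values (not from the paper).
weights = {"ortholog_alignment": 2.0, "conservation": 1.0,
           "protein_likeness": 1.0, "rbs": 0.5}
features = {
    3:  {"ortholog_alignment": 1.0, "conservation": 0.8, "protein_likeness": 0.7, "rbs": 1.0},
    9:  {"ortholog_alignment": 0.0, "conservation": 0.9, "protein_likeness": 0.9, "rbs": 0.0},
    15: {"ortholog_alignment": 0.0, "conservation": 0.9, "protein_likeness": 0.6, "rbs": 0.0},
}
chosen = best_start({pos: features[pos] for pos in positions}, weights)
print(positions, chosen)
```

In this toy case the annotated start is at position 9, but the upstream candidate at position 3 scores highest because it aligns with ortholog start sites and has a putative ribosome binding site, so it would be flagged as a possible re-annotation.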