Brocchieri Luciano, Kledal Thomas N, Karlin Samuel, Mocarski Edward S
Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA.
J Virol. 2005 Jun;79(12):7570-96. doi: 10.1128/JVI.79.12.7570-7596.2005.
Prediction of protein-coding regions and other features of primary DNA sequence has greatly contributed to experimental biology. Significant challenges remain in genome annotation methods, including the identification of small or overlapping genes and the assessment of mRNA splicing or unconventional translation signals in expression. We have employed a combined analysis of compositional biases and conservation together with frame-specific G+C representation to reevaluate and annotate the genome sequences of mouse and rat cytomegaloviruses. Our analysis predicts that there are at least 34 protein-coding regions in these genomes that were not apparent in earlier annotation efforts. These include 17 single-exon genes, three new exons of previously identified genes, a newly identified four-exon gene for a lectin-like protein (in rat cytomegalovirus), and 10 probable frameshift extensions of previously annotated genes. This expanded set of candidate genes provides an additional basis for investigation in cytomegalovirus biology and pathogenesis.
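The abstract names frame-specific G+C representation as one of the signals used but does not say how it is computed. As a minimal sketch, assuming it refers to the G+C fraction at each of the three codon positions within a chosen reading frame, the quantity could be tallied as below. The function name `frame_gc` and its parameters are hypothetical and are not taken from the paper.

```python
def frame_gc(seq: str, frame: int = 0) -> tuple:
    """Return the G+C fraction at codon positions 1, 2 and 3.

    Assumption: `frame` (0, 1 or 2) is the offset of the first codon
    in `seq`. Trailing bases that do not fill a codon are ignored, and
    any symbol other than A, C, G or T is skipped.
    """
    seq = seq.upper()
    gc = [0, 0, 0]      # G or C count per codon position
    total = [0, 0, 0]   # unambiguous base count per codon position

    usable = len(seq) - frame
    usable -= usable % 3  # keep whole codons only

    for i in range(usable):
        base = seq[frame + i]
        if base not in "ACGT":
            continue
        pos = i % 3
        total[pos] += 1
        if base in "GC":
            gc[pos] += 1

    # Report 0.0 for a position with no countable bases.
    return tuple(g / t if t else 0.0 for g, t in zip(gc, total))


if __name__ == "__main__":
    for frame in range(3):
        print(frame, frame_gc("ATGGCCGCGTAA", frame))
```

Comparing the three positional values across the candidate frames of an open reading frame is one way such a signal might help separate coding from non-coding frames; the thresholds and statistics used by the authors are not given in the abstract.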