Rè Matteo, Mignone Flavio, Iacono Michele, Grillo Giorgio, Liuni Sabino, Pesole Graziano
Dipartimento di Scienze Biomolecolari e Biotecnologie, via Celoria 26, 20133 Milano, Italy.
Gene. 2006 Jan 3;365:35-40. doi: 10.1016/j.gene.2005.09.041. Epub 2005 Dec 15.
We present a novel methodology for identifying genome regions that potentially span one or more protein coding genes. It detects clusters of conserved sequence tags (CSTs) whose evolutionary dynamics suggest a likely protein coding role, namely an excess of synonymous substitutions at the nucleotide level and of conservative replacements at the protein level. A benchmark test carried out on 236 Mbp of human-mouse syntenic regions from human chromosomes 15, 21 and 22 identified 25 CST clusters potentially containing unannotated genes. A subsequent annotation update of the human genome assembly revealed that 11 of the 25 clusters actually contained a total of 20 validated genes, and that 10 of the remaining 14 clusters had several lines of experimental evidence supporting the presence of protein coding genes. These findings demonstrate the effectiveness and high prediction reliability of the proposed methodology, which could specifically be applied to the annotation of novel genome sequences.
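The two ideas named in the abstract, a synonymous-substitution signal within an aligned tag and the grouping of nearby tags into clusters, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their scoring or clustering rules, so the function names `count_codon_changes` and `cluster_positions`, the `max_gap` parameter, and the simple codon-by-codon comparison of an ungapped, in-frame alignment are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(seq_a, seq_b):
    """Count differing codons as synonymous or nonsynonymous.

    Assumes (hypothetically) two ungapped, in-frame, equal-length
    sequences over the alphabet A, C, G, T.
    Returns a (synonymous, nonsynonymous) tuple.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue  # identical codons carry no substitution signal
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1  # same amino acid: synonymous change
        else:
            nonsyn += 1  # different amino acid: nonsynonymous change
    return syn, nonsyn


def cluster_positions(positions, max_gap):
    """Group genomic start positions whose neighbours lie within max_gap bp.

    max_gap is an illustrative parameter, not a value from the paper.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters
```

In this sketch a tag whose differing codons are mostly synonymous would be treated as coding-like, and coding-like tags lying close together on the genome would be merged into one candidate cluster. A real analysis would also need to handle alignment gaps, reading-frame choice, and a statistical test for the substitution excess.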