Gotoh O
Saitama Cancer Center Research Institute, 818 Komuro Ina-machi, Saitama 362-0806, Japan.
Bioinformatics. 2000 Mar;16(3):190-202. doi: 10.1093/bioinformatics/16.3.190.
Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or with a profile of homologous family members.
A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon, or tron, code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence with a protein sequence or profile so that the spliced and translated sequence optimally matches the reference, in the same way as a standard protein sequence alignment that allows for long gaps. The objective function also takes into account frameshift errors, coding potentials, and translational initiation, termination and splicing signals. The method was tested on Caenorhabditis elegans genes of known structure. The prediction accuracy, measured as a correlation coefficient (CC) at the nucleotide level, was about 95% for the 288 genes tested and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by iterating gene prediction and reconstruction of the reference profile derived from the predicted sequences.
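The abstract reports accuracy as a nucleotide-level correlation coefficient but does not state the formula. The sketch below assumes the definition commonly used in gene-prediction benchmarks, in which each nucleotide is classified as coding or non-coding and the CC is computed from the resulting true/false positive and negative counts. The function names `coding_mask` and `nucleotide_cc` and the 0-based half-open exon coordinates are illustrative choices and are not taken from the 'aln' program.

```python
from math import sqrt


def coding_mask(length, exons):
    """Return a per-nucleotide list of booleans, True where the position is coding.

    `exons` is a list of (start, end) pairs, 0-based and half-open
    (an assumed convention for this sketch).
    """
    mask = [False] * length
    for start, end in exons:
        for i in range(start, end):
            mask[i] = True
    return mask


def nucleotide_cc(predicted, actual):
    """Correlation coefficient between predicted and actual coding masks.

    Assumed formula (not given in the abstract):
    CC = (TP*TN - FP*FN) / sqrt((TP+FP)(TN+FN)(TP+FN)(TN+FP))
    """
    if len(predicted) != len(actual):
        raise ValueError("masks must have the same length")
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    if denom == 0:
        # CC is undefined when a mask is all coding or all non-coding;
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example: a 10-nucleotide sequence with one true exon at positions 2..4
# and a prediction shifted by one nucleotide.
actual = coding_mask(10, [(2, 5)])
shifted = coding_mask(10, [(3, 6)])
cc_perfect = nucleotide_cc(actual, actual)
cc_shifted = nucleotide_cc(shifted, actual)
```

Under this definition a perfect prediction gives a CC of 1, and the value falls as exon boundaries are misplaced, so a reported value of about 95% would correspond to a CC near 0.95.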
The source code for the program 'aln', written in ANSI-C, and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc.