Zhang De-Li, Ji Liang, Li Yan-Da
Key Laboratory of the Ministry of Education on Bioinformatics, Institute of Bioinformatics, Department of Automation, School of Information Science and Technology, Tsinghua University, Beijing 100084, China.
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
Through homology BLAST searches of our cloned genes against the non-redundant (nr) database, we found that computer-annotated human genome coding regions in the public domain contain many errors of different kinds, including insertions, deletions, or mutations of a single base pair or of a sequence segment at the cDNA level, or various combinations of these errors.
We used three primary means to validate and identify errors in the model genes that appear in the NCBI GENOME ANNOTATION PROJECT REFSEQs:
(1) Evaluating the degree of support from human EST clustering and from BLAST against the draft human genome.
(2) Chromosomal mapping of our verified genes and analysis of their genomic organization. All exon/intron boundaries should be consistent with the GT/AG rule, and the consensus sequences surrounding the splice boundaries should be found as well.
(3) Experimental verification of the genes cloned in silico by RT-PCR, followed by cDNA sequencing.
We then used three further means as reference:
(1) Web searching or in silico cloning of the genes of different species, especially the homologous mouse and rat genes, and thus judging the existence of the gene by ontology.
(2) Taking as the standard the released genes in the public domain that should be highly homologous to our verified genes, especially the released human genes in the NCBI GENOME ANNOTATION PROJECT REFSEQs, we tried to clone, for each of them, a complete and highly homologous gene similar to the released gene, following the strategy developed in this paper. If no such gene can be obtained, our verified gene may be correct and the released gene in the public domain may be wrong.
(3) To find more evidence, we verified our cloned genes by RT-PCR or hybridization techniques.
Here we list some of the errors we found in the NCBI GENOME ANNOTATION PROJECT REFSEQs:
(1) A base is mistakenly inserted in the ORF, causing a frame shift in the encoded amino acid sequence. In detail, one base in the ORF of the gene is a redundant insertion, which shifts the reading frame and leads to translation of an alternative protein. For example, LOC124919 is a wrong form of C17orf32 (whose mouse and rat orthologs we determined).
(2) Segments are mistakenly (forcibly) joined together, that is, unrelated cDNA segments are wrongly assembled. For example, LOC147007 is a wrong form of C17orf32.
(3) A base or a cDNA segment is mistakenly inserted in the ORF, causing it to terminate prematurely, so that the incomplete cDNA sequence encodes only the N-terminal amino acids. For example, LOC123722 is a wrong form of SPRYD1, and even the human hypothetical gene LOC126250 or PDCD5 is a wrong form of our PDCD5 (TFAR19).
(4) The sequence is incomplete and encodes only the C-terminal amino acids. For example, human LOC149076 and mouse LOC230761 are wrong forms of our verified human ZNF362 and mouse Zfp362, respectively.
(5) The sequence is incomplete and encodes only an internal section of the correct gene's ORF, lacking both the N-terminal and the C-terminal amino acid sequences; at the same time, the first amino acid of the incomplete protein, which is not encoded by an initiation codon, is mistakenly predicted as the initiator, e.g. L is predicted as M. For example, LOC200084 is a wrong form of ZNF362.
(6) A base or a cDNA segment is mistakenly inserted in the ORF, wrongly creating an unwanted stop codon before the insertion, so that the encoded protein lacks its first part. For example, GenBank Acc. No. AL096883 (LOCUS No. HS323M22B) is a wrong form of the experimentally verified human NM_012263, whose mouse ortholog was determined to be BC010510.
(7) Contaminating genomic sequence may be treated as a complete gene cDNA sequence, so that a so-called single-exon gene is predicted. Even where a real gene exists, the prediction is only a small ORF within a very long single-exon mRNA, there is an in-frame stop codon upstream of the ORF initiation codon, and no other features are consistent with a gene. For example, LOC91126 is a wrong form of ZNF362.
(8) The predicted gene consists only of an ORF that has no EST support at either end. Starting from this ORF, a complete gene cDNA supported by both ESTs and the human genome can be obtained (with in-frame stop codons upstream of the ORF), which indicates that the predicted ORF reference sequence may be incorrect. For example, LOC164395 may be a wrong form of the novel human gene bankit4590055.
(9) A similar but smaller protein-coding gene is predicted within a region of the human genome sequence that has EST experimental support, so the newly predicted gene may be incorrect. For example, LOC167563 may be a wrong form of CMYA5.
However, these errors can be corrected or avoided by using our strategy. Here we give one example in detail: a comparison of the SPRYD1 sequence with the human hypothetical gene LOC123722. The TAA bases at positions 478-480 of the LOC123722 cDNA are redundant, which causes a reading frame shift in the translation of an alternative protein. The redundant GTAAA of LOC123722 is not supported by our experimental clone, is almost entirely rejected by human EST alignment, and is shown by genomic GT/AG organization analysis to belong to the following intron sequence. Verification of the cDNA and genomic DNA sequences of SPRYD1 implies that LOC123722 has a wrong stop codon within its ORF, introduced by the prediction program, and is therefore not a complete CDS.
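Two of the checks described above, the GT/AG rule at intron boundaries and the search for an unwanted in-frame stop codon, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the procedure used in this work: the function names, the toy sequences, and the coordinates are hypothetical, and the sequences are not the real SPRYD1 or LOC123722 data.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cdna, orf_start=0):
    """Return the 0-based index of the first in-frame stop codon
    at or after orf_start, or None if the frame stays open."""
    seq = cdna.upper()
    for i in range(orf_start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def obeys_gt_ag(genomic, intron_start, intron_end):
    """True if genomic[intron_start:intron_end] starts with GT and ends with AG."""
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical verified ORF: ATG GCT GAA CCG TGG AAA TAA
verified = "ATGGCTGAACCGTGGAAATAA"

# Hypothetical genomic locus: the ORF interrupted after base 8 by a toy intron
# that begins with GTAAA and ends with AG.
intron = "GTAAATTTCCTTTCAG"
genomic = verified[:8] + intron + verified[8:]

# Hypothetical mispredicted cDNA: the first five intron bases (GTAAA)
# are wrongly retained at the exon junction.
predicted = verified[:8] + intron[:5] + verified[8:]

print(first_in_frame_stop(verified))   # 18: the genuine stop codon
print(first_in_frame_stop(predicted))  # 9: premature stop from the retained TAA
print(obeys_gt_ag(genomic, 8, 8 + len(intron)))  # True
```

In this toy case the retained intron bases place a TAA in frame, so the predicted ORF ends after three codons, while the splice-site check confirms that the extra bases sit at a GT...AG intron boundary in the genomic sequence.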
To sum up, by combining bioinformatics analyses with experimental verification, and through BLAST searches of our cloned genes against the non-redundant database, we found errors of at least nine kinds in the NCBI GENOME ANNOTATION PROJECT REFSEQs, and our strategy is helpful in correcting them. Examples are LOC149076, LOC200084 and LOC91126 (all three should be ZNF362 but are three different kinds of wrong forms of ZNF362), three model reference sequences predicted from NCBI contig NT_004511 by automated computational analysis using a gene prediction method, and LOC124919 and LOC147007 (both should be C17orf32 but are two different kinds of wrong forms of C17orf32), two model reference sequences predicted from NCBI contig NT_010808 by automated computational analysis using a gene prediction method. Therefore, the correct identification and annotation of novel human genes may still be a heavy task that will take a long time to complete, and human genome coding regions annotated by computer should be used with caution. Previously published articles did not clearly point out the existence of mistakes in the NCBI human gene model reference sequences. At the Seventh International Human Genome Conference, held in April 2002, we first presented our results on this topic as a poster.
(ABSTRACT TRUNCATED)