Zhang De-Li, Li Yan-Da, Ji Liang
Key Laboratory of Ministry of Education on Bioinformatics, Institute of Bioinformatics, Department of Automation, School of Information Science and Technology, Tsinghua University, Beijing 100084, China.
Yi Chuan Xue Bao. 2004 Apr;31(4):325-34.
We found many errors in the REFSEQ records issued by the NCBI genome annotation project, which indicates that the NCBI REFSEQ database should be used with caution. Using a technical route that combines bioinformatics analysis with experimental verification, and comparing cloned genes against the non-redundant database, we found many errors in the computer-annotated human genome coding sequences released on the internet. We first listed nine types of error in the novel human genes predicted by the NCBI Genome Annotation Project. Here we describe one case, C17orf32, in detail: (1) Comparison of the sequences of the novel human gene C17orf32 and the hypothetical human gene LOC124919. LOC123722 is a modified sequence of the C17orf32 cDNA with a G inserted between nucleotides 406 and 407. The base G at position 401 of the LOC123722 cDNA is a redundant insertion, which shifts the reading frame and leads to translation of an alternative protein. This inserted G was not found in our experimental clone, is fully rejected by alignment against human ESTs, and is shown to be redundant by analysis of the genomic GT/AG organization. (2) Comparison of the sequences of the novel human gene C17orf32 and the hypothetical human gene LOC147007. The C17orf32 gene (ORF from nucleotide 31 to 657) is located on human chromosome 17 (Accession No. NT_010808.7) and is currently linked only to the hypothetical human gene LOC147007 (ORF from nucleotide 55 to 435). This hypothetical gene sequence has not been verified experimentally and is an erroneous form of our verified C17orf32 gene. The full-length 1679 bp cDNA sequence of C17orf32 shows overall homology to the 625 bp mRNA of LOC147007, with a matching percentage of 37% in 36% of the total window over the full-length nucleotide sequence; in particular, nucleotides 121-366 of LOC147007 are identical to nucleotides 316-561 of C17orf32.
Thus, the 126 aa protein XP_097165 encoded by LOC147007 shows overall homology to the 208 aa protein encoded by C17orf32, with a matching percentage of 50% in 48% of the total window over the full-length protein; in particular, residues 23-104 of XP_097165 are identical to residues 96-177 of the C17orf32 protein. The two flanking regions of LOC147007 outside this shared central part of the ORF are misassemblies of unrelated cDNA. In addition, we cloned in silico a novel mouse gene, ORF32 (open reading frame 32), with TPA accession number BK000258, which is the mouse ortholog of human C17orf32. Our strategy helps both to find more novel human genes and to correct the errors in the REFSEQs issued by the NCBI genome annotation project. For example, using a gene prediction method based on automatic computation and analysis, two model reference sequences (LOC124919 and LOC147007) were predicted from NCBI contig NT_010808. Both should be C17orf32, but in fact they are different erroneous forms of C17orf32, representing the first and second types of error, respectively. As another example, the same gene prediction approach yielded three model reference sequences (LOC14907, LOC200084 and LOC91126) from NCBI contig NT_004511. All three are in fact a single gene, ZNF362, but they were released as three different erroneous forms of ZNF362, representing the fourth, fifth and seventh types of error, respectively. Erroneous human genome coding sequences can be corrected or avoided by combining in silico cloning with experimental verification. Computer annotation should be treated with caution, since it may contain all types of erroneous human genome coding sequences. Correct identification and annotation of novel human genes remains a long and arduous task.
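The coordinates quoted for C17orf32 and LOC147007 can be cross-checked with simple arithmetic, and the effect of a single inserted base can be illustrated on a toy sequence. The following is a minimal sketch, not the authors' method. It assumes 1-based inclusive coordinates and that each quoted ORF range includes the stop codon. The function names and the toy sequence are hypothetical and are not taken from any database record.

```python
# Minimal sketch. Assumptions (not stated in the abstract): coordinates are
# 1-based and inclusive, and each quoted ORF range includes the stop codon.

def orf_protein_length(orf_start, orf_end):
    """Amino acids encoded by an ORF whose range includes the stop codon."""
    n_nt = orf_end - orf_start + 1
    if n_nt % 3:
        raise ValueError("ORF length is not a multiple of 3")
    return n_nt // 3 - 1  # subtract the stop codon


def residues_to_cdna(orf_start, first_aa, last_aa):
    """Map a 1-based residue range to the 1-based cDNA range encoding it."""
    start = orf_start + 3 * (first_aa - 1)
    end = orf_start + 3 * last_aa - 1
    return start, end


def insert_base(seq, after_pos, base):
    """Return seq with `base` inserted after 1-based position `after_pos`."""
    return seq[:after_pos] + base + seq[after_pos:]


def codons(seq):
    """Split seq into complete codons, reading from the first base."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Hypothetical toy ORF, invented for illustration only.
TOY_ORF = "ATGGCCAAATTTGGGTAA"

if __name__ == "__main__":
    # ORF ranges quoted in the abstract.
    print("C17orf32 protein length:", orf_protein_length(31, 657))
    print("LOC147007 protein length:", orf_protein_length(55, 435))

    # Shared protein segments mapped back onto each cDNA.
    print("XP_097165 aa 23-104 ->", residues_to_cdna(55, 23, 104))
    print("C17orf32 aa 96-177 ->", residues_to_cdna(31, 96, 177))

    # A single inserted base changes every codon downstream of it.
    shifted = insert_base(TOY_ORF, 6, "G")
    print(codons(TOY_ORF))
    print(codons(shifted))
```

Under these assumptions the quoted figures agree with one another: the ORF ranges give 208 and 126 residues, and the shared protein segments map back to nucleotides 121-366 of LOC147007 and 316-561 of C17orf32. In the toy sequence, the codons before the insertion are unchanged, while every later codon differs and the original stop codon is no longer in frame.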