Nomura N, Miyajima N, Sazuka T, Tanaka A, Kawarabayasi Y, Sato S, Nagase T, Seki N, Ishikawa K, Tabata S
Institute of Gerontology, Nippon Medical School, Kanagawa, Japan.
DNA Res. 1994;1(1):27-35. doi: 10.1093/dnares/1.1.27.
We established a protocol for the prediction of the coding sequences of unidentified human genes based on the double selection and sequence analysis of cDNA clones with inserts carrying unreported 5'-terminal sequences and with insert sizes corresponding to nearly full-length transcripts. By applying the protocol, cDNA clones with inserts longer than 2 kb were isolated from a cDNA library of human immature myeloid cell line KG-1, and the coding sequences of 40 new genes were predicted. A computer search of the sequences indicated that 20 genes contained sequences similar to known genes in the GenBank/EMBL databases. The sequences of the remaining 20 genes were entirely new, and characteristic protein motifs or domains were identified in 32 genes. Other sequence features noted were that the coding sequences of 23 genes were followed by relatively long stretches of 3'-untranslated sequences and that 5 genes contained repetitive sequences in their 3'-untranslated regions. The chromosomal location of these genes has been determined. By increasing the scale of the above analysis, the coding sequences of many unidentified genes can be predicted.