Jikuya Hiroyuki, Takano Jun, Kikuno Reiko, Hirosawa Makoto, Nagase Takahiro, Nomura Nobuo, Ohara Osamu
Kazusa DNA Research Institute, 2-6-7 Kazusa-kamatari, Chiba 292-0818, Japan.
DNA Res. 2003 Feb 28;10(1):49-57. doi: 10.1093/dnares/10.1.49.
To accumulate information on the coding sequences (CDSs) of unidentified genes, we have conducted a sequencing project of long human cDNA clones. The sequences of both ends of approximately 10,000 cDNA clones from two size-fractionated human spleen cDNA libraries (average sizes of 4.5 kb and 5.6 kb) were determined by single-pass sequencing to select cDNAs with unidentified sequences. We herein present the entire sequences of 81 cDNA clones, most of which were selected by two in silico approaches based on their protein-coding potential: 58 cDNA clones were selected because GeneMark analysis indicated protein-coding potential at the 5' end of their single-pass sequences, and 20 cDNA clones were selected because they were expected to encode proteins larger than 100 amino acid residues, based on GENSCAN gene prediction applied to the human genome sequence flanked by the two end sequences of each cDNA. In addition to these newly identified cDNAs, three cDNA clones were isolated by colony hybridization using probes corresponding to known gene sequences, since these cDNAs are likely to contain considerable new information about genes already annotated. The sequence data indicated that the average sizes of the inserts and the corresponding CDSs of the cDNA clones analyzed here were 5.0 kb and 2.0 kb (670 amino acid residues), respectively. Based on homology and motif searches against public databases, functional categories could be assigned to 29 predicted gene products; 86% of these (25 gene products) were classified as proteins related to cell signaling/communication, nucleic acid management, or cell structure/motility.
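The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original study: the helper name `cds_length_nt` and the dictionary keys are hypothetical, and the calculation assumes three nucleotides per residue plus one stop codon.

```python
def cds_length_nt(residues: int, include_stop: bool = True) -> int:
    """Nucleotide length of a CDS encoding `residues` amino acids.

    Assumption for illustration: 3 nt per codon, optionally one stop codon.
    """
    return 3 * (residues + (1 if include_stop else 0))


# Clone counts per selection route, as stated in the abstract.
clones = {"genemark": 58, "genscan": 20, "colony_hybridization": 3}
total_clones = sum(clones.values())

# Average CDS of 670 residues, converted to nucleotides and kilobases.
cds_nt = cds_length_nt(670)
cds_kb = round(cds_nt / 1000, 1)

# Share of the 29 categorized gene products that fell into the three
# named functional classes (25 gene products).
categorized_share = round(100 * 25 / 29)

# Fraction of the average 5.0 kb insert occupied by the average 2.0 kb CDS.
cds_fraction = 2.0 / 5.0

print(total_clones, cds_nt, cds_kb, categorized_share, cds_fraction)
```

Under these assumptions the three selection routes sum to the 81 clones reported, 670 residues correspond to roughly 2.0 kb of coding sequence, and 25 of 29 rounds to the stated 86%.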