Schultz J, Doerks T, Ponting C P, Copley R R, Bork P
[1] EMBL, Heidelberg, Germany. [2] Max-Delbrück-Center, Berlin-Buch, Germany.
Nat Genet. 2000 Jun;25(2):201-4. doi: 10.1038/76069.
Cloning procedures aided by homology searches of EST databases have accelerated the pace of discovery of new genes, but EST database searching remains an involved and onerous task. More than 1.6 million human EST sequences have been deposited in public databases, making it difficult to identify ESTs that represent new genes. Compounding the problems of scale are difficulties in detection associated with a high sequencing error rate and low sequence similarity between distant homologues. We have developed a new method, coupling BLAST-based searches with a domain identification protocol, that filters candidate homologues. Application of this method in a large-scale analysis of 100 signalling domain families has led to the identification of ESTs representing more than 1,000 novel human signalling genes. The 4,206 publicly available ESTs representing these genes are a valuable resource for rapid cloning of novel human signalling proteins. For example, we were able to identify ESTs of at least 106 new small GTPases, of which 6 are likely to belong to new subfamilies. In some cases, further analyses of genomic DNA led to the discovery of previously unidentified full-length protein sequences. This is exemplified by the in silico cloning (prediction of a gene product sequence using only genomic and EST sequence data) of a new type of GTPase with two catalytic domains.
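The filtering idea described in the abstract (BLAST-based hits that are then checked by a domain identification step) can be illustrated with a minimal sketch. The abstract does not give the actual protocol, so everything here is an assumption made for illustration: the `BlastHit` record and its fields, the `has_domain` callback standing in for the domain identification step, and the `max_evalue` threshold are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class BlastHit:
    """One hit of a domain-family query against an EST database (hypothetical record)."""
    family: str       # signalling domain family used as the query
    est_id: str       # identifier of the matched EST
    evalue: float     # BLAST expectation value of the hit
    known_gene: bool  # True if the EST maps to an already described gene


def filter_candidates(
    hits: Iterable[BlastHit],
    has_domain: Callable[[str, str], bool],
    max_evalue: float = 1e-3,  # assumed cutoff, not taken from the paper
) -> List[BlastHit]:
    """Keep hits that pass the similarity cutoff, do not correspond to a
    known gene, and are confirmed by the domain identification callback.
    Each EST is reported at most once."""
    seen = set()
    kept = []
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # too weak to count as a candidate homologue
        if hit.known_gene:
            continue  # only ESTs representing new genes are of interest
        if hit.est_id in seen:
            continue  # this EST has already been accepted
        if not has_domain(hit.est_id, hit.family):
            continue  # second filter: the domain must be identified
        seen.add(hit.est_id)
        kept.append(hit)
    return kept


if __name__ == "__main__":
    demo_hits = [
        BlastHit("small_GTPase", "EST1", 1e-10, False),
        BlastHit("small_GTPase", "EST2", 0.5, False),
        BlastHit("small_GTPase", "EST3", 1e-8, True),
    ]
    # Toy stand-in for a real domain search: accept every EST.
    for hit in filter_candidates(demo_hits, lambda est, fam: True):
        print(hit.est_id, hit.family, hit.evalue)
```

In practice the callback would wrap a real domain search tool; using two independent filters in sequence is what lets weak or error-prone EST matches be discarded before any manual inspection.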