Hsu Fan, Kent W James, Clawson Hiram, Kuhn Robert M, Diekhans Mark, Haussler David
Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
Bioinformatics. 2006 May 1;22(9):1036-46. doi: 10.1093/bioinformatics/btl048. Epub 2006 Feb 24.
The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data have been constructed. For each known gene, a details page provides rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for the human, mouse and rat genomes. The Known Genes dataset serves as a foundation for several key programs offered at the UCSC website: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser. All the associated data files and program source code are also available at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests that all of them could be further improved.
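The abstract does not say how genomic coverage was computed. As a minimal sketch of one common way to measure it, the Python below merges the overlapping intervals of a gene set and counts the bases they cover. The half-open coordinate convention, the function names and the toy intervals are all assumptions made for illustration; none of them come from the paper.

```python
from typing import Iterable, List, Tuple

# Assumption: half-open [start, end) coordinates on a single chromosome.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Count bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def coverage_fraction(intervals: Iterable[Interval], chrom_length: int) -> float:
    """Fraction of a chromosome of the given length that is covered."""
    return covered_bases(intervals) / chrom_length


# Hypothetical toy gene set, not data from the paper.
toy_gene_set = [(100, 200), (150, 300), (500, 600)]
print(merge_intervals(toy_gene_set))
print(covered_bases(toy_gene_set))
```

Merging first ensures that bases shared by overlapping transcripts are counted once, which is what makes coverage figures comparable across gene sets that differ in how many transcripts they report per locus.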