del Val Coral, Mehrle Alexander, Falkenhahn Mechthild, Seiler Markus, Glatting Karl-Heinz, Poustka Annemarie, Suhai Sandor, Wiemann Stefan
Division of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, D-69120 Heidelberg, Germany.
Nucleic Acids Res. 2004 Feb 3;32(2):742-8. doi: 10.1093/nar/gkh257. Print 2004.
The wealth of transcript information that has been made publicly available in recent years requires the development of high-throughput functional genomics and proteomics approaches for its analysis. Such approaches need suitable data integration procedures and a high level of automation in order to gain maximum benefit from the results generated. We have designed an automatic pipeline to analyse annotated open reading frames (ORFs) stemming from full-length cDNAs produced mainly by the German cDNA Consortium. The ORFs are cloned into expression vectors for use in large-scale assays such as the determination of subcellular protein localization or kinase reaction specificity. Additionally, all identified ORFs undergo exhaustive bioinformatic analysis such as similarity searches, protein domain architecture determination and prediction of physicochemical characteristics and secondary structure, using a wide variety of bioinformatic methods in combination with the most up-to-date public databases (e.g. PRINTS, BLOCKS, INTERPRO, PROSITE, SWISSPROT). Data from experimental results and from the bioinformatic analysis are integrated and stored in a relational database (MS SQL-Server), which makes it possible for researchers to find answers to biological questions easily, thereby speeding up the selection of targets for further analysis. The designed pipeline constitutes a new automatic approach to obtaining and administering relevant biological data from high-throughput investigations of cDNAs in order to systematically identify and characterize novel genes, as well as to comprehensively describe the function of the encoded proteins.