Mining microorganism EST databases in the quest for new proteins.

Author information

Faria-Campos Alessandra Conceição, Cerqueira Gustavo Coutinho, Anacleto Charles, de Carvalho Cláudia Márcia Benedetto, Ortega José Miguel

Affiliation

Laboratório de Biodados, Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Av. Antônio Carlos, 6627, Pampulha, Caixa Postal 486, 31270-010 Belo Horizonte, MG, Brasil.

Publication information

Genet Mol Res. 2003 Mar 31;2(1):169-77.

PMID:12917813

Abstract

Microorganisms with large genomes are commonly the subjects of single-round partial sequencing of cDNA, generating expressed sequence tags (ESTs). Usually there is a great distance between gene discovery by EST projects and submission of amino acid sequences to public databases. We analyzed the relationship between available ESTs and protein sequences and used the sequences available in the secondary database, clusters of orthologous groups (COG), to investigate ESTs from eight microorganisms of medical and/or economic relevance, selecting for candidate ESTs that may be further pursued for protein characterization. The organisms chosen were Paracoccidioides brasiliensis, Dictyostelium discoideum, Fusarium graminearum, Plasmodium yoelii, Magnaporthe grisea, Emericella nidulans, Chlamydomonas reinhardtii and Eimeria tenella, which have more than 10,000 ESTs available in dbEST. A total of 77,114 protein sequences from COG were used, corresponding to 3,201 distinct genes. At least 212 of these were capable of identifying candidate ESTs for further studies (E. tenella). This number was extended to over 700 candidate ESTs (C. reinhardtii, F. graminearum). Remarkably, even the organism that presents the highest number of ESTs corresponding to known proteins, P. yoelii, showed a considerable number of candidate ESTs for protein characterization (477). For some organisms, such as P. brasiliensis, M. grisea and F. graminearum, bioinformatics has allowed for automatic annotation of up to about 20% of the ESTs that did not correspond to proteins already characterized in the organism. In conclusion, 4093 ESTs from these eight organisms that are homologous to COG genes were selected as candidates for protein characterization.
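The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity-search tool, the score cutoff, or any data format, so everything below is an assumption made for illustration: the `Hit` record, the `select_candidate_ests` function, the `max_evalue` threshold of 1e-10, and the example identifiers are all hypothetical. The idea is to keep each EST's best hit against a COG protein and then retain only ESTs whose COG is not already represented by a characterized protein from the same organism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One EST-versus-COG similarity hit (hypothetical record layout)."""
    est_id: str
    cog_id: str
    evalue: float


def select_candidate_ests(est_hits, known_protein_cogs, max_evalue=1e-10):
    """Return {est_id: cog_id} for ESTs that look homologous to a COG
    not yet covered by a known protein of the organism.

    max_evalue is an illustrative cutoff, not a value from the paper.
    """
    best = {}
    for hit in est_hits:
        # Discard hits too weak to suggest homology.
        if hit.evalue > max_evalue:
            continue
        # Keep only the strongest (lowest E-value) hit per EST.
        current = best.get(hit.est_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.est_id] = hit
    # Drop ESTs whose COG already has a characterized protein in this organism.
    return {
        est_id: hit.cog_id
        for est_id, hit in best.items()
        if hit.cog_id not in known_protein_cogs
    }


# Made-up example data, for illustration only.
hits = [
    Hit("EST001", "COG0001", 1e-30),
    Hit("EST001", "COG0002", 1e-12),
    Hit("EST002", "COG0050", 1e-25),
    Hit("EST003", "COG0100", 0.5),
]
known = {"COG0050"}
candidates = select_candidate_ests(hits, known)
print(candidates)
```

In this toy input, EST002 is dropped because its COG is already covered by a known protein, and EST003 is dropped because its hit is too weak, leaving EST001 as the only candidate.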
