Korbel Jan O, Doerks Tobias, Jensen Lars J, Perez-Iratxeta Carolina, Kaczanowski Szymon, Hooper Sean D, Andrade Miguel A, Bork Peer
European Molecular Biology Laboratory, Heidelberg, Germany.
PLoS Biol. 2005 May;3(5):e134. doi: 10.1371/journal.pbio.0030134. Epub 2005 Apr 5.
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far, no global analysis has attempted to explore those connections in light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently, we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes, we retrieve 323 clusters containing a total of 2,700 significant gene-phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases.
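The core idea of the abstract, linking a literature-derived term to genes that are specifically present in the genomes of the species sharing that term, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state which statistic the authors used, so a one-sided hypergeometric test on presence/absence profiles stands in for it, and the species labels, term names, gene names, and the `associate` and `hypergeom_tail` helpers are hypothetical toy inputs rather than data or code from the study.

```python
from math import comb


def hypergeom_tail(k, n_total, n_term, n_gene):
    """P(X >= k): chance that at least k of the n_gene genomes carrying a gene
    fall among the n_term genomes linked to a term, if the two were unrelated."""
    upper = min(n_term, n_gene)
    favourable = sum(
        comb(n_term, i) * comb(n_total - n_term, n_gene - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_gene)


def associate(term_species, gene_species, all_species, alpha=0.05):
    """Return (term, gene, overlap, p) tuples with p <= alpha, best first.

    term_species: term -> set of species whose literature mentions the term
    gene_species: gene family -> set of species whose genome contains it
    No multiple-testing correction is applied in this sketch.
    """
    n_total = len(all_species)
    hits = []
    for term, with_term in term_species.items():
        for gene, with_gene in gene_species.items():
            overlap = len(with_term & with_gene)
            p = hypergeom_tail(overlap, n_total, len(with_term), len(with_gene))
            if p <= alpha:
                hits.append((term, gene, overlap, p))
    return sorted(hits, key=lambda hit: hit[3])


# Hypothetical toy input: eight species labelled A to H (not real data).
all_species = set("ABCDEFGH")
term_species = {
    "motility": set("ABCD"),
    "spoilage": set("EF"),
}
gene_species = {
    "flagellin_like": set("ABCD"),   # present only where "motility" occurs
    "toxin_like": set("EF"),         # present only where "spoilage" occurs
    "housekeeping": set("ABCDEFGH"), # present everywhere, so uninformative
}

for term, gene, overlap, p in associate(term_species, gene_species, all_species):
    print(f"{term} ~ {gene}: shared genomes={overlap}, p={p:.4f}")
```

In this toy setting the ubiquitous gene is never reported, because a gene present in every genome cannot be specific to any phenotype. A real analysis over 92 genomes and many thousands of term-gene pairs would also need a multiple-testing correction and a clustering step, neither of which is sketched here.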