Hulsen Tim, de Vlieg Jacob, Groenen Peter M A
Centre for Molecular and Biomolecular Informatics (CMBI), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen, Nijmegen, The Netherlands.
BMC Bioinformatics. 2006 Sep 1;7:398. doi: 10.1186/1471-2105-7-398.
Phylogenetic patterns show the presence or absence of certain genes or proteins in a set of species. They can also be used to determine sets of genes or proteins that occur only in certain evolutionary branches. Phylogenetic pattern analysis has routinely been applied to protein databases such as COG and OrthoMCL, but not to gene databases. Here we present a tool named PhyloPat, which allows the complete Ensembl gene database to be queried using phylogenetic patterns.
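As a minimal illustration of the concept, a phylogenetic pattern can be written as a binary string with one position per species, where `1` marks presence and `0` marks absence. The species order, family names and presence data below are invented for illustration and are not taken from PhyloPat or Ensembl.

```python
# Hypothetical species order (assumption; not PhyloPat's actual ordering).
SPECIES = ["yeast", "fly", "zebrafish", "mouse", "human"]


def phylogenetic_pattern(present_in):
    """Return a binary string: '1' where the family occurs in a species, else '0'."""
    return "".join("1" if s in present_in else "0" for s in SPECIES)


# Toy presence/absence data (invented for illustration only).
families = {
    "famA": {"yeast", "fly", "zebrafish", "mouse", "human"},
    "famB": {"zebrafish", "mouse", "human"},
    "famC": {"human"},
}

patterns = {name: phylogenetic_pattern(sp) for name, sp in families.items()}


def families_matching(pattern):
    """Return the names of all families whose pattern equals the given one."""
    return sorted(name for name, p in patterns.items() if p == pattern)


# Families restricted to the (toy) vertebrate branch: zebrafish, mouse, human.
vertebrate_only = families_matching("00111")
```

Selecting a branch of the species tree then amounts to fixing which positions must be `1` and which must be `0`.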
PhyloPat is an easy-to-use webserver for querying the orthologies of all complete genomes within the EnsMart database using phylogenetic patterns. This enables the determination of sets of genes that occur only in certain evolutionary branches or even in single species. In total, we found 446,825 genes and 3,164,088 orthologous relationships within the EnsMart v40 database. We used a single linkage clustering algorithm to create 147,922 phylogenetic lineages from all of the orthologies provided by Ensembl. PhyloPat can be queried with either binary phylogenetic patterns (created by checkboxes) or regular expressions. Specific branches of a phylogenetic tree of the 21 included species can be selected to create a branch-specific phylogenetic pattern. Users can also input a list of Ensembl or EMBL IDs to check which phylogenetic lineage each gene belongs to. The output can be saved in HTML, Excel or plain text format for further analysis. A link to the FatiGO web interface has been incorporated in the HTML output, giving easy access to functional information. Finally, lists of omnipresent, polypresent and oligopresent genes have been included.
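The two steps described above, single linkage clustering of orthologous pairs into lineages and querying the resulting patterns with a regular expression, can be sketched as follows. This is an illustrative sketch under stated assumptions, not PhyloPat's implementation: the gene IDs, species order and ortholog pairs are invented, and union-find is simply one common way to compute single linkage clusters from pairwise links.

```python
import re
from collections import defaultdict

# Hypothetical species order and toy data (assumptions, not Ensembl content).
SPECIES = ["fly", "zebrafish", "mouse", "human"]
GENE_SPECIES = {
    "f1": "fly", "z1": "zebrafish", "m1": "mouse", "h1": "human",
    "m2": "mouse", "h2": "human", "h3": "human",
}
ORTHOLOGIES = [("f1", "z1"), ("z1", "m1"), ("m1", "h1"), ("m2", "h2")]


def single_linkage(genes, pairs):
    """Group genes so that any two genes joined by a chain of pairs share a cluster."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for g in genes:
        clusters[find(g)].add(g)
    # Sort for a deterministic order of lineages.
    return sorted(clusters.values(), key=lambda c: sorted(c))


def pattern(cluster):
    """Binary presence/absence string of a lineage across SPECIES."""
    present = {GENE_SPECIES[g] for g in cluster}
    return "".join("1" if s in present else "0" for s in SPECIES)


lineages = single_linkage(GENE_SPECIES, ORTHOLOGIES)
lineage_patterns = [(sorted(c), pattern(c)) for c in lineages]


def query(regex):
    """Return the lineages whose whole pattern matches the regular expression."""
    rx = re.compile(regex)
    return [genes for genes, p in lineage_patterns if rx.fullmatch(p)]


# Lineages absent from fly and zebrafish, present in human, mouse optional.
mammal_like = query("00.1")
```

Because single linkage merges clusters through any chain of pairwise links, a single spurious orthology can join two otherwise separate lineages, so the quality of the input orthologies matters.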
PhyloPat is the first tool to combine complete genome information with phylogenetic pattern querying. Since we used the orthologies generated by the accurate pipeline of Ensembl, the obtained phylogenetic lineages are reliable. The completeness and reliability of these phylogenetic lineages will further increase with the addition of newly found orthologous relationships within each new Ensembl release.