Centre of Haemato-Oncology, Institute of Cancer, Bart's and the London School of Medicine (QMUL), Charterhouse Square, London EC1M 6BQ, UK.
BMC Evol Biol. 2010 Nov 9;10:343. doi: 10.1186/1471-2148-10-343.
Phylogenetic reconstruction methods based on gene content often place all the parasitic and endosymbiotic eubacteria (parasites for short) together in a clan. Many other lines of evidence point to this parasites clan being an artefact. This artefact could be a consequence of the methods used to construct ortholog databases (due to some unknown bias), the methods used to estimate the phylogeny, or both.
We test the idea that the parasites clan is an ortholog identification artefact by analyzing three different ortholog databases (COG, TRIBES, and OFAM), which were constructed using different methods, and are thus unlikely to share the same biases. In each case, we estimate a phylogeny using an improved version of the conditioned logdet distance method. If the parasites clan appears in trees from all three databases, it is unlikely to be an ortholog identification artefact.
Accelerated loss of a subset of gene families in parasites (a form of heterotachy) may contribute to the difficulty of estimating a phylogeny from gene content data. We test the idea that heterotachy is the underlying reason for the estimation of an artefactual parasites clan by applying two different mixture models (phylogenetic and non-phylogenetic), in combination with conditioned logdet. In these models, there are two categories of gene families, one of which has accelerated loss in parasites. Distances are estimated separately from each category by conditioned logdet. This should reduce the tendency for tree estimation methods to group the parasites together, if heterotachy is the underlying reason for estimation of the parasites clan.
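As a rough illustration of the distance that conditioned logdet builds on, the sketch below computes a paralinear (logdet-type) distance between two genomes from gene presence/absence vectors, restricted to gene families present in a conditioning genome. It is a minimal sketch under stated assumptions, not the improved method described here: the function name `conditioned_logdet`, the 0/1 input encoding, and the choice of the paralinear normalisation are all illustrative, and the mixture-model step is not shown.

```python
import numpy as np

def conditioned_logdet(x, y, ref):
    """Paralinear (logdet-type) distance between genomes x and y.

    x, y, ref are equal-length 0/1 vectors of gene family presence.
    Only families present in the conditioning genome `ref` are used
    (illustrative assumption about how conditioning is applied).
    """
    x, y, ref = (np.asarray(v, dtype=bool) for v in (x, y, ref))
    xs = x[ref].astype(int)
    ys = y[ref].astype(int)
    if xs.size == 0:
        raise ValueError("conditioning genome has no gene families")

    # 2x2 joint frequency matrix: F[i, j] = fraction of conditioned
    # families with state i in x and state j in y (0 absent, 1 present).
    F = np.zeros((2, 2))
    np.add.at(F, (xs, ys), 1.0)
    F /= xs.size

    det = np.linalg.det(F)
    if det <= 0:
        # The distance is undefined when the determinant is not positive.
        return float("inf")

    px = F.sum(axis=1)  # marginal state frequencies in x
    py = F.sum(axis=0)  # marginal state frequencies in y
    return float(-np.log(det / np.sqrt(px.prod() * py.prod())))

# Hypothetical toy data: eight gene families, all present in the
# conditioning genome.
ref = [1, 1, 1, 1, 1, 1, 1, 1]
a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 0, 0, 0, 1, 1, 1, 0]
d_ab = conditioned_logdet(a, b, ref)
d_aa = conditioned_logdet(a, a, ref)
```

With this normalisation, two identical presence/absence profiles give a distance of zero, and the distance is symmetric in the two genomes. A matrix of such pairwise distances could then be passed to any distance-based tree-building method.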
The parasites clan appears in conditioned logdet trees estimated from all three databases. This makes it less likely to be an artefact of database construction. The non-phylogenetic mixture model gives trees without a parasites clan. However, the phylogenetic mixture model still results in a tree with a parasites clan. Thus, it is not entirely clear whether heterotachy is the underlying reason for the estimation of a parasites clan. Simulation studies suggest that the phylogenetic mixture model approach may be unsuccessful because the model of gene family gain and loss it uses does not adequately describe the real data.
The most successful methods for estimating a reliable phylogenetic tree for parasitic and endosymbiotic eubacteria from gene content data are still ad hoc approaches such as the SHOT distance method. However, the improved conditioned logdet method we developed here may be useful for non-parasites and can be accessed at http://www.liv.ac.uk/~cgrbios/cond_logdet.html.