University College London, London, United Kingdom.
Swiss Institute of Bioinformatics, Zurich, Switzerland.
PeerJ. 2014 Oct 7;2:e607. doi: 10.7717/peerj.607. eCollection 2014.
Orthology inference and other sequence analyses across multiple genomes typically start by performing exhaustive pairwise sequence comparisons, a process referred to as "all-against-all". As this process scales quadratically with the number of sequences analysed, this step can become a bottleneck, limiting the number of genomes that can be analysed simultaneously. Here, we explored ways of speeding up the all-against-all step while maintaining its sensitivity. By exploiting the transitivity of homology and, crucially, ensuring that homology is defined in terms of consistent protein subsequences, our proof-of-concept achieved a 4× speedup while recovering >99.6% of all homologs identified by the full all-against-all procedure on empirical sequence sets. In comparison, state-of-the-art k-mer approaches are orders of magnitude faster but recover only 3-14% of all homologous pairs. We also outline ideas to further improve the speed and recall of the new approach. An open source implementation is provided as part of the OMA standalone software at http://omabrowser.org/standalone.
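The idea of exploiting transitivity can be illustrated with a minimal sketch. It is not the OMA implementation: the function names, the greedy representative-based clustering, and the toy `same_family` predicate (which stands in for a real alignment-based homology test) are all assumptions made for illustration. The toy predicate is an exact equivalence relation, so the sketch ignores the requirement that homology be defined over consistent subsequences, which the abstract identifies as crucial for real proteins.

```python
from itertools import combinations


def all_against_all(seqs, is_homolog):
    """Baseline: test every unordered pair, n*(n-1)/2 comparisons."""
    pairs, comparisons = set(), 0
    for i, j in combinations(range(len(seqs)), 2):
        comparisons += 1
        if is_homolog(seqs[i], seqs[j]):
            pairs.add((i, j))
    return pairs, comparisons


def representative_search(seqs, is_homolog):
    """Hypothetical transitivity-based shortcut.

    Each cluster is a list of indices whose first element is its
    representative. A new sequence is compared with every representative,
    and with the other cluster members only when the representative matches.
    """
    clusters, pairs, comparisons = [], set(), 0
    for i, seq in enumerate(seqs):
        matched = []
        for cluster in clusters:
            comparisons += 1
            if is_homolog(seqs[cluster[0]], seq):
                matched.append(cluster)
        for cluster in matched:
            pairs.add((cluster[0], i))
            for j in cluster[1:]:
                comparisons += 1
                if is_homolog(seqs[j], seq):
                    pairs.add((j, i))
            cluster.append(i)
        if not matched:
            clusters.append([i])  # the sequence founds a new cluster
    return pairs, comparisons


def same_family(a, b):
    """Toy stand-in for a homology test: same leading family letter."""
    return a[0] == b[0]


# Toy data: three families (A, B, C) labelled by their first character.
seqs = ["A1", "A2", "A3", "B1", "B2", "C1"]
full_pairs, full_cost = all_against_all(seqs, same_family)
fast_pairs, fast_cost = representative_search(seqs, same_family)
```

On this toy input both functions report the same homologous pairs, while the shortcut performs fewer comparisons because sequences that fail to match a cluster representative are never compared with that cluster's members. With real sequences, homology is not perfectly transitive, so some pairs could be missed; this is consistent with the abstract reporting a recall slightly below 100%.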