Institut de Biologie Paris-Seine (IBPS), UPMC Université Paris 06, Sorbonne Universités, Paris, France.
Département de Sciences Biologiques, Université de Montréal, Montréal, QC, Canada.
Mol Biol Evol. 2018 Jan 1;35(1):252-255. doi: 10.1093/molbev/msx283.
Genes evolve by point mutations, but also by shuffling, fusion, and fission of genetic fragments. Similarity between two sequences can therefore reflect homology arising from common ancestry, partial sharing of component fragments, or both. Disentangling these processes is especially challenging in large molecular data sets because of the computational time required. In this article, we present CompositeSearch, a memory-efficient, fast, and scalable method to detect composite gene families in large data sets (typically in the range of several million sequences). CompositeSearch generalizes the use of similarity networks to detect composite and component gene families with greater recall, accuracy, and precision than recent programs (FusedTriplets and MosaicFinder). Moreover, CompositeSearch provides user-friendly quality descriptions of the distribution and primary sequence conservation of these gene families, allowing critical biological analyses of these data.
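The idea of reading composite genes off a similarity network can be illustrated with a minimal sketch. It is not the CompositeSearch algorithm, whose details are not given in this abstract; it only shows the general non-transitive triplet signal, in which a gene is similar to two other genes over separate regions while those two genes show no similarity to each other. The function name `find_composite_candidates`, the `(query, subject, q_start, q_end)` hit layout, and the `max_overlap` tolerance are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_composite_candidates(hits, max_overlap=20):
    """Return {gene: [(component_a, component_b), ...]} for non-transitive triplets.

    hits: iterable of (query, subject, q_start, q_end), where q_start..q_end
    is the aligned region on the query (1-based, inclusive). This layout is a
    hypothetical simplification of a pairwise similarity search output.
    """
    regions = defaultdict(dict)  # query -> {subject: (q_start, q_end)}
    similar = set()              # unordered pairs with any reported similarity

    for query, subject, q_start, q_end in hits:
        if query == subject:
            continue  # ignore self-hits
        # If a pair is reported several times, the last region wins.
        regions[query][subject] = (q_start, q_end)
        similar.add(frozenset((query, subject)))

    candidates = {}
    for gene, partners in regions.items():
        pairs = []
        for a, b in combinations(sorted(partners), 2):
            if frozenset((a, b)) in similar:
                continue  # a and b resemble each other: no composite signal
            (a_start, a_end), (b_start, b_end) = partners[a], partners[b]
            # Length of the shared stretch on the gene; negative means a gap.
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap <= max_overlap:
                pairs.append((a, b))
        if pairs:
            candidates[gene] = pairs
    return candidates


# Toy network: C matches A on residues 1-100 and B on residues 120-250,
# while A and B share no similarity.
toy_hits = [
    ("C", "A", 1, 100),
    ("C", "B", 120, 250),
    ("A", "C", 1, 100),
    ("B", "C", 1, 130),
]
print(find_composite_candidates(toy_hits))  # {'C': [('A', 'B')]}
```

This sketch compares every pair of partners for each gene, so its cost grows quadratically with the number of similar sequences per gene. That cost is the kind of scaling problem the abstract points to for data sets of several million sequences.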