BMC Bioinformatics. 2013;14 Suppl 15(Suppl 15):S12. doi: 10.1186/1471-2105-14-S15-S12. Epub 2013 Oct 15.
Clustering sequences into families has long been an important step in the characterization of genes and proteins. Many algorithms have been developed for this purpose, most of which are based either on direct similarity between gene pairs or on a network structure in which the edge weights of the constructed graph are derived from similarity. However, conserved synteny is an important signal that can help identify homology, and it has not been utilized to its full potential.
Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets, one consisting of fully sequenced eukaryotic genomes and the other of synthetic data.
The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. Its accuracy and its definition of synteny scores are the most valuable contributions of GenFamClust.
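To make the idea of combining similarity with synteny concrete, the following is a minimal toy sketch in Python. It is not the GenFamClust implementation: the function names, the scoring rule (best similarity between flanking genes), the thresholds, and the single-linkage merging step are all assumptions chosen for illustration, and the data are invented.

```python
from itertools import product


def synteny_score(a, b, neighbors, sim):
    """Toy synteny score (an assumption, not the published definition):
    the best similarity between any neighbor of a and any neighbor of b."""
    return max(
        (
            sim.get(frozenset((g, h)), 0.0)
            for g, h in product(neighbors[a], neighbors[b])
            if g != h
        ),
        default=0.0,
    )


def cluster_families(genes, neighbors, sim, sim_min=0.5, syn_min=0.5):
    """Link two genes only if both their similarity and their toy synteny
    score pass hypothetical thresholds, then return connected components."""
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            s = sim.get(frozenset((a, b)), 0.0)
            if s >= sim_min and synteny_score(a, b, neighbors, sim) >= syn_min:
                parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return sorted(sorted(f) for f in families.values())


# Invented example: genes a1..a3 and b1..b3 lie in conserved order in two
# genomes; x1 resembles a1 in sequence but sits in an unrelated neighborhood.
genes = ["a1", "a2", "a3", "b1", "b2", "b3", "x1", "x2"]
neighbors = {
    "a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2"],
    "b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2"],
    "x1": ["x2"], "x2": ["x1"],
}
sim = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a2", "b2")): 0.8,
    frozenset(("a3", "b3")): 0.9,
    frozenset(("a1", "x1")): 0.9,
}
families = cluster_families(genes, neighbors, sim)
print(families)
```

In this toy setting, the pair a1 and x1 is not merged despite high sequence similarity, because their flanking genes share no similarity. A purely similarity-based rule with the same threshold would have linked them.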