High-quality sequence clustering guided by network topology and multiple alignment likelihood.

Affiliation

Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France.

Publication information

Bioinformatics. 2012 Apr 15;28(8):1078-85. doi: 10.1093/bioinformatics/bts098. Epub 2012 Feb 25.

DOI:10.1093/bioinformatics/bts098

PMID:22368255

Abstract

MOTIVATION

Proteins can be naturally classified into families of homologous sequences that derive from a common ancestor. The comparison of homologous sequences and the analysis of their phylogenetic relationships provide useful information regarding the function and evolution of genes. One important difficulty of clustering methods is to distinguish highly divergent homologous sequences from sequences that only share partial homology due to evolution by protein domain rearrangements. Existing clustering methods require parameters that have to be set a priori. Given the variability in the evolution pattern among proteins, these parameters cannot be optimal for all gene families.

RESULTS

We propose a strategy that aims at clustering sequences homologous over their entire length, and that takes into account the pattern of substitution specific to each gene family. Sequences are first all compared with each other and clustered into pre-families, based on pairwise similarity criteria, with permissive parameters to optimize sensitivity. Pre-families are then divided into homogeneous clusters, based on the topology of the similarity network. Finally, clusters are progressively merged into families, for which we compute multiple alignments, and we use a model selection technique to find the optimal tradeoff between the number of families and multiple alignment likelihood. To evaluate this method, called HiFiX, we analyzed simulated sequences and manually curated datasets. These tests showed that HiFiX is the only method robust to both sequence divergence and domain rearrangements. HiFiX is fast enough to be used on very large datasets.
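The first and last steps described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the HiFiX implementation: the abstract does not specify the similarity criteria, the network-topology step, or the model selection criterion. Here pre-families are taken to be connected components of a thresholded similarity graph, and merging is driven by a generic penalized score (sum of per-family log-likelihoods minus a penalty per family). The names `pre_families`, `greedy_merge`, `identity`, and `toy_log_likelihood`, and the `threshold` and `penalty` parameters, are all hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def pre_families(sequences, similarity, threshold):
    """Group sequences into connected components of the similarity graph.

    A permissive (low) threshold favours sensitivity, as in the text.
    """
    parent = {s: s for s in sequences}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sequences, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in sequences:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


def toy_log_likelihood(family):
    """Stand-in for an alignment likelihood (assumption):
    minus the number of pairwise mismatches inside the family."""
    return -sum(
        sum(x != y for x, y in zip(a, b))
        for a, b in combinations(sorted(family), 2)
    )


def greedy_merge(clusters, log_likelihood, penalty):
    """Merge clusters pairwise while a penalized score improves.

    score = sum(log_likelihood(family)) - penalty * number_of_families
    """
    families = [frozenset(c) for c in clusters]

    def score(fams):
        return sum(log_likelihood(f) for f in fams) - penalty * len(fams)

    while len(families) > 1:
        current = score(families)
        best = None
        for i, j in combinations(range(len(families)), 2):
            merged = [f for k, f in enumerate(families) if k not in (i, j)]
            merged.append(families[i] | families[j])
            s = score(merged)
            if s > current and (best is None or s > best[0]):
                best = (s, merged)
        if best is None:
            break  # no merge improves the trade-off
        families = best[1]
    return sorted(sorted(f) for f in families)


seqs = ["AAAA", "AAAT", "CCCC", "CCCG", "GGTT"]
print(pre_families(seqs, identity, threshold=0.5))
print(greedy_merge([["AAAA"], ["AAAT"], ["CCCC"]], toy_log_likelihood, penalty=2))
```

In this toy setting the penalty plays the role of the trade-off between the number of families and the likelihood: a larger penalty favours fewer, larger families, while a smaller one keeps divergent sequences apart.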

AVAILABILITY AND IMPLEMENTATION

The Python software HiFiX is freely available at http://lbbe.univ-lyon1.fr/hifix.
