Wells Jonathan N, Marsh Joseph A
MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.
Methods Mol Biol. 2019;1851:251-261. doi: 10.1007/978-1-4939-8736-8_13.
Reconstructing evolutionary relationships in repeat proteins is notoriously difficult because of the high degree of sequence divergence that typically occurs between duplicated repeats. This is complicated further by the fact that proteins with a large number of similar repeats are more likely to produce significant local sequence alignments than proteins with fewer copies of the repeat motif. Furthermore, biologically correct sequence alignments are sometimes impossible to achieve when insertion or translocation events disrupt the order of repeats in one of the sequences being aligned. Combined, these attributes make traditional phylogenetic methods for studying protein families unreliable for repeat proteins, because such methods depend on accurate sequence alignment. We present here a practical solution to this problem, making use of graph clustering combined with the open-source software package HH-suite, which enables highly sensitive detection of sequence relationships. Multiple rounds of homology searches, carried out via alignment of profile hidden Markov models, generate large sets of related proteins. By representing the relationships between proteins in these sets as graphs, subsequent clustering with the Markov cluster algorithm enables robust detection of repeat protein subfamilies.
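The graph clustering step named in the abstract can be illustrated with a minimal sketch of a Markov cluster (MCL) loop: alternate expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation) until the matrix stops changing, then read clusters from the surviving rows. This is not the authors' implementation and does not parse HH-suite output. The function name `mcl`, the protein labels, the edge weights, and the parameter defaults are all assumptions chosen for illustration. In practice one would more likely run the dedicated MCL program on edges derived from the homology search results.

```python
def mcl(edges, inflation=2.0, max_iter=100, tol=1e-6):
    """Cluster an undirected weighted graph with a basic MCL loop.

    edges: iterable of (node_a, node_b, weight) tuples (hypothetical input format).
    Returns a sorted list of clusters, each a sorted list of node names.
    """
    nodes = sorted({n for a, b, _ in edges for n in (a, b)})
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)

    # Symmetric adjacency matrix; self-loops are added, as is usual for MCL.
    m = [[0.0] * size for _ in range(size)]
    for a, b, w in edges:
        m[index[a]][index[b]] = w
        m[index[b]][index[a]] = w
    for i in range(size):
        m[i][i] = 1.0

    def normalise(mat):
        # Rescale every column so that it sums to 1 (column-stochastic matrix).
        for j in range(size):
            total = sum(mat[i][j] for i in range(size))
            for i in range(size):
                mat[i][j] /= total
        return mat

    m = normalise(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (flow along two-step random walks).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)
        ]
        # Inflation: raise entries to a power, then renormalise the columns.
        inflated = normalise([[v ** inflation for v in row] for row in expanded])
        change = max(
            abs(inflated[i][j] - m[i][j]) for i in range(size) for j in range(size)
        )
        m = inflated
        if change < tol:
            break

    # Each row that still carries flow is an attractor; its non-zero columns
    # are the members of its cluster. Duplicate clusters are merged via a set.
    clusters = set()
    for i in range(size):
        members = frozenset(nodes[j] for j in range(size) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity graph: two tightly linked groups of proteins
# joined by a single weak edge. Names and weights are invented.
example_edges = [
    ("protA", "protB", 1.0), ("protB", "protC", 1.0), ("protA", "protC", 1.0),
    ("protD", "protE", 1.0), ("protE", "protF", 1.0), ("protD", "protF", 1.0),
    ("protC", "protD", 0.1),
]
example_clusters = mcl(example_edges)
print(example_clusters)
```

The inflation parameter controls granularity: larger values tend to split the graph into more, smaller clusters, so it is the natural setting to vary when looking for subfamilies at different resolutions.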