Taylor William R
Francis Crick Institute, 1 Midland Road, London, NW1 1AT UK.
Algorithms Mol Biol. 2017 Sep 25;12:24. doi: 10.1186/s13015-017-0115-y. eCollection 2017.
To find correlated pairs of positions between proteins, which are useful in predicting interactions, it is necessary to concatenate two large multiple sequence alignments such that each joined pair of sequences corresponds to proteins that interact in their species of origin. When each protein is unique in its species, the species name is sufficient to guide this match; however, when there are multiple related sequences (paralogs) in each species, the pairing is more difficult. In bacteria, a good guide can be gained from genome co-location, as interacting proteins tend to be in a common operon, but in eukaryotes this simple principle is not sufficient.
The methods developed in this paper take sets of paralogs for different proteins found in the same species and make a pairing based on their evolutionary distance relative to a set of other proteins that are unique and so have a known relationship (singletons). The former constitute a set of unlabelled nodes in a graph while the latter are labelled. Two variants were tested, one based on a phylogenetic tree of the sequences (the topology-based method) and a simpler, faster variant based only on the inter-sequence distances (the distance-based method). Over a set of test proteins, both gave good results, with the topology method performing slightly better.
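As a rough illustration of the distance-based idea, the following Python sketch pairs the paralogs of two proteins within one species by comparing their distance profiles to singletons. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the input layout (one distance per reference species, in the same species order for both proteins), the absolute-difference score and the brute-force search over pairings are all hypothetical choices made for illustration.

```python
from itertools import permutations


def profile_mismatch(profile_a, profile_b):
    """Sum of absolute differences between two distance profiles."""
    return sum(abs(x - y) for x, y in zip(profile_a, profile_b))


def pair_paralogs(dist_a, dist_b):
    """Pair paralogs of protein A with paralogs of protein B in one species.

    dist_a[name] lists the distances from paralog `name` to the singleton
    copies of protein A in other species; dist_b does the same for
    protein B. Index i refers to the same species in both (an assumption
    of this sketch). Requires len(dist_a) <= len(dist_b).
    """
    names_a = sorted(dist_a)
    names_b = sorted(dist_b)
    if len(names_a) > len(names_b):
        raise ValueError("dist_a must not have more paralogs than dist_b")

    def total_mismatch(candidate):
        # Score one candidate assignment of B paralogs to the A paralogs.
        return sum(
            profile_mismatch(dist_a[a], dist_b[b])
            for a, b in zip(names_a, candidate)
        )

    # Exhaustive search: only practical for small paralog sets.
    best = min(permutations(names_b, len(names_a)), key=total_mismatch)
    return dict(zip(names_a, best))


# Toy distances (invented for illustration) to singletons in three species.
example_a = {"A1": [0.1, 0.5, 0.9], "A2": [0.8, 0.4, 0.2]}
example_b = {"B1": [0.7, 0.5, 0.3], "B2": [0.2, 0.5, 0.8]}
print(pair_paralogs(example_a, example_b))
```

The exhaustive search grows factorially with the number of paralogs, so a larger family would need an assignment solver instead; the sketch also ignores tree topology entirely, which is the information the topology-based variant uses.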
The methods developed here still need refinement and augmentation from constraints other than the sequence data alone, such as known interactions from annotation and databases, or non-trivial relationships in genome location. With the ever-growing number of eukaryotic genomes, it is hoped that the methods described here will open a route to the use of these data equal to the current success attained with bacterial sequences.