Leung Henry C M, Siu M H, Yiu S M, Chin Francis Y L, Sung Ken W K
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong.
Comput Syst Bioinformatics Conf. 2007;6:111-9.
Finding motif pairs in a set of protein sequences from protein-protein interaction data is a challenging computational problem. Existing effective approaches usually rely on additional information, such as prior knowledge of protein groupings based on protein domains. In practice, this kind of knowledge is not always available, so novel approaches that do not use it are highly desirable. Recently, Tan et al. proposed such an approach. However, their approach has two problems. First, its scoring function (based on χ² testing) is inadequate: random motif pairs may score higher than the correct ones. Second, it is not scalable: it may take days to process a set of 5000 protein sequences with about 20,000 interactions. In this paper, our contribution is two-fold. We first introduce a new scoring method, which is shown to be more accurate than the χ² score used by Tan et al. We then present two efficient algorithms, an exact algorithm and a heuristic version of it, for finding motif pairs. Experiments on real datasets show that our algorithms are efficient and can accurately locate the motif pairs. We have also evaluated the sensitivity and efficiency of our heuristic algorithm on simulated datasets; the results show that the algorithm is very efficient with reasonably high sensitivity.
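To make the χ²-based scoring idea concrete, the following is a minimal sketch of how a χ² statistic could be computed for one candidate motif pair. It is an illustration under stated assumptions, not the formulation of Tan et al. and not the new scoring method proposed in this paper: motifs are assumed to be exact substrings, every unordered pair of distinct proteins is assumed to be one observation, and the function name `chi_square_motif_pair` and its parameters are hypothetical.

```python
from itertools import combinations


def chi_square_motif_pair(sequences, interactions, motif_a, motif_b):
    """Chi-square statistic for a candidate motif pair (illustrative sketch).

    sequences    -- dict mapping protein name to its sequence string
    interactions -- iterable of 2-tuples of interacting protein names
    motif_a/b    -- motifs, matched here as exact substrings (an assumption)
    """
    interacting = {frozenset(pair) for pair in interactions}

    # 2x2 contingency table:
    #   rows:    protein pair carries the motif pair (0 = yes, 1 = no)
    #   columns: protein pair interacts             (0 = yes, 1 = no)
    table = [[0, 0], [0, 0]]
    for p, q in combinations(sorted(sequences), 2):
        carries = (
            (motif_a in sequences[p] and motif_b in sequences[q])
            or (motif_a in sequences[q] and motif_b in sequences[p])
        )
        row = 0 if carries else 1
        col = 0 if frozenset((p, q)) in interacting else 1
        table[row][col] += 1

    total = sum(sum(r) for r in table)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            row_sum = sum(table[i])
            col_sum = table[0][j] + table[1][j]
            expected = row_sum * col_sum / total
            if expected > 0:  # skip empty margins to avoid division by zero
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, table
```

A statistic of this kind only measures how far the table deviates from independence, so a motif pair whose association arises by chance can still receive a large value. That is consistent with the concern raised above that random motif pairs may outscore the correct ones.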