Suppr
超能文献

MotifCut：通过最大密度子图寻找调控基序

MotifCut: regulatory motifs finding with maximum density subgraphs.

作者信息

Fratkin Eugene, Naughton Brian T, Brutlag Douglas L, Batzoglou Serafim

机构信息

Department of Computer Science, Stanford University, California 94305, USA.

出版信息

Bioinformatics. 2006 Jul 15;22(14):e150-7. doi: 10.1093/bioinformatics/btl243.

DOI:10.1093/bioinformatics/btl243

PMID:16873465

Abstract

MOTIVATION

DNA motif finding is one of the core problems in computational biology, for which several probabilistic and discrete approaches have been developed. Most existing methods formulate motif finding as an intractable optimization problem and rely either on expectation maximization (EM) or on local heuristic searches. Another challenge is the choice of motif model: simpler models such as the position-specific scoring matrix (PSSM) impose biologically unrealistic assumptions such as independence of the motif positions, while more involved models are harder to parametrize and learn.

RESULTS

We present MotifCut, a graph-theoretic approach to motif finding leading to a convex optimization problem with a polynomial time solution. We build a graph where the vertices represent all k-mers in the input sequences, and edges represent pairwise k-mer similarity. In this graph, we search for a motif as the maximum density subgraph, which is a set of k-mers that exhibit a large number of pairwise similarities. Our formulation does not make strong assumptions regarding the structure of the motif and in practice both motifs that fit well the PSSM model, and those that exhibit strong dependencies between position pairs are found as dense subgraphs. We benchmark MotifCut on both synthetic and real yeast motifs, and find that it compares favorably to existing popular methods. The ability of MotifCut to detect motifs appears to scale well with increasing input size. Moreover, the motifs we discover are different from those discovered by the other methods.

AVAILABILITY

MotifCut server and other materials can be found at motifcut.stanford.edu.

摘要

动机

DNA 基序查找是计算生物学中的核心问题之一，针对此问题已开发出多种概率和离散方法。大多数现有方法将基序查找表述为一个难以处理的优化问题，并且要么依赖期望最大化（EM），要么依赖局部启发式搜索。另一个挑战是基序模型的选择：诸如位置特异性得分矩阵（PSSM）等较简单的模型会强加一些生物学上不现实的假设，例如基序位置的独立性，而更复杂的模型则更难进行参数化和学习。

结果

我们提出了MotifCut，一种用于基序查找的图论方法，它会导致一个具有多项式时间解的凸优化问题。我们构建一个图，其中顶点代表输入序列中的所有 k 元组，边代表成对的 k 元组相似性。在这个图中，我们将基序搜索为最大密度子图，即一组表现出大量成对相似性的 k 元组。我们的公式没有对基序的结构做出强有力的假设，并且在实践中，既适合 PSSM 模型的基序，也能找到那些在位置对之间表现出强依赖性的基序作为密集子图。我们在合成和真实的酵母基序上对MotifCut进行了基准测试，发现它与现有的流行方法相比具有优势。MotifCut检测基序的能力似乎随着输入大小的增加而扩展良好。此外，我们发现的基序与其他方法发现的基序不同。