Widmann Jeremy, Hamady Micah, Knight Rob
Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA.
Mol Cell Proteomics. 2006 Aug;5(8):1520-32. doi: 10.1074/mcp.T600022-MCP200. Epub 2006 Jun 11.
DivergentSet addresses an important but so far neglected bioinformatics task: choosing a representative set of sequences from a larger collection. We found that using a phylogenetic tree to guide the construction of divergent sets of sequences can be up to two orders of magnitude faster than the naive method of using a full distance matrix. The software provides a user-friendly interface (available online) that integrates finding additional sequences, building and refining the divergent set, producing random divergent sets from the same sequences, and exporting identifiers. It thereby facilitates a wide range of bioinformatics analyses, including finding significant motifs and covariations. As an example application of DivergentSet, we demonstrate that the motifs identified by the motif-finding package MEME (Multiple EM for Motif Elicitation) are highly unstable with respect to the specific choice of sequences. This instability suggests that the types of sensitivity analysis enabled by DivergentSet may be widely useful for identifying motifs of biological significance.
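To make the "naive method of using a full distance matrix" concrete, the following is a minimal sketch of one way such a baseline could look. It is not DivergentSet's implementation: the function names `p_distance` and `divergent_subset`, the use of a simple mismatch fraction on pre-aligned sequences, and the greedy keep-if-far-enough rule are all assumptions made for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatched positions between two aligned, equal-length sequences.

    Assumption: a plain mismatch fraction stands in for whatever distance
    measure a real tool would use.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def divergent_subset(seqs, min_dist):
    """Greedy baseline: keep a sequence only if it is at least `min_dist`
    away from every sequence already kept.

    `seqs` maps identifiers to aligned sequences. The full pairwise matrix
    is computed up front, so the cost grows quadratically with the number
    of sequences, which is the cost a tree-guided approach tries to avoid.
    """
    ids = list(seqs)
    # Full pairwise distance matrix, keyed by unordered identifier pairs.
    dist = {
        frozenset((i, j)): p_distance(seqs[i], seqs[j])
        for i, j in combinations(ids, 2)
    }
    kept = []
    for i in ids:
        if all(dist[frozenset((i, k))] >= min_dist for k in kept):
            kept.append(i)
    return kept


example = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
selected = divergent_subset(example, 0.5)
print(selected)
```

In this toy input, `b` is too close to `a` and `d` is too close to `c`, so only `a` and `c` are kept. The result of a greedy rule like this depends on input order, which is one reason comparing several alternative divergent sets from the same sequences, as the abstract describes, can be informative.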