DivergentSet, a tool for picking non-redundant sequences from large sequence collections.

Authors

Widmann Jeremy, Hamady Micah, Knight Rob

Affiliation

Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA.

Publication information

Mol Cell Proteomics. 2006 Aug;5(8):1520-32. doi: 10.1074/mcp.T600022-MCP200. Epub 2006 Jun 11.

DOI:10.1074/mcp.T600022-MCP200

PMID:16769708

Abstract

DivergentSet addresses the important but so far neglected bioinformatics task of choosing a representative set of sequences from a larger collection. We found that using a phylogenetic tree to guide the construction of divergent sets of sequences can be up to two orders of magnitude faster than the naive method of using a full distance matrix. By providing a user-friendly interface (available online) that integrates the tasks of finding additional sequences, building and refining the divergent set, producing random divergent sets from the same sequences, and exporting identifiers, this software facilitates a wide range of bioinformatics analyses, including finding significant motifs and covariations. As an example application of DivergentSet, we demonstrate that the motifs identified by the motif-finding package MEME (Multiple EM for Motif Elicitation) are highly unstable with respect to the specific choice of sequences. This instability suggests that the types of sensitivity analysis enabled by DivergentSet may be widely useful for identifying the motifs of biological significance.
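To make the "naive method" mentioned in the abstract concrete, the following is a minimal sketch of one way a divergent set could be built from pairwise distances alone. It is not DivergentSet's own code: the function names, the use of p-distance on pre-aligned sequences, and the `min_dist` threshold are all assumptions chosen for illustration. In the worst case it compares every sequence with every other, which is the quadratic cost that a tree-guided approach is meant to avoid.

```python
import random
from typing import Dict, List, Optional


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_divergent_set(
    seqs: Dict[str, str],
    min_dist: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Greedily pick identifiers so every chosen pair is >= min_dist apart.

    Hypothetical helper: visits sequences in input order, or in a random
    order when rng is given, and keeps a sequence only if it is far enough
    from everything already kept.
    """
    ids = list(seqs)
    if rng is not None:
        rng.shuffle(ids)  # a different order can yield a different set
    chosen: List[str] = []
    for sid in ids:
        if all(p_distance(seqs[sid], seqs[c]) >= min_dist for c in chosen):
            chosen.append(sid)
    return chosen


# Toy aligned sequences (made up for this example).
seqs = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",  # differs from "a" at one position only
    "c": "TTTTACGT",
    "d": "GGGGGGGG",
}
print(greedy_divergent_set(seqs, min_dist=0.3))
print(greedy_divergent_set(seqs, min_dist=0.3, rng=random.Random(0)))
```

Passing different random generators gives alternative sets from the same input, which is the kind of resampling that a sensitivity analysis of downstream motif finding would rely on.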
