A visual framework for sequence analysis using n-grams and spectral rearrangement.

Affiliation

Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD 4072, Australia.

Publication information

Bioinformatics. 2010 Mar 15;26(6):737-44. doi: 10.1093/bioinformatics/btq042. Epub 2010 Feb 3.

DOI:10.1093/bioinformatics/btq042

PMID:20130028

Abstract

MOTIVATION

Protein sequences are often composed of regions that have distinct evolutionary histories as a consequence of domain shuffling, recombination or gene conversion. New approaches are required to discover, visualize and analyze these sequence regions and thus enable a better understanding of protein evolution.

RESULTS

Here, we have developed an alignment-free and visual approach to analyze sequence relationships. We use the number of shared n-grams between sequences as a measure of sequence similarity and rearrange the resulting affinity matrix applying a spectral technique. Heat maps of the affinity matrix are employed to identify and visualize clusters of related sequences or outliers, while n-gram-based dot plots and conservation profiles allow detailed analysis of similarities among selected sequences. Using this approach, we have identified signatures of domain shuffling in an otherwise poorly characterized family, and homology clusters in another. We conclude that this approach may be generally useful as a framework to analyze related, but highly divergent protein sequences. It is particularly useful as a fast method to study sequence relationships prior to much more time-consuming multiple sequence alignment and phylogenetic analysis.
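The two core steps described above, scoring similarity by shared n-grams and reordering the affinity matrix spectrally, can be illustrated with a minimal sketch. It is not the MOSAIC implementation. The abstract does not say which spectral technique is used, so the sketch assumes one common choice: sorting sequences by the Fiedler vector of the graph Laplacian, computed here by shifted power iteration. The function names, the default n = 3, and the example sequences are all hypothetical.

```python
import math
import random
from collections import Counter


def ngrams(seq, n=3):
    """Multiset of overlapping n-grams in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def shared_ngrams(a, b, n=3):
    """Number of n-grams two sequences share (multiset intersection size)."""
    return sum((ngrams(a, n) & ngrams(b, n)).values())


def affinity_matrix(seqs, n=3):
    """Symmetric matrix of shared n-gram counts; no alignment is needed."""
    profiles = [ngrams(s, n) for s in seqs]
    m = len(seqs)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            A[i][j] = A[j][i] = float(sum((profiles[i] & profiles[j]).values()))
    return A


def spectral_order(A, iters=500, seed=0):
    """Order items by the Fiedler vector of L = D - A (assumed technique).

    Power iteration on (shift*I - L), with the constant vector projected
    out, converges to the eigenvector of L's smallest non-trivial eigenvalue.
    """
    m = len(A)
    # Degrees ignore self-affinity, which cancels in the Laplacian anyway.
    deg = [sum(A[i][j] for j in range(m) if j != i) for i in range(m)]
    shift = 2.0 * max(deg) + 1.0  # Gershgorin bound on L's eigenvalues
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    for _ in range(iters):
        mean = sum(v) / m
        v = [x - mean for x in v]  # stay orthogonal to the constant vector
        w = [(shift - deg[i]) * v[i]
             + sum(A[i][j] * v[j] for j in range(m) if j != i)
             for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return sorted(range(m), key=lambda i: v[i])


# Hypothetical toy input: two families of three sequences, interleaved.
seqs = [
    "MKVLAAGIVGLLLAQ", "GSHMTPEELRKKWDE",
    "MKVLAAGIVALLLAQ", "GSHMTPEDLRKKWDE",
    "MKVLSAGIVGLLLAQ", "GSHMTPEELRKRWDE",
]
A = affinity_matrix(seqs)
order = spectral_order(A)
# Rows and columns permuted so that related sequences sit next to each other;
# this is the matrix one would render as a heat map.
reordered = [[A[i][j] for j in order] for i in order]
```

In the reordered matrix, related sequences form contiguous blocks along the diagonal, which is what makes clusters and outliers visible in a heat map. Power iteration keeps the sketch dependency-free; for real data sets a dense or sparse eigensolver from a numerical library would be the more practical choice.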

AVAILABILITY

A software implementation (MOSAIC) of the framework described here can be downloaded from http://bioinformatics.org.au/mosaic/

CONTACT

m.ragan@uq.edu.au

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Efficient functional clustering of protein sequences using the Dirichlet process.

Bioinformatics. 2008 Aug 15;24(16):1765-71. doi: 10.1093/bioinformatics/btn244. Epub 2008 May 29.

An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments.

J Mol Biol. 2000 Aug 18;301(3):691-711. doi: 10.1006/jmbi.2000.3975.

PROMALS: towards accurate multiple sequence alignments of distantly related proteins.

Bioinformatics. 2007 Apr 1;23(7):802-8. doi: 10.1093/bioinformatics/btm017. Epub 2007 Jan 31.

Multiple alignment by sequence annealing.

Bioinformatics. 2007 Jan 15;23(2):e24-9. doi: 10.1093/bioinformatics/btl311.

Meta-DP: domain prediction meta-server.

Bioinformatics. 2005 Jun 15;21(12):2917-20. doi: 10.1093/bioinformatics/bti445. Epub 2005 Apr 19.

Multiple Alignment of protein structures and sequences for VMD.

Bioinformatics. 2006 Feb 15;22(4):504-6. doi: 10.1093/bioinformatics/bti825. Epub 2005 Dec 8.

Fast tandem mass spectra-based protein identification regardless of the number of spectra or potential modifications examined.

Bioinformatics. 2005 May 15;21(10):2177-84. doi: 10.1093/bioinformatics/bti362. Epub 2005 Mar 3.

The inference of protein-protein interactions by co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships.

Bioinformatics. 2005 Sep 1;21(17):3482-9. doi: 10.1093/bioinformatics/bti564. Epub 2005 Jun 30.

Cited by

Alignment-free sequence comparison: benefits, applications, and tools.

Genome Biol. 2017 Oct 3;18(1):186. doi: 10.1186/s13059-017-1319-7.

Alignment-free inference of hierarchical and reticulate phylogenomic relationships.

Brief Bioinform. 2019 Mar 22;20(2):426-435. doi: 10.1093/bib/bbx067.

From Molecular Phylogenetics to Quantum Chemistry: Discovering Enzyme Design Principles through Computation.

Comput Struct Biotechnol J. 2012 Nov 30;2:e201209018. doi: 10.5936/csbj.201209018. eCollection 2012.

Information theory applications for biological sequence analysis.

Brief Bioinform. 2014 May;15(3):376-89. doi: 10.1093/bib/bbt068. Epub 2013 Sep 20.

Mining for class-specific motifs in protein sequence classification.

BMC Bioinformatics. 2013 Mar 15;14:96. doi: 10.1186/1471-2105-14-96.

Functional biogeography of ocean microbes revealed through non-negative matrix factorization.

PLoS One. 2012;7(9):e43866. doi: 10.1371/journal.pone.0043866. Epub 2012 Sep 18.

The mammalian PYHIN gene family: phylogeny, evolution and expression.

BMC Evol Biol. 2012 Aug 7;12:140. doi: 10.1186/1471-2148-12-140.

A non-negative matrix factorization framework for identifying modular patterns in metagenomic profile data.

J Math Biol. 2012 Mar;64(4):697-711. doi: 10.1007/s00285-011-0428-2. Epub 2011 Jun 1.

N-gram analysis of 970 microbial organisms reveals presence of biological language models.

BMC Bioinformatics. 2011 Jan 10;12:12. doi: 10.1186/1471-2105-12-12.
