Reduction, alignment and visualisation of large diverse sequence families.

Author information

Taylor William R

Affiliation

Francis Crick Institute, 1 Midland Rd., London, NW1 1AT, UK.

Publication information

BMC Bioinformatics. 2016 Aug 2;17(1):300. doi: 10.1186/s12859-016-1059-9.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4971687/

Abstract

BACKGROUND

Current volumes of sequence data can lead to large numbers of hits from a single search, typically in the range of tens to hundreds of thousands. It is often difficult to tell from these raw results whether the search has succeeded or has picked up sequences with little or no relationship to the query. The best approach to this problem is to cluster and align the resulting families. However, existing methods concentrate on fast clustering and either do not align the sequences or perform only a limited alignment.

RESULTS

A method (MULSEL) is presented that combines fast peptide-based pre-sorting with a subsequent cascade of mini-alignments, each of which is generated with a robust profile/profile method. From each of these mini-alignments, a representative sequence is selected based on a variety of intrinsic and user-specified criteria, which are combined to produce the sequence collection for the next cycle of alignment. For moderately sized sequence collections (tens of thousands), the method executes on a laptop computer within seconds or minutes.
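The pre-sort, mini-group, select-representative cycle described above can be illustrated with a toy sketch. This is not the MULSEL implementation: the profile/profile mini-alignment is replaced here by a simple shared-peptide (k-mer) Jaccard similarity, the selection criterion is reduced to picking the most central member of each group, and all function names and parameters (`kmers`, `presort`, `representative`, `reduce_family`, `batch`, `target`) are hypothetical names chosen for illustration.

```python
def kmers(seq, k=3):
    """Return the set of overlapping peptides (k-mers) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def presort(seqs, k=3):
    """Greedy peptide-based ordering: each sequence is followed by the
    remaining sequence that shares the most k-mers with it."""
    remaining = list(seqs)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        last = kmers(ordered[-1], k)
        best = max(range(len(remaining)),
                   key=lambda i: jaccard(last, kmers(remaining[i], k)))
        ordered.append(remaining.pop(best))
    return ordered


def representative(group, k=3):
    """Pick the member with the highest total similarity to the others
    (a stand-in for the paper's intrinsic and user-specified criteria)."""
    sets = [kmers(s, k) for s in group]

    def score(i):
        return sum(jaccard(sets[i], sets[j])
                   for j in range(len(group)) if j != i)

    return group[max(range(len(group)), key=score)]


def reduce_family(seqs, batch=3, target=2, k=3):
    """Repeat pre-sort -> mini-groups -> representatives until at most
    `target` sequences remain."""
    if batch < 2 or target < 1:
        raise ValueError("batch must be >= 2 and target must be >= 1")
    current = list(seqs)
    while len(current) > target:
        ordered = presort(current, k)
        current = [representative(ordered[i:i + batch], k)
                   for i in range(0, len(ordered), batch)]
    return current


# Two artificial sub-families of three sequences each.
family = ["MKVLAAGIV", "MKVLAAGIL", "MKVLSAGIV",
          "WWPTRQEDN", "WWPTRQEDK", "WWPTRQEDH"]
reduced = reduce_family(family, batch=3, target=2)
```

With these inputs the pre-sort places the members of each sub-family next to one another, so each group of three yields one representative per sub-family. The greedy ordering is quadratic in the number of sequences and is only meant to make the cycle concrete, not to reflect the speed reported for the method.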

CONCLUSIONS

MULSEL bridges a gap between fast clustering methods and slower multiple sequence alignment methods and provides a seamless transition from one to the other. Furthermore, it presents the resulting reduced family in a graphical manner that makes it clear if family members have been misaligned or if there are sequences present that appear inconsistent.
