WITCH-NG: efficient and accurate alignment of datasets with sequence length heterogeneity.

Author information

Liu Baqiao, Warnow Tandy

Affiliation

Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA.

Publication information

Bioinform Adv. 2023 Mar 6;3(1):vbad024. doi: 10.1093/bioadv/vbad024. eCollection 2023.

DOI: 10.1093/bioadv/vbad024
PMID: 36970502
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10035637/
Abstract

SUMMARY

Multiple sequence alignment is a basic part of many bioinformatics pipelines, including in phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP one of the first methods to achieve good accuracy, and WITCH a recent improvement on UPP for accuracy. In this article, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) by a polynomial time exact algorithm using Smith-Waterman. Our new method, WITCH-NG (i.e. 'next generation WITCH') achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG.
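The summary states only that a heuristic search step in WITCH is replaced by a polynomial-time exact algorithm using Smith-Waterman; it does not describe that algorithm in detail. As background, the classic pairwise Smith-Waterman local alignment recurrence can be sketched as follows. This is a minimal illustration, not the WITCH-NG implementation: the function name `smith_waterman` and the scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Classic pairwise local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not parameters taken from WITCH-NG.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, `smith_waterman("TTACGTTT", "GGACGGG")` returns `(6, "ACG", "ACG")`. The table has `len(a) * len(b)` cells and each is filled in constant time, which is the sense in which a dynamic program of this kind is both exact and polynomial-time, in contrast to a heuristic search.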

AVAILABILITY AND IMPLEMENTATION

The datasets used in this study are from prior publications and are freely available in public repositories, as indicated in the Supplementary Materials.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9252/10035637/605d816d624e/vbad024f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9252/10035637/2ab0e4e36bea/vbad024f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9252/10035637/a921a47fecbe/vbad024f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9252/10035637/c0b2dfa2c7f1/vbad024f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9252/10035637/810278549579/vbad024f5.jpg
