COATi: Statistical Pairwise Alignment of Protein-Coding Sequences.

Affiliations

The Biodesign Institute, Arizona State University, Tempe, AZ, USA.

Ira A. Fulton Schools of Engineering, Arizona State University, Tempe, AZ, USA.

Publication information

Mol Biol Evol. 2024 Jul 3;41(7). doi: 10.1093/molbev/msae117.

DOI:10.1093/molbev/msae117
PMID:38869090
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11255384/
Abstract

Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequencing artifacts and errors made during genome assembly, such as abiological frameshifts and incorrect early stop codons, can impact downstream analyses leading to erroneous conclusions in comparative and functional genomic studies. More significantly, while indels can occur both within and between codons in natural sequences, most amino-acid- and codon-based aligners assume that indels only occur between codons. This mismatch between biology and alignment algorithms produces suboptimal alignments and errors in downstream analyses. To address these issues, we present COATi, a statistical, codon-aware pairwise aligner that supports complex insertion-deletion models and can handle artifacts present in genomic data. COATi allows users to reduce the amount of discarded data while generating more accurate sequence alignments. COATi can infer indels both within and between codons, leading to improved sequence alignments. We applied COATi to a dataset containing orthologous protein-coding sequences from humans and gorillas and conclude that 41% of indels occurred between codons, agreeing with previous work in other species. We also applied COATi to semiempirical benchmark alignments and find that it outperforms several popular alignment programs on several measures of alignment quality and accuracy.
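The distinction the abstract draws between indels that fall within codons and indels that fall between codons can be illustrated with a short sketch. This is not COATi's algorithm or API; it only classifies the gaps of an already-computed pairwise alignment. The function name `indel_phases`, the event labels, and the assumption that the reference sequence is in frame from its first base are all hypothetical choices made for illustration.

```python
from itertools import groupby

GAP = "-"


def indel_phases(ref_aln, qry_aln):
    """Return (kind, length, phase) for every gap run in a pairwise alignment.

    kind   -- "I" for an insertion (gap in the reference),
              "D" for a deletion (gap in the query)
    length -- number of alignment columns in the gap run
    phase  -- number of reference bases before the gap, modulo 3;
              0 means the gap starts between codons, 1 or 2 means it
              starts within a codon

    Assumption: the reference is in frame starting at its first base.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")

    def label(column):
        ref_base, qry_base = column
        if ref_base == GAP and qry_base == GAP:
            raise ValueError("column with gaps in both sequences")
        if ref_base == GAP:
            return "I"
        if qry_base == GAP:
            return "D"
        return "M"  # match or mismatch

    events = []
    ref_pos = 0  # reference bases consumed so far
    for kind, run in groupby(zip(ref_aln, qry_aln), key=label):
        length = sum(1 for _ in run)
        if kind != "M":
            events.append((kind, length, ref_pos % 3))
        if kind != "I":  # insertions consume no reference bases
            ref_pos += length
    return events


if __name__ == "__main__":
    # Deletion of one whole codon (starts between codons).
    print(indel_phases("ATGAAACCCGGG", "ATG---CCCGGG"))
    # Deletion of three bases that starts inside a codon.
    print(indel_phases("ATGAAACCCGGG", "ATGA---CCGGG"))
```

A phase of 0 combined with a length divisible by 3 corresponds to the between-codon case that most amino-acid- and codon-based aligners assume; any other phase marks an indel that begins inside a codon, and a length not divisible by 3 indicates a frameshift.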


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4383/11255384/9981e2c54437/msae117f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4383/11255384/ca7f97c3f19a/msae117f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4383/11255384/b45a877d7e2e/msae117f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4383/11255384/cea677817a23/msae117f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4383/11255384/b17c76ab2869/msae117f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4383/11255384/f423f7f37e62/msae117f6.jpg

Similar articles

1. COATi: Statistical Pairwise Alignment of Protein-Coding Sequences.
Mol Biol Evol. 2024 Jul 3;41(7). doi: 10.1093/molbev/msae117.
2. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.
PLoS One. 2011;6(9):e22594. doi: 10.1371/journal.pone.0022594. Epub 2011 Sep 16.
3. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser.
BMC Bioinformatics. 2010 Nov 26;11:579. doi: 10.1186/1471-2105-11-579.
4. Aligning Protein-Coding Nucleotide Sequences with MACSE.
Methods Mol Biol. 2021;2231:51-70. doi: 10.1007/978-1-0716-1036-7_4.
5. Indel reliability in indel-based phylogenetic inference.
Genome Biol Evol. 2014 Nov 18;6(12):3199-209. doi: 10.1093/gbe/evu252.
6. The Cumulative Indel Model: Fast and Accurate Statistical Evolutionary Alignment.
Syst Biol. 2021 Feb 10;70(2):236-257. doi: 10.1093/sysbio/syaa050.
7. Empirical codon substitution matrix.
BMC Bioinformatics. 2005 Jun 1;6:134. doi: 10.1186/1471-2105-6-134.
8. Bayesian coestimation of phylogeny and sequence alignment.
BMC Bioinformatics. 2005 Apr 1;6:83. doi: 10.1186/1471-2105-6-83.
9. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.
10. [Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
