How sequence alignment scores correspond to probability models.

Affiliations

Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan.

Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan.

Publication information

Bioinformatics. 2020 Jan 15;36(2):408-415. doi: 10.1093/bioinformatics/btz576.

DOI:10.1093/bioinformatics/btz576

PMID:31329241

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9883716/

Abstract

MOTIVATION

Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments.

RESULTS

This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment.
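As an illustration of the partition-function idea in the Results, here is a minimal sketch, not the paper's implementation. It assumes a toy scoring scheme (the `match`, `mismatch` and linear `gap` values are hypothetical), global alignment, and a counting convention in which an insertion followed by a deletion is distinct from the reverse order. Each alignment with score S gets weight exp(S / T), and the partition function Z is the sum of these weights over all alignments, so exp(S / T) / Z behaves as a probability for that alignment.

```python
import math


def partition_function(x, y, match=1.0, mismatch=-1.0, gap=-2.0, temperature=1.0):
    """Sum of exp(score / T) over all global alignments of x and y.

    Assumed toy scheme: fixed match/mismatch scores and a linear gap cost.
    """
    t = temperature
    w_gap = math.exp(gap / t)  # weight contributed by one gap column
    # z[i][j] = total weight of all alignments of x[:i] with y[:j]
    z = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    z[0][0] = 1.0  # the empty alignment has score 0, hence weight 1
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                # last column aligns x[i-1] with y[j-1]
                s = match if x[i - 1] == y[j - 1] else mismatch
                total += z[i - 1][j - 1] * math.exp(s / t)
            if i > 0:
                # last column aligns x[i-1] with a gap
                total += z[i - 1][j] * w_gap
            if j > 0:
                # last column aligns y[j-1] with a gap
                total += z[i][j - 1] * w_gap
            z[i][j] = total
    return z[len(x)][len(y)]


def alignment_probability(score, z, temperature=1.0):
    """Probability of one alignment with the given score, given Z."""
    return math.exp(score / temperature) / z
```

For example, `alignment_probability(1.0, partition_function("A", "A"))` gives the probability of the single-match alignment of "A" with "A" under these assumed scores. Lowering `temperature` concentrates probability on the highest-scoring alignment, while raising it spreads probability across alternatives.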

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
