Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan.
Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan.
Bioinformatics. 2020 Jan 15;36(2):408-415. doi: 10.1093/bioinformatics/btz576.
Sequence alignment remains fundamental in bioinformatics. Pairwise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but it can also be based on probability models (pair hidden Markov models, PHMMs). PHMMs enable us to fit the parameters to each kind of data, calculate the reliability of alignment parts, and measure sequence similarity integrated over possible alignments.
This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, the resulting probabilities correspond to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. These results clarify the statistical basis of sequence alignment.
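The score-to-probability conversion can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes global alignment, a linear gap score, and the usual Boltzmann weighting exp(score / temperature) for each alignment; the function names and the default match, mismatch and gap values are hypothetical. The partition function Z sums these weights over all alignment paths, and the probability of any one alignment is its weight divided by Z.

```python
import math


def log_add(*xs):
    """Return log(sum(exp(x) for x in xs)) without overflow."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_partition(a, b, match=1.0, mismatch=-1.0, gap=-2.0, temperature=1.0):
    """Log of the partition function over global alignments of a and b.

    Assumptions (illustrative only): linear gap score, and each alignment
    path weighted by exp(score / temperature). Adjacent insertions and
    deletions in different orders count as distinct paths.
    """
    n, m = len(a), len(b)
    logz = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    logz[0][0] = 0.0  # the empty alignment has weight 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0 and j > 0:  # align a[i-1] with b[j-1]
                s = match if a[i - 1] == b[j - 1] else mismatch
                terms.append(logz[i - 1][j - 1] + s / temperature)
            if i > 0:  # a[i-1] aligned to a gap
                terms.append(logz[i - 1][j] + gap / temperature)
            if j > 0:  # b[j-1] aligned to a gap
                terms.append(logz[i][j - 1] + gap / temperature)
            logz[i][j] = log_add(*terms)
    return logz[n][m]


def alignment_probability(score, a, b, temperature=1.0, **scores):
    """Probability of one alignment with the given total score."""
    return math.exp(score / temperature
                    - log_partition(a, b, temperature=temperature, **scores))
```

As the temperature approaches zero, `temperature * log_partition(...)` approaches the maximum alignment score, so the classic best-scoring alignment appears as the low-temperature limit of this sketch.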
Supplementary data are available at Bioinformatics online.