Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan; Department of Computational Biology and Medical Sciences, University of Tokyo, Chiba 277-8568, Japan; Computational Bio Big Data Open Innovation Laboratory, AIST, Tokyo 169-8555, Japan
Genome Res. 2024 Sep 20;34(8):1165-1173. doi: 10.1101/gr.279464.124.
The main way of analyzing genetic sequences is by finding sequence regions that are related to each other. There are many methods to do that, usually based on this idea: Find an alignment of two sequence regions, which would be unlikely to exist between unrelated sequences. Unfortunately, it is hard to tell if an alignment is likely to exist by chance. Also, the precise alignment of related regions is uncertain. One alignment does not hold all evidence that they are related. We should consider alternative alignments too. This is rarely done, because we lack a simple and fast method that fits easily into practical sequence-search software. Described here is the simplest-conceivable change to standard sequence alignment, which sums probabilities of alternative alignments and makes it easier to tell if a similarity is likely to occur by chance. This approach is better than standard alignment at finding distant relationships, at least in a few tests. It can be used in practical sequence-search software, with minimal increase in implementation difficulty or run time. It generalizes to different kinds of alignment, for example, DNA-versus-protein with frameshifts. Thus, it can widely contribute to finding subtle relationships between sequences.
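As a rough illustration of the idea in the abstract above, the following sketch contrasts a standard local-alignment recurrence, which keeps the maximum score, with a variant that adds up likelihood ratios over all alternative alignments. It is not the paper's exact formulation. The function name `align_max_and_sum`, the linear gap cost, the +1/-1 scores, and the restriction to alignments that start and end with an aligned pair are all assumptions made for brevity. The scale `lam = ln 3` is the ungapped scale for +1/-1 scores with uniform base frequencies; using it with gaps is also an assumption.

```python
import math

def align_max_and_sum(a, b, match=1, mismatch=-1, gap=2, lam=math.log(3)):
    """Return (best_score, summed_ratio) over non-empty local alignments.

    best_score: the usual maximum alignment score (Smith-Waterman style).
    summed_ratio: the sum of exp(lam * score) over all alternative
    alignments, i.e. the same recurrence with max replaced by +.
    Hypothetical helper for illustration, not code from the paper.
    """
    neg_inf = float("-inf")
    gap_ratio = math.exp(-lam * gap)   # likelihood ratio of one gap column
    # prev_*[j] describe alignment paths ending at cell (i-1, j):
    #   *_max = best score of any such path, *_sum = summed ratio of all
    prev_max = [neg_inf] * (len(b) + 1)
    prev_sum = [0.0] * (len(b) + 1)
    best, total = neg_inf, 0.0
    for i in range(1, len(a) + 1):
        cur_max = [neg_inf] * (len(b) + 1)
        cur_sum = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Paths whose last column aligns a[i-1] with b[j-1]:
            # either start here (0, or ratio 1) or extend a path at (i-1, j-1).
            x_max = s + max(0, prev_max[j - 1])
            x_sum = math.exp(lam * s) * (1.0 + prev_sum[j - 1])
            best = max(best, x_max)
            total += x_sum             # alignments may end at any aligned pair
            # Paths ending at (i, j) with either an aligned pair or a gap.
            cur_max[j] = max(x_max, prev_max[j] - gap, cur_max[j - 1] - gap)
            cur_sum[j] = x_sum + gap_ratio * (prev_sum[j] + cur_sum[j - 1])
        prev_max, prev_sum = cur_max, cur_sum
    return best, total

if __name__ == "__main__":
    best, total = align_max_and_sum("ACGTACGT", "ACGTTCGT")
    # The summed ratio, converted back to score units, is never below the
    # best single-alignment score, because the best alignment is one term.
    print(best, math.log(total) / math.log(3))
```

The two recurrences differ only in replacing `max` with addition (and scores with their exponentials), which is why the run time is essentially unchanged. For long sequences a practical version would need rescaling or log-space arithmetic to avoid floating-point overflow.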