Buzzega Giovanni, Conte Alessio, Grossi Roberto, Punzi Giulia
Dipartimento di Informatica, Università di Pisa, Largo Pontecorvo 3, 56127, Pisa, Italy.
Algorithms Mol Biol. 2025 Apr 19;20(1):6. doi: 10.1186/s13015-025-00271-z.
Analyzing and comparing sequences of symbols is among the most fundamental problems in computer science, possibly even more so in bioinformatics. Maximal Common Subsequences (MCSs), i.e., inclusion-maximal sequences of non-contiguous symbols common to two or more strings, have only recently received attention in this area, despite being a basic notion and a natural generalization of more common tools like Longest Common Substrings/Subsequences. In this paper we simplify and engineer recent advancements in MCSs into a practical tool called , the first publicly available tool that can index MCSs of real genomic data, and show that its definition can be generalized to multiple strings. We demonstrate that our tool can index pairs of sequences exceeding 10,000 base pairs within minutes, utilizing only 4-7% more than the minimum required nodes. For three or more sequences, we observe experimentally that the minimum index may exhibit a significant increase in the number of nodes.
分析和比较符号序列是计算机科学中最基本的问题之一,在生物信息学中可能更是如此。最大公共子序列(MCS),即两个或多个字符串共有的非连续符号的包含最大序列,尽管它是一个基本概念,并且是诸如最长公共子串/子序列等更常见工具的自然推广,但直到最近才在该领域受到关注。在本文中,我们将MCS的最新进展简化并设计成一个名为 的实用工具,这是第一个可公开获得的能够对真实基因组数据的MCS进行索引的工具,并表明其定义可以推广到多个字符串。我们证明,我们的工具可以在几分钟内对超过10,000个碱基对的序列对进行索引,使用的节点仅比所需的最少节点多4 - 7%。对于三个或更多序列,我们通过实验观察到,最小索引可能会使节点数量显著增加。