Prosperi Mattia, Marini Simone, Boucher Christina
dept. of Epidemiology, University of Florida, Gainesville, FL (USA).
dept. of CISE, University of Florida, Gainesville, FL (USA).
Proc (IEEE Int Conf Healthc Inform). 2024 Jun;2024:93-102. doi: 10.1109/ichi61247.2024.00020. Epub 2024 Aug 22.
A problem extension of the longest common substring (LCS) between two texts is the enumeration of all LCSs given a minimum length (ALCS- ), along with their positions in each text. In bioinformatics, an efficient solution to the ALCS- for very long texts -genomes or metagenomes- can provide useful insights to discover genetic signatures responsible for biological mechanisms. The ALCS- problem has two additional requirements compared to the LCS problem: one is the minimum length , and the other is that all common strings longer than must be reported. We present an efficient, two-stage ALCS- algorithm exploiting the spectrum of text substrings of length ( -mers). Our approach yields a worst-case time complexity loglinear in the number of -mers for the first stage, and an average-case loglinear in the number of common -mers for the second stage (several orders of magnitudes smaller than the total -mer spectrum). The space complexity is linear in the first phase (disk-based), and on average linear in the second phase (disk- and memory-based). Tests performed on genomes for different organisms (including viruses, bacteria and animal chromosomes) show that run times are consistent with our theoretical estimates; further, comparisons with MUMmer4 show an asymptotic advantage with divergent genomes.
两个文本之间最长公共子串(LCS)的一个问题扩展是,在给定最小长度(ALCS - )的情况下,枚举所有LCS及其在每个文本中的位置。在生物信息学中,针对非常长的文本(基因组或宏基因组)的ALCS - 的有效解决方案,可以为发现负责生物机制的遗传特征提供有用的见解。与LCS问题相比,ALCS - 问题有两个额外的要求:一个是最小长度 ,另一个是必须报告所有长度大于 的公共字符串。我们提出了一种高效的两阶段ALCS - 算法,该算法利用长度为 ( - 聚体)的文本子串频谱。我们的方法在第一阶段产生的最坏情况时间复杂度与 - 聚体数量成对数线性关系,在第二阶段与公共 - 聚体数量成平均情况对数线性关系(比总 - 聚体频谱小几个数量级)。空间复杂度在第一阶段是线性的(基于磁盘),在第二阶段平均是线性的(基于磁盘和内存)。对不同生物体(包括病毒、细菌和动物染色体)的基因组进行的测试表明,运行时间与我们的理论估计一致;此外,与MUMmer4的比较表明,在分歧基因组方面具有渐近优势。