School of Computer Science and Technology, Xidian University, Xian, Shaanxi 710071, China.
School of Computer Science and Engineering, Guilin University of Aerospace Technology, Guilin, Guangxi 541004, China.
Bioinformatics. 2020 May 1;36(10):3035-3042. doi: 10.1093/bioinformatics/btaa134.
Searching the Longest Common Subsequences of many sequences is called a Multiple Longest Common Subsequence (MLCS) problem which is a very fundamental and challenging problem in many fields of data mining. The existing algorithms cannot be applicable to problems with long and large-scale sequences due to their huge time and space consumption. To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed.
In PRDAG, we transform the MLCS problem into searching the longest path from the Directed Acyclic Graph (DAG), where each longest path in DAG corresponds to an MLCS. To tackle the problem efficiently, we eliminate all redundant and repeated nodes during the construction of DAG, and for each node, we only maintain the longest paths from the source node to it but ignore all non-longest paths. As a result, the size of the DAG becomes very small, and the memory space and search time will be greatly saved. Empirical experiments have been performed on a standard benchmark set of both DNA sequences and protein sequences. The experimental results demonstrate that our model and algorithm outperform the related leading algorithms, especially for large-scale MLCS problems.
This program code is written by the first author and can be available at https://www.ncbi.nlm.nih.gov/nuccore and https://blog.csdn.net/wswguilin.
Supplementary data are available at Bioinformatics online.
搜索多个序列的最长公共子序列被称为多重最长公共子序列(MLCS)问题,这是数据挖掘领域的一个非常基本和具有挑战性的问题。由于时间和空间消耗巨大,现有的算法无法应用于长规模和大规模的序列问题。为了有效地处理大规模 MLCS 问题,提出了路径记录有向无环图(PRDAG)模型和一种新的路径记录算法(PRA)。
在 PRDAG 中,我们将 MLCS 问题转化为从有向无环图(DAG)中搜索最长路径,其中 DAG 中的每个最长路径对应一个 MLCS。为了有效地解决这个问题,我们在构建 DAG 时消除了所有冗余和重复的节点,并且对于每个节点,我们只维护从源节点到它的最长路径,而忽略所有非最长路径。因此,DAG 的大小变得非常小,内存空间和搜索时间将大大节省。在 DNA 序列和蛋白质序列的标准基准集上进行了实验。实验结果表明,我们的模型和算法优于相关的领先算法,特别是对于大规模 MLCS 问题。
该程序代码由第一作者编写,可以在 https://www.ncbi.nlm.nih.gov/nuccore 和 https://blog.csdn.net/wswguilin 上获得。
补充数据可在生物信息学在线获得。