一种应用于压缩基因组重测序数据的“最长公共子序列问题”新算法。

A new algorithm for "the LCS problem" with application in compressing genome resequencing data.

作者信息

Beal Richard, Afrin Tazin, Farheen Aliya, Adjeroh Donald

机构信息

Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA.

出版信息

BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):544. doi: 10.1186/s12864-016-2793-0.

DOI:10.1186/s12864-016-2793-0

PMID:27556803

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5001248/

Abstract

BACKGROUND

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data.

METHODS

First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself.

RESULTS

Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011).

CONCLUSION

We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.

摘要

背景

最长公共子序列（LCS）问题是计算机科学中的一个经典问题，也是当前用于基因组重测序数据的最佳参考压缩方案的基础。

方法

首先，我们提出了一种解决LCS问题的新算法。利用广义后缀树，我们识别两个输入序列之间共享的公共子串。利用最大公共子串，我们构建一个有向无环图（DAG），并基于此将LCS确定为DAG中的最长路径。然后，我们引入一种基于LCS的参考压缩方案，该方案使用LCS的组成部分，而不是LCS本身。

结果

我们的基本方案将人类基因组（原始大小为3,080,436,051字节）压缩到15,460,478字节。对基本方法的改进进一步将其减少到8,556,708字节，总体压缩率为360。相比之下，之前的最先进压缩率分别为157（Wang和Zhang，2011）和171（Pinho、Pratas和Garcia，2011）。