IEEE/ACM Trans Comput Biol Bioinform. 2018 Jan-Feb;15(1):117-128. doi: 10.1109/TCBB.2016.2620143. Epub 2016 Oct 21.
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, beside being faster, make it possible the analysis of longer sequences, even for high degrees of resolution, than state of the art algorithms.
熵与重复性和可压缩性密切相关,是一种广泛用于评估序列可预测程度的与信息相关的度量。熵谱基于信息论原理,可用于研究子词的欠/过表示,同时还提供有关保守 DNA 区域规模的信息。在这里,我们专注于与熵谱相关的算法方面。特别是,我们提出了基于后缀的基于数据结构的线性时间算法,更具体地说,是基于截断后缀树(TST)和增强后缀数组(ESA)。我们进行了广泛的实验活动,表明我们的算法不仅更快,而且可以分析更长的序列,即使对于更高的分辨率,也比最先进的算法能够实现。