Apostolico Alberto, Bock Mary Ellen, Lonardi Stefano
Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA.
J Comput Biol. 2003;10(3-4):283-311. doi: 10.1089/10665270360688020.
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subword of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
为了压缩数据、揭示结构、推断简洁描述、提取和分类特征等,人们以各种方式研究表征和检测诸如子串或基序等重复序列模式以及相关关联或规则的问题。在分子生物学中,生物序列中异常频繁或罕见的单词与生物功能和结构的各个方面都有关联。此类模式的发现,尤其是大规模的发现,带来了有趣的方法学和算法问题,并且常常揭示出表格和概要比它们旨在概括的原始序列增长得更快、更大的情况。在先前的研究中,简洁地计算、存储和显示异常子串的能力与单词子词的组合学和用于衡量与预期偏差的某些分数的局部单调性之间的微妙相互作用有关。在本文中,我们对更广泛的各种分数的此类单调性进行了广泛分析。这支持了能够在各种概率模型下,以与主题序列成线性的时间和空间对异常子串进行全局检测的数据结构和算法的构建。