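Probability distribution of intersymbol distances in symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins.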

Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain.

Phys Rev E. 2016 Nov;94(5-1):052302. doi: 10.1103/PhysRevE.94.052302. Epub 2016 Nov 4.

Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.
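The contrast between the large-sequence approximation and an exact finite-size expectation can be illustrated with a short sketch. The abstract does not state the authors' formula, so the expression below is an assumption: it is the standard combinatorial result for n occurrences placed uniformly at random among N positions without periodic boundaries, where the distance d between two consecutive occurrences has probability C(N-d, n-1)/C(N, n). The function names are hypothetical and chosen only for illustration.

```python
from math import comb


def exact_gap_pmf(N, n):
    """Finite-size distribution of the distance between consecutive occurrences.

    Assumption (not taken from the abstract): n occurrences are placed
    uniformly at random among N positions, with no periodic boundary.
    Then P(d) = C(N - d, n - 1) / C(N, n) for d = 1 .. N - n + 1.
    """
    if n < 2 or n > N:
        raise ValueError("need 2 <= n <= N to have at least one distance")
    total = comb(N, n)
    return {d: comb(N - d, n - 1) / total for d in range(1, N - n + 2)}


def geometric_gap_pmf(N, n, d_max):
    """Large-sequence approximation: geometric law with p = n / N."""
    p = n / N
    return {d: p * (1 - p) ** (d - 1) for d in range(1, d_max + 1)}


def gaps(sequence, symbol):
    """Observed distances between consecutive occurrences of `symbol`."""
    positions = [i for i, s in enumerate(sequence) if s == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    N, n = 20, 4  # a short sequence with few occurrences
    exact = exact_gap_pmf(N, n)
    approx = geometric_gap_pmf(N, n, d_max=N - n + 1)
    for d in sorted(exact):
        print(f"d={d:2d}  exact={exact[d]:.4f}  geometric={approx[d]:.4f}")
```

For short sequences and few occurrences the two columns differ visibly, and the geometric law assigns probability to distances that cannot occur in a sequence of length N. That is the regime the abstract singles out as problematic for the geometric and exponential approximations.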
