Sanchez Victoria, Peinado Antonio M, Pérez-Córdoba Jose L, Gómez Angel M
Department of Signal Theory, Networking and Communications, Universidad de Granada, 18071 Granada, Spain.
J Bioinform Comput Biol. 2015 Oct;13(5):1550024. doi: 10.1142/S0219720015500249. Epub 2015 Aug 21.
Most algorithms for extracting information from, and processing, the amino acid chains that make up proteins treat them as symbolic chains. Fewer algorithms exploit signal processing techniques, which require a numerical representation of the amino acid chain. However, such techniques are very powerful for extracting regularities that cannot be detected in a symbolic chain and that may be important for understanding the biological meaning of a sequence or for classification tasks. In this study, we propose a new mathematical representation of amino acid chains, derived from a similarity measure based on the PAM250 amino acid substitution matrix, that generates 20 signals for each protein sequence. Using this representation, 20 consensus spectra are determined for a protein family and the relevance of their frequency peaks is established, yielding a group of significant frequency peaks that reflect periodicities common to the amino acid sequences of that family. We also show that the proposed 20-signal representation can be integrated into Chou's pseudo amino acid composition (PseAAC), where it constitutes a useful alternative to amino acid physicochemical properties.
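As a rough illustration of the idea of turning one symbolic chain into 20 numerical signals, the sketch below assigns to each position of a sequence its similarity to each of the 20 reference amino acids, then computes a magnitude spectrum per signal. The abstract does not give the exact similarity measure or the spectral estimator, so both are assumptions here: the signal value is simply read from a substitution matrix, the spectrum is a plain DFT of the mean-removed signal, and `toy_matrix` is a hypothetical match/mismatch placeholder standing in for PAM250. The function names are likewise illustrative.

```python
import cmath

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def toy_matrix(alphabet=AMINO_ACIDS):
    """Placeholder similarity: 1 for a match, 0 otherwise. NOT the PAM250 values."""
    return {(a, b): 1 if a == b else 0 for a in alphabet for b in alphabet}


def similarity_signals(sequence, sub_matrix, alphabet=AMINO_ACIDS):
    """Build one signal per reference amino acid.

    signals[a][n] is the matrix score between reference residue `a` and the
    residue at position n of `sequence` (an assumed form of the similarity
    measure, not necessarily the one used in the paper).
    """
    return {a: [sub_matrix[(a, res)] for res in sequence] for a in alphabet}


def magnitude_spectrum(signal):
    """DFT magnitude of the mean-removed signal for bins k = 0 .. N//2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff))
    return spectrum


if __name__ == "__main__":
    seq = "AVAVAVAV"  # artificial sequence with period 2
    signals = similarity_signals(seq, toy_matrix())
    spec = magnitude_spectrum(signals["A"])
    peak_bin = max(range(len(spec)), key=spec.__getitem__)
    # Bin k corresponds to frequency k / N cycles per residue.
    print("peak frequency:", peak_bin / len(seq))
```

To approximate the representation described above, the placeholder would be replaced by a dictionary holding the actual PAM250 scores; a family-level consensus spectrum would additionally require combining the per-sequence spectra, a step the abstract does not specify.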