IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):292-300. doi: 10.1109/TCBB.2017.2765331. Epub 2017 Oct 23.
Protein remote homology detection and fold recognition are two critical tasks in the study of protein structure and function. Profile-based methods currently achieve state-of-the-art performance on both tasks. However, the widely used sequence profiles, such as the position-specific frequency matrix (PSFM) and the position-specific scoring matrix (PSSM), ignore sequence-order effects along the protein sequence. In this study, we propose a novel profile, called the sequence-order frequency matrix (SOFM), which extracts the sequence-order information of neighboring residues from a multiple sequence alignment (MSA). Combining SOFMs with two profile feature extraction approaches, top-n-grams and the Smith-Waterman algorithm, we apply them to protein remote homology detection and fold recognition and propose two predictors, SOFM-Top and SOFM-SW. Experimental results show that SOFM carries more information content than other profiles and that the two predictors outperform other state-of-the-art methods. We anticipate that SOFM will become a useful profile for the study of protein structure and function.
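To make the contrast between a per-column profile and neighboring-residue information concrete, the following minimal Python sketch computes a PSFM-style column frequency table from a toy MSA, alongside a table of adjacent residue-pair frequencies. This is only an illustration: the abstract does not give the SOFM definition, so the pair-frequency table here is an assumed stand-in, not the authors' formulation. The function names `psfm` and `neighbor_pair_frequencies` and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def psfm(msa):
    """Per-column residue frequencies; gaps and unknown symbols are skipped."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in msa if row[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


def neighbor_pair_frequencies(msa):
    """Assumed illustration (not the paper's SOFM): for each pair of adjacent
    columns (i, i+1), the frequency of each residue pair seen across rows.
    Rows with a gap or unknown symbol at either position are skipped."""
    length = len(msa[0])
    pairs = []
    for col in range(length - 1):
        counts = Counter(
            row[col] + row[col + 1]
            for row in msa
            if row[col] in AMINO_ACIDS and row[col + 1] in AMINO_ACIDS
        )
        total = sum(counts.values())
        pairs.append({pair: n / total for pair, n in counts.items()})
    return pairs


# Hypothetical toy alignment: four aligned sequences, one gap.
toy_msa = ["ACDA", "ACEA", "GCDA", "AC-A"]
column_profile = psfm(toy_msa)
pair_profile = neighbor_pair_frequencies(toy_msa)
```

In this toy alignment, `column_profile` treats each column independently, whereas `pair_profile` records which residues co-occur at adjacent positions in the same sequence. That co-occurrence is the kind of order information a column-wise profile cannot represent.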