School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, 230601, China.
School of Computer Science, Northwestern Polytechnical University, Xi'an, 710129, Shaanxi, China.
BMC Bioinformatics. 2022 Dec 1;23(Suppl 7):518. doi: 10.1186/s12859-022-04880-y.
Self-interacting proteins (SIPs), in which two or more copies of a protein expressed by one gene can interact with each other, play a central role in the regulation of most living cells and cellular functions. Although high-throughput experimental techniques can provide large amounts of SIP data, experimental identification of SIPs is still time-consuming, costly, inefficient, and inherently prone to high false-positive rates. It is therefore increasingly important to develop efficient and accurate automatic approaches that supplement experimental methods and accelerate the prediction of SIPs from protein sequence information.
In this paper, we present a novel framework, termed GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for automatically predicting SIPs from the evolutionary information contained in protein primary sequences. First, we convert each protein sequence into a Position Specific Scoring Matrix (PSSM), which contains the sequence's evolutionary information, using the Position Specific Iterated BLAST (PSI-BLAST) tool. Second, we use an efficient feature extraction approach, GLCM, to extract salient and invariant feature vectors from the PSSM, and then apply a pre-processing step, the adaptive synthetic (ADASYN) technique, to balance the SIPs dataset and generate new feature vectors for classification. Finally, we employ an efficient and reliable WSRC model to identify SIPs from the known information on self-interacting and non-interacting proteins.
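The GLCM feature extraction step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the number of gray levels, the four pixel offsets, the three texture statistics, and every function name (`quantize`, `glcm`, `glcm_features`, `pssm_to_vector`) are assumptions chosen for illustration, since the abstract does not give these details. The ADASYN balancing and WSRC classification steps are not shown.

```python
import numpy as np

def quantize(pssm, levels=8):
    """Map PSSM scores linearly onto integer gray levels 0..levels-1 (assumed scheme)."""
    pssm = np.asarray(pssm, dtype=float)
    lo, hi = pssm.min(), pssm.max()
    if hi == lo:
        # A constant matrix carries no texture; put everything at level 0.
        return np.zeros(pssm.shape, dtype=int)
    scaled = (pssm - lo) / (hi - lo)
    return np.minimum((scaled * levels).astype(int), levels - 1)

def glcm(image, dr, dc, levels):
    """Normalized co-occurrence matrix of gray levels for the offset (dr, dc)."""
    rows, cols = image.shape
    # Restrict to positions whose neighbor at (r + dr, c + dc) is inside the image.
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    src = image[r0:r1, c0:c1]
    dst = image[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    counts = np.zeros((levels, levels), dtype=float)
    # Count every (source level, neighbor level) pair, including repeats.
    np.add.at(counts, (src.ravel(), dst.ravel()), 1.0)
    total = counts.sum()
    return counts / total if total > 0 else counts

def glcm_features(p):
    """Contrast, energy, and homogeneity of a normalized GLCM."""
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)
    energy = np.sum(p ** 2)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    return np.array([contrast, energy, homogeneity])

def pssm_to_vector(pssm, levels=8):
    """Concatenate GLCM statistics over four assumed neighbor directions."""
    image = quantize(pssm, levels)
    offsets = [(0, 1), (1, 1), (1, 0), (1, -1)]
    return np.concatenate(
        [glcm_features(glcm(image, dr, dc, levels)) for dr, dc in offsets]
    )
```

Treating an L x 20 PSSM as a gray-level image in this way yields a fixed-length vector (here 12 values) regardless of sequence length, which is what allows proteins of different lengths to be fed to a single classifier.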
Extensive experimental results show that the proposed approach achieves high prediction performance, with 98.10% accuracy on the yeast dataset and 91.51% accuracy on the human dataset, which suggests that the proposed model could be a useful tool for large-scale self-interacting protein prediction and other bioinformatics detection tasks in the future.