Zhang Yanping, Gao Ya, Ni Jianwei, Chen Pengcheng, Wang Xiaosheng
School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China.
Comb Chem High Throughput Screen. 2021;24(10):1746-1753. doi: 10.2174/1386207323999201117111738.
Based on protein sequence information, a simple and effective method was used to analyze protein sequence similarity and to predict DNA-binding proteins.
Low-complexity computational methods are needed to accurately infer protein structure, function, and evolution from the rapidly growing volume of available molecular biology data.
Given this growth, it is important to develop novel computational algorithms for analyzing and comparing protein sequences.
The numerical characteristics of a protein sequence were obtained from three sources: a global position representation based on Fermat spiral curves, a local position representation based on the normalized moments of inertia of those curves, and the composition of the 20 amino acids.
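As a rough illustration of this kind of feature extraction, the following Python sketch places residues on a Fermat spiral, computes a moment of inertia per amino acid type, and appends the amino acid composition. It is a minimal sketch under stated assumptions, not the FermatS construction itself: the abstract does not give the mapping from residues to the spiral, so the angular step `step`, the scale `a`, the per-residue grouping, and the definition of the normalized moment as a mean squared distance from the centroid are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each of the 20 amino acids in seq (20 values)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def fermat_points(seq, a=1.0, step=math.pi / 10):
    """Place residue i on the Fermat spiral r = a * sqrt(theta).

    ASSUMPTION: theta = (i + 1) * step; the paper's mapping may differ.
    Returns a dict {residue: [(x, y), ...]}.
    """
    points = {}
    for i, res in enumerate(seq):
        theta = (i + 1) * step
        r = a * math.sqrt(theta)
        xy = (r * math.cos(theta), r * math.sin(theta))
        points.setdefault(res, []).append(xy)
    return points


def normalized_moment_of_inertia(pts):
    """Mean squared distance of unit-mass points from their centroid.

    ASSUMPTION: 'normalized' is taken to mean division by the point count.
    """
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts) / n


def features(seq):
    """40-dimensional vector: 20 composition values + 20 moments."""
    pts = fermat_points(seq)
    moments = [
        normalized_moment_of_inertia(pts[a]) if a in pts else 0.0
        for a in AMINO_ACIDS
    ]
    return composition(seq) + moments
```

Two sequences could then be compared by a distance between their feature vectors (for example, Euclidean distance); which distance the authors used is not stated in the abstract.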
The method was applied to analyze the similarity/dissimilarity of nine ND5 proteins, and the results are consistent with biological evolution theory. Furthermore, we used logistic regression with 5-fold cross-validation to build a DNA-binding protein prediction model, which outperformed DNAbinder, iDNA-prot, DNA-prot and gDNA-prot by 0.0069-0.609 in F-measure and by 0.293-0.898 in Matthews correlation coefficient (MCC) on an unbalanced dataset.
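For reference, the two reported metrics and the 5-fold split can be sketched in plain Python. This is an illustrative sketch only: it assumes "F-measure" means the balanced F1 score, uses contiguous (unshuffled, unstratified) folds for simplicity, and leaves out the logistic regression model itself. The function names are hypothetical.

```python
import math


def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall.

    ASSUMPTION: the abstract's 'F-measure' is taken to be F1 (beta = 1).
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds of range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

MCC uses all four confusion-matrix cells, which is why it is often reported alongside F-measure when the positive and negative classes are unbalanced, as in the dataset described here.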
These results show that our method, named FermatS, is effective for comparing, recognizing, and predicting protein sequences.