Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, China.
School of Public Health (Shenzhen), Sun Yat-Sen University, Guangzhou, 510006, China.
BMC Bioinformatics. 2021 Jun 23;22(Suppl 3):340. doi: 10.1186/s12859-021-04251-z.
Antifreeze proteins (AFPs) are a group of proteins that inhibit the growth of ice crystals in body fluids and thus improve an organism's resistance to freezing. They are vital to the survival of living organisms in extremely cold environments. However, little research has been done on sequence feature extraction and selection for antifreeze protein classification, even though this is of great significance for structure and function prediction.
In this paper, we propose a feature representation, weighted generalized dipeptide composition (W-GDipC), and a two-stage, multi-regression ensemble feature selection method (LRMR-Ri) for predicting antifreeze proteins. In the first stage of LRMR-Ri, four feature selection algorithms (Lasso regression, Ridge regression, maximal information coefficient (MIC) and Relief) each select a feature set. If the four sets share a common feature subset, that subset is taken as the optimal subset; otherwise, in the second stage, Ridge regression selects the optimal subset from the set pooled from the four sets. LRMR-Ri combined with W-GDipC was evaluated on an antifreeze protein dataset (binary classification) and on a membrane protein dataset (multiclass classification). Experimental results show that the method performs well with support vector machine (SVM), decision tree (DT) and stochastic gradient descent (SGD) classifiers. On the antifreeze protein dataset with the SVM classifier, the ACC, RE and MCC values of LRMR-Ri with W-GDipC reached 95.56%, 97.06% and 0.9105, respectively, much higher than those of each single method (Lasso, Ridge, MIC and Relief); ACC was nearly 13% higher than with Lasso alone.
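The two-stage selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that the "common feature subset" is the intersection of the four stage-one sets, that the pooled set is their union, and that the second-stage Ridge step keeps the `k` pooled features with the largest absolute coefficients. The function names, `k`, and `alpha` are hypothetical, and the four stage-one sets are passed in as given rather than computed by Lasso, Ridge, MIC and Relief.

```python
import numpy as np


def ridge_coefficients(X, y, alpha=1.0):
    """Closed-form ridge regression on standardized features."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    yc = y - y.mean()
    n_features = Xs.shape[1]
    # Solve (X^T X + alpha * I) w = X^T y
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(n_features), Xs.T @ yc)


def two_stage_select(X, y, stage_one_sets, k, alpha=1.0):
    """Combine stage-one feature sets (sets of column indices).

    Assumption: 'common subset' = intersection, 'pooled set' = union,
    and Ridge ranks pooled features by absolute coefficient size.
    """
    common = set.intersection(*stage_one_sets)
    if common:
        # A non-empty common subset is taken as the final selection.
        return sorted(common)
    # Stage two: rank the pooled features with ridge regression.
    pooled = sorted(set.union(*stage_one_sets))
    coefs = ridge_coefficients(X[:, pooled], y, alpha)
    top = np.argsort(-np.abs(coefs))[:k]
    return sorted(pooled[i] for i in top)


# Synthetic data (illustrative only): columns 1 and 4 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

# Case 1: the four hypothetical stage-one sets overlap.
overlapping = [{0, 1, 4}, {1, 4}, {1, 2, 4}, {1, 4, 5}]
selected_common = two_stage_select(X, y, overlapping, k=2)

# Case 2: no feature is shared by all four sets, so Ridge decides.
disjoint = [{0, 1}, {2, 4}, {3}, {5}]
selected_ridge = two_stage_select(X, y, disjoint, k=2)
```

In this sketch the selected indices would then be used to slice the feature matrix before training a classifier such as an SVM. How the subset size is chosen in the second stage is not stated in the abstract, so `k` here is purely a placeholder.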
The experimental results show that the proposed LRMR-Ri and W-GDipC method can significantly improve the accuracy of antifreeze protein prediction compared with other similar single feature methods. In addition, our method also achieved good results in the classification and prediction of membrane proteins, which supports its broad reliability to a certain extent.