College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China.
Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China.
Bioinformatics. 2022 Feb 7;38(5):1223-1230. doi: 10.1093/bioinformatics/btab811.
Multi-label (ML) protein subcellular localization (SCL) is an indispensable way to study protein function. It identifies where in a cell a given protein (such as the human transmembrane protein that promotes invasion by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)) or expression product is located, which can provide a reference for the clinical treatment of diseases such as coronavirus disease 2019 (COVID-19).
This article proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information: pseudo amino acid composition, encoding based on grouped weight, gene ontology, multi-scale continuous and discontinuous, residue probing transformation and evolutionary distance transformation. Next, the ML informed latent semantic indexing method is used to avoid interference from redundant information. Finally, ML learning with feature-induced labeling information enrichment is adopted to predict ML protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset serve as the test sets. Under leave-one-out cross validation, the overall actual accuracies on the first four datasets are 99.23%, 93.82%, 93.24% and 96.72%, respectively. Notably, the overall actual accuracy of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has clear advantages in predicting the SCL of ML proteins, which provides new ideas for further research on the SCL of ML proteins.
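To make the evaluation protocol concrete, the following minimal Python sketch shows how an overall actual accuracy can be computed under leave-one-out cross validation. It assumes the usual exact-match definition of the metric (a protein counts as correct only if its predicted set of locations equals the true set). The 1-nearest-neighbor predictor, the function names and the toy data are hypothetical stand-ins for illustration only; they are not the ML-locMLFE classifier, its features or its datasets.

```python
from typing import Callable, FrozenSet, List, Sequence

LabelSet = FrozenSet[str]
Vector = Sequence[float]


def overall_actual_accuracy(true_sets: Sequence[LabelSet],
                            pred_sets: Sequence[LabelSet]) -> float:
    """Fraction of proteins whose predicted location set equals the true set exactly."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true_sets and pred_sets must have the same length")
    if not true_sets:
        raise ValueError("at least one sample is required")
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return exact / len(true_sets)


def nearest_neighbor_predict(train_x: List[Vector],
                             train_y: List[LabelSet],
                             query: Vector) -> LabelSet:
    """Stand-in classifier (NOT ML-locMLFE): copy the label set of the closest training protein."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return train_y[best]


def leave_one_out_predictions(
    features: Sequence[Vector],
    labels: Sequence[LabelSet],
    predict: Callable[[List[Vector], List[LabelSet], Vector], LabelSet] = nearest_neighbor_predict,
) -> List[LabelSet]:
    """Hold out each protein in turn, train on the rest, and predict the held-out one."""
    feats, labs = list(features), list(labels)
    preds = []
    for i in range(len(feats)):
        train_x = feats[:i] + feats[i + 1:]
        train_y = labs[:i] + labs[i + 1:]
        preds.append(predict(train_x, train_y, feats[i]))
    return preds


# Toy data (hypothetical): 2-D feature vectors and multi-label location sets.
features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0]]
labels = [
    frozenset({"membrane"}),
    frozenset({"membrane"}),
    frozenset({"cytoplasm", "membrane"}),
    frozenset({"cytoplasm", "membrane"}),
    frozenset({"extracellular"}),
]

preds = leave_one_out_predictions(features, labels)
oaa = overall_actual_accuracy(labels, preds)
print(f"Overall actual accuracy: {oaa:.2%}")
```

Because the metric requires an exact match of the whole label set, a prediction that recovers only one of two true locations is scored as wrong, which makes it stricter than per-label accuracy.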
The source code and datasets are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/.
Supplementary data are available at Bioinformatics online.