IEEE/ACM Trans Comput Biol Bioinform. 2022 Mar-Apr;19(2):1235-1244. doi: 10.1109/TCBB.2020.3010975. Epub 2022 Apr 1.
Living organisms obtain the energy they need directly from cellular respiration. Electron storage and transport are accomplished during cellular respiration with the aid of electron transport chains, so identifying electron transport proteins is an important task. How accurately these proteins can be identified depends strongly on the choice of feature extraction method and machine learning algorithm. In this study, protein sequences were treated as natural language sentences composed of words. The proposed word embedding-based feature sets, built on word embedding modulation and protein motif frequencies, proved useful for feature selection. Five word embedding types and a variety of conjoint features were examined for this purpose. A support vector machine was then used for classification. In 5-fold cross-validation, the average accuracy, specificity, sensitivity, and Matthews correlation coefficient (MCC) all exceeded 0.95. In the independent test, these metrics were 96.82 percent, 97.16 percent, 95.76 percent, and 0.9, respectively. Compared with state-of-the-art predictors, the proposed method performed better on all metrics, indicating its effectiveness in identifying electron transport proteins. The study also offers insight into how well various word embeddings suit the analysis of the surveyed sequences.
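As a rough illustration of two ideas in the abstract, the sketch below splits a protein sequence into "words" and computes the four reported metrics from confusion-matrix counts. The abstract does not say how words are formed, so the overlapping 3-residue window is an assumption, and the function names `sequence_to_words` and `binary_metrics`, the example sequence, and the counts are hypothetical values chosen for illustration, not the study's data or code.

```python
import math


def sequence_to_words(sequence, k=3):
    """Split a protein sequence into overlapping k-residue 'words'.

    Assumption: overlapping windows with k=3; the abstract does not
    specify the word length or whether words overlap.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal count is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Toy inputs for illustration only (not taken from the paper).
words = sequence_to_words("MKTAYIAK")
metrics = binary_metrics(tp=45, tn=50, fp=2, fn=3)
print(words)
print(metrics)
```

In a full pipeline, each word would be mapped to an embedding vector and the pooled vectors passed to a classifier such as a support vector machine; that step is omitted here because the abstract does not describe its details.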