Prediction of diabetic protein markers based on an ensemble method.

Author information

Qu Kaiyang, Zou Quan, Shi Hua

Affiliations

School of Computer and Software, Nanyang Institute of Technology, 473004 Nanyang, Henan, China.

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China.

Publication information

Front Biosci (Landmark Ed). 2021 Jul 30;26(7):207-221. doi: 10.52586/4935.

DOI:10.52586/4935

PMID:34340268

Abstract

A diabetic protein marker is a protein that is closely related to diabetes. Such proteins play an important role in the prevention and diagnosis of diabetes, so an effective method for predicting them is needed. In this study, we propose using ensemble methods to predict diabetic protein markers.

The ensemble method has two parts. First, we combine feature extraction methods to obtain mixed features. Second, we classify the proteins using ensemble classifiers. We use three feature extraction methods: composition and physicochemical features (abbreviated as 188D), adaptive skip-gram features (abbreviated as 400D), and g-gap features (abbreviated as 670D). Six traditional classifiers are used: decision tree, naive Bayes, logistic regression, PART, k-nearest neighbor, and kernel logistic regression. The ensemble classifiers are random forest and vote. First, we classify protein sequences using the feature extraction methods and the traditional classifiers. Then, we compare the combined feature extraction methods with the single methods. Next, we compare the ensemble classifiers with the traditional classifiers. Finally, we use the ensemble classifiers together with the combined feature extraction methods to predict samples.

The results indicate that ensemble methods outperform single methods, whether the ensemble is formed at the classifier level or by combining feature extraction methods. The best performance among all methods is obtained with a random forest classifier and the 588D features (188D and 400D combined). The second-best combined feature set is 1285D (all three methods combined), also with random forest. Among the single feature extraction methods, 188D is the best and g-gap is the worst.

According to the results, the ensemble method, whether applied as combined feature extraction or as an ensemble classifier, performs better than the single method. We anticipate that ensemble methods will be a useful tool for identifying diabetic protein markers in a cost-effective manner.
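
The two parts of the ensemble method described above, concatenating the outputs of several feature extractors and combining the predictions of several classifiers, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy extractors (a 20-dimensional amino acid composition and a 400-dimensional g-gap dipeptide frequency) are simplified stand-ins and not the 188D, 400D, or 670D extractors of the study, the function names are hypothetical, and majority voting is only one possible rule for a vote classifier.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def composition_features(seq):
    """Toy 20-D extractor: fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    total = len(seq) or 1  # avoid division by zero on an empty sequence
    return [counts[a] / total for a in AMINO_ACIDS]


def g_gap_features(seq, g=1):
    """Toy 400-D extractor: frequency of residue pairs with g residues between them."""
    counts = Counter(seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1))
    total = sum(counts.values()) or 1  # short sequences yield no pairs
    return [counts[p] / total for p in DIPEPTIDES]


def combine_features(seq, extractors):
    """Concatenate the vectors of several extractors into one mixed feature vector."""
    combined = []
    for extract in extractors:
        combined.extend(extract(seq))
    return combined


def majority_vote(classifiers, features):
    """Return the label predicted by most classifiers.

    Each classifier is any callable mapping a feature vector to a label.
    On a tie, the label that was voted for first wins.
    """
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence, not taken from the study
    x = combine_features(seq, [composition_features, g_gap_features])
    print(len(x))  # combined dimensionality: 20 + 400

    # Hypothetical hand-written rules standing in for trained classifiers.
    rules = [
        lambda v: int(v[0] > 0.05),
        lambda v: int(v[8] > 0.05),
        lambda v: int(sum(v[20:]) > 0.5),
    ]
    print(majority_vote(rules, x))
```

In the same way that 188D and 400D add up to 588D in the study, the combined vector here has 20 + 400 = 420 entries. In practice the hand-written rules would be replaced by trained models, for example from a machine learning toolkit.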
