Hou Ruiyan, Wang Lida, Wu Yi-Jun
Laboratory of Molecular Toxicology, State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
College of Life Science, University of Chinese Academy of Sciences, Beijing, China.
Front Genet. 2020 Mar 25;11:156. doi: 10.3389/fgene.2020.00156. eCollection 2020.
ATP-binding cassette (ABC) proteins play important roles in a wide variety of species. These proteins are involved in absorbing nutrients, exporting toxic substances, and regulating potassium channels, and they contribute to drug resistance in cancer cells. The identification of ABC transporters is therefore an urgent task. The present study used 188D as the feature extraction method, which is based on sequence information and physicochemical properties. We visualized the extracted features with t-Distributed Stochastic Neighbor Embedding (t-SNE), and the visualization indicated that samples can be separated on the basis of the 188D features. Random forest (RF), an efficient classifier, was then used to identify the proteins. Under 10-fold cross-validation on the training set, the proposed model achieved an average accuracy of 89.54% across the 10 folds, with a specificity of 0.87, a sensitivity of 0.92, and a Matthews correlation coefficient (MCC) of 0.79. On the testing set, the accuracy was 89%. These results suggest that the model combining 188D with RF is an effective tool for identifying ABC transporters.
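For readers less familiar with the reported metrics, the following minimal Python sketch shows how accuracy, sensitivity, specificity, and MCC are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the example counts are illustrative assumptions: the counts are hypothetical, chosen for a balanced set of 100 positives and 100 negatives so that sensitivity and specificity match the reported 0.92 and 0.87, and they are not the study's actual confusion matrix.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positives (e.g. ABC transporters) predicted correctly/incorrectly.
    tn/fp: negatives predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity (recall): fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (an assumption, not data from the study).
example = binary_metrics(tp=92, fn=8, tn=87, fp=13)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the MCC rounds to 0.79, which shows that the reported sensitivity, specificity, and MCC are mutually consistent for a roughly balanced data set; it does not reproduce the paper's evaluation.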