isGPT: An optimized model to identify sub-Golgi protein types using SVM and Random Forest based feature selection.

Affiliations

Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.

Indiana University, Bloomington, USA.

Publication information

Artif Intell Med. 2018 Jan;84:90-100. doi: 10.1016/j.artmed.2017.11.003. Epub 2017 Nov 26.

Abstract

The Golgi Apparatus (GA) is a key organelle for protein synthesis within the eukaryotic cell. Its main task is to modify and sort proteins for transport throughout the cell. Proteins enter the GA on the side facing the Endoplasmic Reticulum (ER), the cis side, and depart on the opposite side, the trans side. This gives two types of GA proteins: cis-Golgi proteins and trans-Golgi proteins. Dysfunction of GA proteins can result in congenital glycosylation disorders and other complications that may lead to neurodegenerative and inherited diseases such as diabetes, cancer and cystic fibrosis. Accurate classification of GA proteins may therefore contribute to drug development. In this paper, we build a new computational model that both introduces simple ways to extract features from protein sequences and optimizes the classification of trans-Golgi and cis-Golgi proteins. After feature extraction, we employ a Random Forest (RF) model to rank the features by the importance scores it produces. After selecting the top-ranked features, we apply a Support Vector Machine (SVM) to classify the sub-Golgi proteins. We trained a regression model as well as a classification model and found the former to be superior. The model shows improved performance over all previous methods. Because the benchmark dataset is significantly imbalanced, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to balance it and conducted experiments on both the original and the balanced versions. Our method, identification of sub-Golgi Protein Types (isGPT), achieves accuracies of 95.4%, 95.9% and 95.3% on the 10-fold cross-validation test, the jackknife test and the independent test, respectively. According to different performance metrics, isGPT performs better than state-of-the-art techniques.
The source code of isGPT, along with the relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/isGPT.
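
The abstract mentions SMOTE as the way the imbalanced benchmark was balanced. The sketch below illustrates the core idea of SMOTE only, under stated assumptions: it is not taken from the isGPT repository, the function name `smote`, its parameters, and the toy feature vectors are all hypothetical, and it uses a simplified variant (random base sample, one of its k nearest minority neighbours, linear interpolation) written with the Python standard library.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified SMOTE (hypothetical helper): pick a minority sample, pick one
    of its k nearest minority neighbours, and return a random point on the
    line segment between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up feature vectors (not from the isGPT dataset).
majority = [[0.9, 0.8], [1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [1.1, 1.2], [1.0, 0.7]]
minority = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]

# Generate just enough synthetic minority points to match the majority size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. In a pipeline like the one the abstract describes, oversampling of this kind would normally be applied to training data only, so that evaluation folds contain no synthetic samples; whether isGPT does this is not stated in the abstract.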
