Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
J Theor Biol. 2019 Feb 21;463:99-109. doi: 10.1016/j.jtbi.2018.12.017. Epub 2018 Dec 15.
Automatic identification of protein subcellular localization has attracted considerable attention over the last few decades. Knowledge of subcellular localization is useful in the diagnosis of various diseases as well as in drug development. Golgi is a vital type of protein that provides a means of transport for several other proteins destined for the lysosome, the plasma membrane, and secretion. Cis-Golgi and trans-Golgi are the two ends of the Golgi protein, meant for the reception and transmission of various substances. Dysfunction in Golgi proteins may lead to different types of diseases, especially inheritable and neurodegenerative disorders. Given the significance of Golgi proteins, identifying them correctly is essential. In this paper, a novel, high-throughput computational model is proposed that can identify subGolgi proteins precisely. Discrete and evolutionary feature extraction schemes are applied to capture the salient, noise-free, and relevant information in protein sequences. Unfortunately, the publicly available benchmark dataset is highly imbalanced: trans-Golgi sequences constitute 72% of the whole dataset, which introduces bias and redundancy and hinders generalization of the learned hypothesis. To overcome the limitations of imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the number of instances across the classes of the dataset. In addition, a condensed feature space is formed by fusing the high-ranked features of eleven different feature selection techniques. The high-ranked features are selected through a majority voting algorithm; consequently, the feature space is reduced by 85%. The experimental results demonstrate that the kNN classifier obtained promising results in combination with the hybrid feature space. It yielded an accuracy of 98% in the jackknife cross-validation test, 94% on independent data, and 96% in the 10-fold cross-validation test.
It is ascertained that the proposed model is reliable, consistent and serves as a valuable tool for the research community.
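The two preprocessing ideas named in the abstract, SMOTE-style oversampling and majority voting over feature rankings, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the neighbour count, the eleven selection techniques, or the voting threshold, so the function names (`smote_like_oversample`, `majority_vote_features`), the parameters `k`, `top_n`, and `min_votes`, and the toy data below are all hypothetical.

```python
import random
from collections import Counter


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    k and seed are illustrative assumptions, not values from the paper."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def majority_vote_features(rankings, top_n, min_votes):
    """Keep the feature indices that appear in the top_n of at least
    min_votes rankings (one ranking per feature selection technique)."""
    votes = Counter(f for ranking in rankings for f in ranking[:top_n])
    return sorted(f for f, count in votes.items() if count >= min_votes)


# Toy minority class: four 2-D samples (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6)

# Three hypothetical rankings of feature indices, best feature first.
rankings = [[0, 2, 5, 1], [2, 0, 3, 5], [5, 2, 4, 0]]
selected = majority_vote_features(rankings, top_n=2, min_votes=2)
```

Because each synthetic sample lies on a segment between two existing minority samples, it stays inside the region the minority class already occupies. In practice the reduced feature set and the balanced data would then be passed to a kNN classifier and evaluated with jackknife (leave-one-out) cross-validation, as the abstract describes.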