Department of Computer Science, Faculty of Mathematical and Computer Science, University of Gezira, Wad Madani, Sudan.
Preparatory Year Department, Al-Ghad International Colleges for Applied Medical Sciences, Riyadh, Saudi Arabia.
PLoS One. 2022 Oct 6;17(10):e0275195. doi: 10.1371/journal.pone.0275195. eCollection 2022.
Plasmodium falciparum is a parasitic protozoan that causes malaria, a deadly disease. Accurate identification of malaria parasite mitochondrial proteins is therefore essential for understanding their functions and identifying novel drug targets. Several adaptive statistical techniques have been devised for classifying protein sequences. Despite significant gains, prediction performance is still constrained by the lack of appropriate feature descriptors and learning strategies in current systems. Moreover, reliable ground-truth data are important for Artificial Intelligence (AI)-based models, yet such data are scarce in the literature. In this work, we therefore propose a novel hybrid network that combines a 1D Convolutional Neural Network (CNN) with a Bidirectional Gated Recurrent Unit (BGRU) to classify malaria parasite mitochondrial proteins. Furthermore, we curate sequence data collected from the National Center for Biotechnology Information (NCBI) and UniProtKB/Swiss-Prot protein databanks to prepare a dataset that the research community can use to evaluate AI-based algorithms. After preprocessing the collected data we obtain 4204 cases, and we denote this set of proteins as PF4204. Finally, we conduct an ablation study on several conventional and deep models using PF4204 and the benchmark PF2095 dataset. The proposed model, CNN-BGRU, obtains accuracy values of 0.9096 and 0.9857 on the PF4204 and PF2095 datasets, respectively. In addition, CNN-BGRU is compared with state-of-the-art methods, and the results illustrate that it can extract robust features and identify proteins accurately.
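The abstract does not state how protein sequences are turned into network input or how accuracy is computed, so the following is only a minimal sketch under stated assumptions. It assumes a common convention for sequence models: each of the 20 standard amino acids is mapped to an integer index, index 0 is reserved for padding and non-standard residues, and every sequence is truncated or padded to a fixed length before it reaches a 1D CNN. The names `AMINO_ACIDS`, `encode_sequence`, and `accuracy` are hypothetical and are not taken from the paper.

```python
# Assumption: the 20 standard amino acids are indexed 1..20 in this order;
# index 0 is reserved for padding and for non-standard residues (e.g. X).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_len):
    """Map a protein sequence to a fixed-length list of integer indices."""
    indices = [AA_TO_INDEX.get(residue, 0) for residue in sequence.upper()]
    indices = indices[:max_len]                       # truncate long sequences
    return indices + [0] * (max_len - len(indices))   # right-pad short ones


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


if __name__ == "__main__":
    print(encode_sequence("MKV", 5))             # [11, 9, 18, 0, 0]
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

In a pipeline of this kind, the integer lists would typically feed an embedding layer ahead of the convolutional and recurrent layers, and `accuracy` corresponds to the kind of metric reported for PF4204 and PF2095; both points are illustrative assumptions rather than details confirmed by the abstract.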