Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
Interdiscip Sci. 2019 Dec;11(4):628-635. doi: 10.1007/s12539-018-0313-4. Epub 2018 Dec 27.
Accurate gene prediction in metagenomic fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In our study, we introduce a convolutional neural network for metagenomics gene prediction (CNN-MGP), a program that predicts genes in metagenomic fragments directly from raw DNA sequences, without manual feature extraction or feature selection stages. CNN-MGP learns the characteristics of coding and non-coding regions and distinguishes coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets defined by pre-defined GC content ranges. We extract the ORFs from each fragment, encode them numerically, and feed them into the CNN model that matches the GC content of the fragment. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm selects the final gene list. Overall, CNN-MGP achieves 91% accuracy on the testing dataset. CNN-MGP demonstrates that deep learning can predict genes in metagenomic fragments, with an accuracy higher than or comparable to that of state-of-the-art gene-prediction programs that use pre-defined features.
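The pipeline described in the abstract (ORF extraction, numerical encoding, model selection by GC content, greedy selection) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CNN-MGP implementation: the abstract does not give the GC ranges, the encoding scheme, or the details of the greedy step, so the equal-width GC bins, the one-hot encoding, the non-overlap rule, the start-codon set, and all function names here are hypothetical. The sketch also scans only the forward strand for complete ORFs, and each "model" is a placeholder callable standing in for a trained CNN.

```python
from typing import Callable, List, Sequence, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

Orf = Tuple[int, int]                      # half-open interval [start, end)
Model = Callable[[List[List[int]]], float]  # stand-in for a trained CNN


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def gc_bin(gc: float, n_bins: int = 10) -> int:
    """Map a GC fraction to one of n_bins equal-width ranges (assumed binning)."""
    return min(int(gc * n_bins), n_bins - 1)


def extract_orfs(fragment: str, min_len: int = 60) -> List[Orf]:
    """Find complete ORFs (start codon to stop codon) in the 3 forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(fragment) - 2, 3):
            codon = fragment[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def one_hot(seq: str) -> List[List[int]]:
    """Encode each base as a 4-element vector; unknown bases become all zeros."""
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq]


def greedy_select(scored: List[Tuple[Orf, float]],
                  threshold: float = 0.5) -> List[Orf]:
    """Accept ORFs in order of decreasing probability, skipping any overlap."""
    chosen: List[Orf] = []
    for (start, end), prob in sorted(scored, key=lambda x: x[1], reverse=True):
        if prob < threshold:
            break
        if all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def predict_genes(fragment: str, models: Sequence[Model],
                  min_len: int = 60, threshold: float = 0.5) -> List[Orf]:
    """Score each ORF with the model for the fragment's GC bin, then select."""
    model = models[gc_bin(gc_content(fragment), len(models))]
    scored = [(orf, model(one_hot(fragment[orf[0]:orf[1]])))
              for orf in extract_orfs(fragment, min_len)]
    return greedy_select(scored, threshold)
```

In this sketch, `models` would be a list of 10 scoring functions, one per GC range. A real gene finder would also scan the reverse strand, handle ORFs truncated by the fragment boundaries, and likely tolerate small overlaps between neighbouring genes.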