Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei, 230001, China.
College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China.
BMC Bioinformatics. 2022 May 5;23(1):162. doi: 10.1186/s12859-022-04702-1.
Orphan genes play an important role in the responses of many species to environmental stresses, and their identification is a critical step toward understanding their biological functions. Moso bamboo has high ecological, economic and cultural value, and studies have shown that its growth is influenced by various stresses. Several traditional methods for identifying orphan genes are time-consuming and inefficient. Hence, the development of efficient and highly accurate computational methods for predicting orphan genes is of great significance.
In this paper, we propose a novel deep learning model (CNN + Transformer) for identifying orphan genes in moso bamboo. It combines a convolutional neural network with a transformer neural network to capture k-mer amino acid features and the relationships between k-mers in protein sequences. The experimental results show that the average balanced accuracy of CNN + Transformer on the moso bamboo dataset can reach 0.875, and the average Matthews Correlation Coefficient (MCC) can reach 0.471. On the same testing set, the Balanced Accuracy (BA), Geometric Mean (GM), Bookmaker Informedness (BM), and MCC values of the recurrent neural network, long short-term memory, gated recurrent unit, and transformer models are all lower than those of CNN + Transformer, which indicates that the model has a strong capacity for orphan gene (OG) identification in moso bamboo.
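The four reported metrics are all designed for imbalanced binary classification and can be derived from a single confusion matrix. The following is a minimal sketch using the standard textbook definitions of BA, GM, BM, and MCC; it is assumed, not confirmed by the abstract, that the paper uses the same formulas. The names `ConfusionCounts` and `imbalance_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical helper, not from the paper)."""
    tp: int  # orphan genes predicted as orphan
    fn: int  # orphan genes predicted as non-orphan
    tn: int  # non-orphan genes predicted as non-orphan
    fp: int  # non-orphan genes predicted as orphan


def imbalance_metrics(c: ConfusionCounts) -> dict:
    """Return BA, GM, BM and MCC using their standard definitions."""
    # Sensitivity (true positive rate) and specificity (true negative rate).
    tpr = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    tnr = c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0

    # MCC denominator; by convention MCC is set to 0 when it vanishes.
    denom = math.sqrt(
        (c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn)
    )
    mcc = (c.tp * c.tn - c.fp * c.fn) / denom if denom else 0.0

    return {
        "BA": (tpr + tnr) / 2,        # balanced accuracy
        "GM": math.sqrt(tpr * tnr),   # geometric mean of TPR and TNR
        "BM": tpr + tnr - 1,          # bookmaker informedness
        "MCC": mcc,                   # Matthews correlation coefficient
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not the paper's results.
    example = ConfusionCounts(tp=80, fn=20, tn=900, fp=100)
    for name, value in imbalance_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Because BA, GM, and BM depend only on the per-class rates, they are not inflated by a large majority class, whereas MCC also reflects precision. This may help explain how a high BA can coexist with a moderate MCC when positives are rare.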
The CNN + Transformer model is feasible and produces credible predictive results. It may also provide a valuable reference for other related research. To our knowledge, this is the first model to adopt deep learning techniques for identifying orphan genes in plants.