Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil.
Department of Applied Microbial Ecology, Helmholtz Centre for Environmental Research - UFZ GmbH, Leipzig, Saxony, Germany.
RNA Biol. 2024 Jan;21(1):1-12. doi: 10.1080/15476286.2024.2329451. Epub 2024 Mar 25.
The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed for distinguishing ncRNAs, these often necessitate extensive feature engineering. Recently, deep learning algorithms have provided advancements in ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework integrating convolutional neural networks (CNN) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features for enhanced accuracy. This framework employs a combination of k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. Extracted features, when embedded into the deep network, enable optimal utilization of spatial and sequential nuances of ncRNA sequences. Using benchmark datasets and real-world RNA samples from bacterial organisms, we evaluated the performance of BioDeepFuse. Results exhibited high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into the roles of ncRNAs in cellular processes and disease manifestations. Although BioDeepFuse was originally applied to bacterial organisms, the methodologies and techniques integrated into the framework could potentially make it effective in broader domains.
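The two sequence encodings named in the abstract can be illustrated with a minimal sketch. This is not the BioDeepFuse implementation: the function names, the `ACGU` alphabet, the mapping of T to U, the reservation of index 0 for padding or unknown k-mers, and the fixed `max_len` are all assumptions made for illustration. A k-mer dictionary encoding is taken here to mean mapping each overlapping k-mer to an integer index, and a k-mer one-hot encoding to mean expanding each index into a binary vector of length 4^k.

```python
from itertools import product

ALPHABET = "ACGU"  # assumption: RNA alphabet; DNA-style T is mapped to U


def kmers(seq, k):
    """Return all overlapping k-mers of seq (stride 1)."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k):
    """Map every possible k-mer to an integer, starting at 1.

    Index 0 is reserved (an assumption) for padding and for k-mers
    containing symbols outside ALPHABET, such as N.
    """
    return {"".join(p): i + 1 for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_dictionary_encode(seq, k, max_len):
    """Encode seq as a fixed-length list of k-mer indices."""
    vocab = build_vocab(k)
    ids = [vocab.get(km, 0) for km in kmers(seq, k)][:max_len]
    return ids + [0] * (max_len - len(ids))  # right-pad with zeros


def kmer_one_hot_encode(seq, k, max_len):
    """Encode seq as a max_len x 4**k binary matrix (list of lists).

    Padding and unknown k-mers become all-zero rows.
    """
    size = len(ALPHABET) ** k
    rows = []
    for idx in kmer_dictionary_encode(seq, k, max_len):
        row = [0] * size
        if idx > 0:
            row[idx - 1] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(kmer_dictionary_encode("ACGU", 2, 4))
    print(kmer_one_hot_encode("ACGU", 1, 4))
```

In a hybrid setup of the kind the abstract describes, integer indices like these would typically feed an embedding layer and the one-hot matrix a convolutional layer, with handcrafted feature vectors joined to the learned representation further along; how BioDeepFuse wires these together is not specified in the abstract.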