Liu Mengya, Sun Zhan-Li, Zeng Zhigang, Lam Kin-Man
School of Computer Science and Technology, Anhui University, Hefei 230601, China.
School of Electrical Engineering and Automation, Anhui University, Hefei 230601, China.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae647.
RNA N$^{6}$-methyladenosine (m$^{6}$A) is a critical epigenetic modification closely related to rice growth, development, and stress response. Accurate identification of m$^{6}$A sites, which bears directly on precision rice breeding and improvement, is fundamental to revealing phenotype regulation and its molecular mechanisms. Because rice m$^{6}$A sequences vary in length, they are usually prepared for model input by label encoding and padding to the maximum length. Although this retains the complete sequence information, it results in sparse information and invalid padding, which reduce feature extraction accuracy. Moreover, existing rice-specific m$^{6}$A prediction methods are still at an early stage. To address these issues, we develop a new end-to-end deep learning framework, MFDm$^{6}$ARice, for predicting rice m$^{6}$A sites. In particular, to alleviate sparseness, we construct a multi-kernel feature fusion module that mines essential information in max-length padded sequences with a multi-kernel feature extraction function and transfers information effectively through a global-local dynamic fusion function. In addition, considering the complexity and computational cost of the high-dimensional features caused by invalid padding, we design a downsampling residual feature embedding module that optimizes feature space compression to achieve accurate feature expression and efficient computation. Experiments show that MFDm$^{6}$ARice outperforms comparison methods in cross-validation and on same- and cross-species independent test sets, demonstrating good robustness and generalization. Its application to maize m$^{6}$A indicates the scalability of MFDm$^{6}$ARice. Further investigations show that combining different kernel features, attending to both global channel and local spatial information, and employing reasonable downsampling and residual connections can improve feature representation and extraction, ensure effective information transfer, and significantly enhance model performance.
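The input preparation described in the abstract (label encoding followed by padding to the maximum length) can be illustrated with a minimal sketch. The integer mapping, the use of 0 as the padding label, and the function names below are assumptions chosen for illustration; the abstract does not specify the encoding that MFDm$^{6}$ARice uses. The sketch also computes the fraction of padded positions, which is one simple way to see the sparsity the abstract refers to.

```python
from typing import List

# Hypothetical label map (assumption): one integer per nucleotide,
# with 0 reserved for padding positions.
LABELS = {"A": 1, "C": 2, "G": 3, "U": 4}
PAD = 0


def label_encode(seq: str) -> List[int]:
    """Map each nucleotide to its integer label (T is treated as U)."""
    return [LABELS[base] for base in seq.upper().replace("T", "U")]


def pad_to_max(seqs: List[str]) -> List[List[int]]:
    """Label-encode every sequence and right-pad to the longest length."""
    encoded = [label_encode(s) for s in seqs]
    max_len = max(len(e) for e in encoded)
    return [e + [PAD] * (max_len - len(e)) for e in encoded]


def padding_fraction(padded: List[List[int]]) -> float:
    """Share of positions that carry no sequence information."""
    total = sum(len(row) for row in padded)
    pads = sum(row.count(PAD) for row in padded)
    return pads / total


# Toy sequences of different lengths (not taken from any dataset).
seqs = ["GGACU", "AC", "UGGACUAA"]
padded = pad_to_max(seqs)
for row in padded:
    print(row)
print(f"padding fraction: {padding_fraction(padded):.3f}")
```

In this toy input, 9 of the 24 positions are padding, so more than a third of the matrix carries no sequence information. This is the kind of sparsity that the multi-kernel feature fusion and downsampling residual modules are described as addressing.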