mRCat: A Novel CatBoost Predictor for the Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features.

Affiliations

School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China.

Henan Provincial Key Laboratory of Data Intelligence for Food Safety, Zhengzhou University of Light Industry, Zhengzhou 450002, China.

Publication information

Biomolecules. 2024 Jun 27;14(7):767. doi: 10.3390/biom14070767.

Abstract

The subcellular localization of messenger RNAs (mRNAs) is tightly linked to gene regulation and protein synthesis, and it can inform disease diagnosis and drug development in biomedicine. Several computational methods have been proposed to predict mRNA subcellular localization, but their accuracy remains limited. In this study, we propose mRCat, a predictor based on the gradient boosting tree algorithm that classifies mRNAs as localized in the nucleus or in the cytoplasm. The predictor first uses large language models to extract hidden information from the sequences and then combines these representations with traditional sequence features to characterize each mRNA sequence. Finally, it employs CatBoost as the base classifier to predict the subcellular localization of mRNAs. On an independent test set, mRCat achieved an accuracy of 0.761, an F1 score of 0.710, an MCC of 0.511, and an AUROC of 0.751. These results indicate that our method is more accurate and robust than other state-of-the-art methods. We expect it to offer deep insights for biomolecular research.
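The abstract does not specify which sequence features or which language model are used, so the following is only a minimal sketch under stated assumptions. It illustrates the general idea of fusing a precomputed embedding with hand-crafted sequence features, and shows how accuracy, F1, and MCC are computed for a binary nucleus/cytoplasm task. The k-mer frequency features, the function names (`kmer_frequencies`, `fuse`, `binary_metrics`), and the label encoding (1 = nucleus, 0 = cytoplasm) are illustrative assumptions, not the authors' implementation.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=3):
    """Normalized k-mer counts over the ACGU alphabet (T is mapped to U).

    Assumption: k-mer frequencies stand in for the paper's
    "traditional sequence features", which the abstract does not list.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[m] / total for m in kmers]


def fuse(embedding, seq, k=3):
    """Concatenate a precomputed embedding with k-mer features.

    `embedding` is a hypothetical placeholder for a language-model
    representation of the sequence, obtained elsewhere.
    """
    return list(embedding) + kmer_frequencies(seq, k)


def binary_metrics(y_true, y_pred):
    """Accuracy, F1, and MCC from binary labels (assumed 1 = nucleus)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    features = fuse([0.12, -0.40], "AUGGCUAAGC", k=2)
    print(len(features))  # 2 embedding values + 16 dinucleotide frequencies
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a full pipeline, fused vectors like these would be passed to a gradient-boosted classifier such as CatBoost. AUROC is omitted here because it requires predicted scores rather than hard labels.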

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2962/11274395/61701f73464e/biomolecules-14-00767-g001.jpg
