LMFE: A Novel Method for Predicting Plant LncRNA Based on Multi-Feature Fusion and Ensemble Learning.

Authors

Zhang Hongwei, Shi Yan, Wang Yapeng, Yang Xu, Li Kefeng, Im Sio-Kei, Han Yu

Affiliations

Faculty of Applied Sciences, Macao Polytechnic University, Macau SAR 999074, China.

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China.

Publication information

Genes (Basel). 2025 Mar 31;16(4):424. doi: 10.3390/genes16040424.

Abstract

Background/Objectives: Long non-coding RNAs (lncRNAs) play a crucial regulatory role in plant trait expression and disease management, making their accurate prediction a key research focus for guiding biological experiments. While extensive studies have been conducted on animals and humans, plant lncRNA research remains relatively limited because of challenges such as data scarcity and genomic complexity. This study aims to bridge this gap by developing an effective computational method for predicting plant lncRNAs, specifically by classifying transcribed RNA sequences as lncRNAs or mRNAs using multi-feature analysis.

Methods: We propose the lncRNA multi-feature-fusion ensemble learning (LMFE) approach, a novel method that integrates 100-dimensional features drawn from RNA biological properties-based, sequence-based, and structure-based features, and employs the XGBoost ensemble learning algorithm for prediction. To address unbalanced datasets, we implemented the synthetic minority oversampling technique (SMOTE). LMFE was validated on benchmark datasets, cross-species datasets, unbalanced datasets, and independent datasets.

Results: LMFE achieved an accuracy of 99.42%, an F1 of 0.99, and an MCC of 0.98 on the benchmark dataset, with robust cross-species performance (accuracy ranging from 89.30% to 99.81%). On unbalanced datasets, LMFE attained an average accuracy of 99.41%, a 12.29-percentage-point improvement over traditional methods without SMOTE (average accuracy of 87.12%). Compared with state-of-the-art methods such as CPC2 and PLEKv2, LMFE consistently outperformed them across multiple metrics on independent datasets (accuracy ranging from 97.33% to 99.21%), and redundant features had minimal impact on performance.

Conclusions: LMFE provides a highly accurate and generalizable solution for plant lncRNA prediction, outperforming existing methods through multi-feature fusion and ensemble learning while demonstrating robustness to redundant features. Despite its effectiveness, variations in performance across species highlight the need for future improvements in handling diverse plant genomes. This method represents a valuable tool for advancing plant lncRNA research and guiding biological experiments.
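The abstract names SMOTE as the remedy for class imbalance but does not describe how it works. As a rough illustration only, the sketch below shows the core idea of SMOTE in plain Python: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the parameters `k` and `seed`, and the toy 2-D feature vectors are all assumptions made for this example; this is not the authors' implementation, and a real pipeline would more likely use an established library on the full 100-dimensional feature vectors.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; not taken from the LMFE paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for the minority class (made-up values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_size = 10  # assumed size of the majority class

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class rather than being exact duplicates, which is the main design difference from simple random oversampling.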

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f714/12026654/1254a89f9bdd/genes-16-00424-g001.jpg
