LncSL: A Novel Stacked Ensemble Computing Tool for Subcellular Localization of lncRNA by Amino Acid-Enhanced Features and Two-Stage Automated Selection Strategy.

Authors

Zhu Lun, Chen Hong, Yang Sen

Affiliations

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou 213164, China.

Publication information

Int J Mol Sci. 2024 Dec 23;25(24):13734. doi: 10.3390/ijms252413734.

Abstract

Long non-coding RNAs (lncRNAs) are non-coding RNAs longer than 200 nucleotides that play crucial roles in processes such as cell cycle regulation and gene transcription. Accurately predicting their subcellular localization from sequence information is vital for understanding their biological roles. Computational methods offer an effective alternative to traditional experimental methods for annotating lncRNA subcellular localization. However, existing machine learning-based methods are limited and often overlook regions with coding potential that affect lncRNA function. We therefore propose a new model called LncSL. For feature encoding, it uses both lncRNA sequences and amino acid sequences from open reading frames (ORFs). The most suitable features are selected with CatBoost and integrated into a new feature set. A voting process over seven feature selection algorithms then identifies the more contributive features for training the final stacked model, and an automatic model selection strategy searches for a better-performing meta-model for assembling LncSL. This study focuses specifically on predicting the subcellular localization of lncRNA in the nucleus and cytoplasm. On two benchmark datasets, S1 and S2, LncSL outperformed existing methods by 6.3% to 12.3% in Matthews correlation coefficient (MCC) on a balanced test dataset. On an unbalanced independent test dataset sourced from S1, LncSL improved MCC by 4.7% to 18.6%, which further demonstrates its advantage over the compared methods. Overall, this study presents an effective method for predicting lncRNA subcellular localization by enhancing sequence information that traditional methods often overlook and by addressing the meta-model selection problem, which may offer new insights for other bioinformatics problems.
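Two ideas in the abstract can be illustrated in a few lines: voting across several feature selection algorithms, and scoring a nucleus/cytoplasm classifier with the Matthews correlation coefficient. The following is a minimal Python sketch, not the LncSL implementation. The abstract does not state the voting rule, so the `min_votes` threshold, the function names, and the feature names in the usage example are all hypothetical; the MCC function follows the standard binary definition.

```python
from collections import Counter
from math import sqrt


def vote_features(selections, min_votes):
    """Keep features chosen by at least `min_votes` selectors.

    `selections` holds one collection of feature names per feature
    selection algorithm. The threshold rule is an assumption; the
    abstract only says that seven algorithms vote.
    """
    votes = Counter()
    for picked in selections:
        votes.update(set(picked))  # one vote per selector per feature
    return sorted(name for name, n in votes.items() if n >= min_votes)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical outputs of three selectors (the paper uses seven).
    picks = [["kmer_3", "orf_len"], ["orf_len", "aa_comp"], ["orf_len", "kmer_3"]]
    print(vote_features(picks, min_votes=2))
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC uses all four cells of the confusion matrix, which makes it a common choice when the two classes are unbalanced, as in the independent test set mentioned above.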

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9f07/11678684/7d8601f7b64a/ijms-25-13734-g001.jpg