Hu Si-Le, Chen Ying-Li, Zhang Lu-Qiang, Bai Hui, Yang Jia-Hong, Li Qian-Zhong
School of Physical Science and Technology, Inner Mongolia University, Hohhot, China.
The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot, China.
Front Mol Biosci. 2024 Sep 5;11:1452142. doi: 10.3389/fmolb.2024.1452142. eCollection 2024.
Long non-coding RNAs (lncRNAs) play crucial roles in genetic markers, genome rearrangement, chromatin modifications, and other biological processes. Increasing evidence suggests that lncRNA functions are closely related to their subcellular localization. However, lncRNAs are unevenly distributed across subcellular localizations: more than ten times as many are located in the nucleus as in the exosome.
In this study, we propose a new oversampling method to construct a predictive dataset and develop a predictive model called LncSTPred. The model uses an improved AdaBoost algorithm to predict subcellular localization from 3-mer, 3-RF sequence, and minimum free energy structure features.
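As an illustration of the 3-mer features mentioned above, the following is a minimal sketch of k-mer frequency extraction from an RNA sequence. It is not the LncSTPred implementation: the function name `kmer_frequencies`, the ACGU alphabet, the conversion of T to U, and the normalization by the number of valid windows are all assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer in seq.

    Hypothetical helper for illustration only; the source does not
    specify how its 3-mer features are normalized.
    """
    # Assumption: DNA-style input is accepted by mapping T to U.
    seq = seq.upper().replace("T", "U")
    # Enumerate all len(alphabet)**k k-mers (64 for k=3) in a fixed order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Windows containing ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: count / total for kmer, count in counts.items()}


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGUACGU")
    print(len(freqs), freqs["ACG"])
```

Each sequence then maps to a fixed-length vector of 64 values, which is the kind of input a boosted classifier can consume directly.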
Our improved AdaBoost algorithm achieved better prediction accuracy for lncRNA subcellular localization. In addition, we evaluated feature importance using the F-score and analyzed the influence of highly relevant features on lncRNAs. Our study shows that the ANA features may be a key factor for predicting lncRNA subcellular localization, which correlates with the composition of stems and loops in the secondary structure of lncRNAs.
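The F-score ranking mentioned above can be sketched as follows. The source does not state which F-score definition it uses; this example assumes the common two-class form, in which the squared distances of the class means from the overall mean are divided by the sum of the within-class sample variances. The function name `f_score` and the toy values are hypothetical.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """Two-class F-score of a single feature (assumed definition).

    pos and neg hold the feature's values in the two classes; each
    list needs at least two values for the sample variance.
    """
    overall = mean(pos + neg)
    # Between-class term: how far each class mean is from the overall mean.
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    # Within-class term: sample variances (n - 1 denominator) of each class.
    within = variance(pos) + variance(neg)
    if within == 0:
        # Degenerate case: no spread inside either class.
        return 0.0 if between == 0 else float("inf")
    return between / within


if __name__ == "__main__":
    # Toy values, not data from the study.
    print(f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

A larger value indicates that the feature separates the two classes more clearly, so features can be ranked by sorting on this score.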