Zhang Lina, Gao Sizan, Yuan Qinghao, Fu Yao, Yang Runtao
School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, 264209, China.
Comput Biol Chem. 2025 Apr;115:108336. doi: 10.1016/j.compbiolchem.2024.108336. Epub 2025 Jan 1.
Long non-coding RNAs (lncRNAs) are strongly associated with cellular physiological mechanisms and are implicated in numerous diseases. Exploring the subcellular localizations of lncRNAs not only gives crucial insight into the molecular mechanisms of lncRNA-related biological processes but also contributes to the diagnosis, prevention, and treatment of various human diseases. However, conventional experimental techniques tend to be laborious and time-intensive, so computational methods are in increasing demand. This paper develops an ensemble method that incorporates hybrid features to accurately predict the subcellular localizations of lncRNAs. Because features drawn from a single source do not fully reflect the inherent correlation with the prediction target, heterogeneous multi-source features are used, combining information on sequence composition, physicochemical properties, and structure. To address class imbalance in the benchmark dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is employed. Finally, the resulting predictor, termed lncSLPre, is built by integrating the outputs of the individual classifiers. Experimental findings suggest that the complementarity of multi-source heterogeneous features improves prediction performance. They also show that SMOTE is effective in mitigating the dataset imbalance and that feature selection is critical for eliminating irrelevant and redundant features. Compared with existing advanced methods, lncSLPre achieves better performance, with overall accuracy improvements of 13.13%, 2.15%, and 3.23%, respectively, indicating that lncSLPre can effectively predict lncRNA subcellular localizations.
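The three ingredients named in the abstract (a sequence-composition feature, SMOTE oversampling, and integration of classifier outputs) can be illustrated with a minimal, standard-library-only Python sketch. It is not the lncSLPre implementation: the abstract does not specify the feature encodings, the SMOTE settings, the base classifiers, or the integration rule. The function names `kmer_frequencies`, `smote`, and `soft_vote`, the choice of k-mer frequencies as the composition feature, the neighbour count `k=3`, and probability averaging (soft voting) as the integration rule are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies over the RNA alphabet.

    Assumption: one possible sequence-composition feature, not necessarily
    the encoding used by lncSLPre.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid dividing by zero
    return [counts[m] / total for m in kmers]


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling of one minority class.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def soft_vote(prob_rows, labels):
    """Average per-class probabilities from several classifiers.

    Assumption: soft voting stands in for the unspecified integration rule.
    Returns the label with the highest mean probability.
    """
    n = len(prob_rows)
    mean = [sum(row[i] for row in prob_rows) / n for i in range(len(labels))]
    return labels[mean.index(max(mean))]
```

In a pipeline of this shape, `kmer_frequencies` would be applied to each lncRNA sequence, `smote` to the feature vectors of each under-represented localization class in the training set only, and `soft_vote` to the probability vectors that the trained base classifiers return for a query sequence. Oversampling before a train/test split would leak synthetic neighbours of test samples into training, so the split should come first.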