Yi Weijun, Miller Jason R, Hu Gangqing, Adjeroh Donald A
Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.
Computer Science and Information Technology, Hood College, Frederick, MD 21701, USA.
Noncoding RNA. 2025 Jun 25;11(4):49. doi: 10.3390/ncrna11040049.
Long non-coding ribonucleic acids (lncRNAs) can be localized to different cellular compartments, such as the nucleus and the cytoplasm. Their biological functions are influenced by the region of the cell where they are located. Compared to the vast number of known lncRNAs, only a relatively small proportion have annotations regarding their subcellular localization. It would be helpful if those few annotated lncRNAs could be leveraged to develop predictive models for the localization of other lncRNAs. Conventional computational methods extract k-mer profiles from lncRNA sequences and use the profiles to train machine learning models such as support vector machines and logistic regression. These methods consider only exact k-mers. Given possible sequence mutations and other uncertainties in genomic sequences and their role in biological function, accounting for this variability might improve our ability to model lncRNAs and their localization. Thus, we build on inexact k-mers and use machine learning/deep learning techniques to study three specific problems in lncRNA subcellular localization, namely, prediction of lncRNA localization using inexact k-mers, the question of whether lncRNA localization is cell-type-specific, and the notion of switching (lncRNA) genes. We performed our analysis using data on lncRNA localization across 15 cell lines. Our results showed that using inexact k-mers (with k = 6) can improve lncRNA localization prediction performance compared to using exact k-mers. Further, we showed that lncRNA localization, in general, is not cell-line-specific. We also identified a category of lncRNAs that switch cellular compartments between different cell lines (we call them switching lncRNAs). These switching lncRNAs complicate the problem of predicting lncRNA localization using machine learning models, showing that lncRNA localization prediction is still a major challenge.
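To make the distinction between exact and inexact k-mer profiles concrete, the following is a minimal Python sketch. The abstract does not define how inexact k-mers are computed, so the sketch assumes one common interpretation: each length-k window also counts toward every k-mer that differs from it in at most one position (Hamming distance ≤ 1). The function names, the one-mismatch threshold, and the use of an uppercase A/C/G/T alphabet (cDNA-style sequences) are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: uppercase cDNA-style sequences (T rather than U)


def empty_profile(k):
    """Return a dict with a zero count for every possible k-mer."""
    return {"".join(p): 0 for p in product(ALPHABET, repeat=k)}


def exact_kmer_profile(seq, k):
    """Count each k-mer that occurs verbatim in seq."""
    counts = empty_profile(k)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    return counts


def one_mismatch_neighbors(kmer):
    """Return kmer itself plus every string at Hamming distance exactly 1."""
    neighbors = {kmer}
    for pos, base in enumerate(kmer):
        for alt in ALPHABET:
            if alt != base:
                neighbors.add(kmer[:pos] + alt + kmer[pos + 1:])
    return neighbors


def inexact_kmer_profile(seq, k):
    """Count k-mers, letting each window also vote for its one-mismatch neighbors.

    This is one assumed definition of an inexact k-mer profile.
    """
    counts = empty_profile(k)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            for neighbor in one_mismatch_neighbors(window):
                counts[neighbor] += 1
    return counts


if __name__ == "__main__":
    toy = "ACGTAC"  # hypothetical toy sequence, not real lncRNA data
    exact = exact_kmer_profile(toy, 2)
    inexact = inexact_kmer_profile(toy, 2)
    print("exact AC:", exact["AC"], "| inexact AA:", inexact["AA"])
```

With k = 6, as in the abstract, each profile has 4^6 = 4096 entries; such a vector could then be passed as features to a classifier such as logistic regression. Under this assumed definition, a k-mer that never occurs verbatim can still receive a nonzero count from near matches, which is how the profile tolerates single-base variation.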