Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China.
Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada.
Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab360.
Long non-coding RNAs (lncRNAs) are a class of RNA molecules longer than 200 nucleotides. A growing body of evidence shows that the subcellular localization of lncRNAs can provide valuable insights into their biological functions. Existing computational methods for predicting lncRNA subcellular localization encode lncRNA sequences with k-mer features. However, k-mer features alone discard sequence order information. We propose a deep learning framework, DeepLncLoc, to predict lncRNA subcellular localization. DeepLncLoc introduces a new subsequence embedding method that preserves the order information of lncRNA sequences. The method first divides a sequence into consecutive subsequences, then extracts the patterns of each subsequence, and finally combines these patterns into a complete representation of the lncRNA sequence. A text convolutional neural network is then applied to learn high-level features and perform the prediction. DeepLncLoc outperformed traditional machine learning models, popular representation methods and existing predictors, indicating that it can effectively predict lncRNA subcellular localization. Our study not only presents a novel computational model for predicting lncRNA subcellular localization but also introduces a new subsequence embedding method that is expected to be applicable to other sequence-based prediction tasks. The DeepLncLoc web server is freely accessible at http://bioinformatics.csu.edu.cn/DeepLncLoc/, and the source code and datasets can be downloaded from https://github.com/CSUBioGroup/DeepLncLoc.
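The contrast between whole-sequence k-mer features and an order-preserving subsequence representation can be illustrated with a minimal sketch. This is not the DeepLncLoc implementation: the function names, the use of normalized k-mer frequencies as the per-subsequence "pattern", and the parameters `n_parts` and `k` are all assumptions made for illustration. The abstract does not specify how patterns are extracted, and the published code at the GitHub URL above should be consulted for the actual method.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers over the whole sequence (order is discarded)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def split_consecutive(seq, n_parts):
    """Split seq into n_parts consecutive, non-overlapping chunks of near-equal length."""
    base, extra = divmod(len(seq), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


def subsequence_features(seq, n_parts=4, k=2, alphabet="ACGT"):
    """Hypothetical stand-in for a subsequence embedding.

    Each chunk is summarized by its k-mer frequency vector (an assumed
    'pattern'); the rows are kept in sequence order, so the result is an
    n_parts x 4**k matrix rather than a single order-free vector.
    """
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    rows = []
    for chunk in split_consecutive(seq, n_parts):
        counts = kmer_counts(chunk, k)
        total = sum(counts.values()) or 1  # avoid division by zero on short chunks
        rows.append([counts[kmer] / total for kmer in vocab])
    return rows


if __name__ == "__main__":
    seq_a = "AAAAAAAACCCCCCCC"
    seq_b = "CCCCCCCCAAAAAAAA"
    # Whole-sequence 1-mer counts cannot tell the two sequences apart.
    print(kmer_counts(seq_a, 1) == kmer_counts(seq_b, 1))
    # The per-chunk representation keeps the order of the two halves.
    print(subsequence_features(seq_a, n_parts=2, k=1))
    print(subsequence_features(seq_b, n_parts=2, k=1))
```

In this toy case the two sequences have identical global 1-mer counts, yet their per-chunk matrices differ because the rows follow the original order. A matrix of this shape is the kind of input a text convolutional network could consume row by row, although the features DeepLncLoc actually feeds to its network may differ from the frequencies assumed here.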