Lin Yang, Pan Xiaoyong, Shen Hong-Bin
Department of Automation, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China.
Bioinformatics. 2021 Aug 25;37(16):2308-2316. doi: 10.1093/bioinformatics/btab127.
Long non-coding RNAs (lncRNAs) are generally expressed in a tissue-specific way, and the subcellular localizations of lncRNAs depend on the tissues or cell lines in which they are expressed. Previous computational methods for predicting subcellular localizations of lncRNAs do not take this characteristic into account; they train a unified machine learning model on lncRNAs pooled from all available cell lines. It is therefore important to develop a cell-line-specific computational method to predict lncRNA locations in different cell lines.
In this study, we present lncLocator 2.0, an updated cell-line-specific predictor that trains one end-to-end deep model per cell line to predict lncRNA subcellular localization from sequence. We first construct benchmark datasets of lncRNA subcellular localizations for 15 cell lines. We then learn word embeddings using natural language models and feed these embeddings into a convolutional neural network, a long short-term memory network and a multilayer perceptron to classify subcellular localizations. lncLocator 2.0 achieves varying effectiveness across cell lines, demonstrating the necessity of training cell-line-specific models. Furthermore, we apply Integrated Gradients to explain the model in lncLocator 2.0 and find potential patterns that determine the subcellular localizations of lncRNAs, suggesting that lncRNA subcellular localization is linked to specific nucleotides.
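The attribution idea described above can be illustrated with a minimal, self-contained sketch. It is not the lncLocator 2.0 implementation: the abstract does not state the word size, the embedding method or the network details, so everything here is an assumption made for illustration. Sequences are split into overlapping 3-mers (the value k=3 is assumed), a toy logistic model over k-mer frequencies stands in for the deep network, and the vocabulary, weights and bias are hypothetical values. Integrated Gradients is approximated with a midpoint Riemann sum along the straight path from an all-zero baseline to the input.

```python
import math
from collections import Counter


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers ("words"). k=3 is an assumption."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def featurize(seq, vocab, k=3):
    """Relative frequency of each vocabulary k-mer in the sequence."""
    counts = Counter(kmer_tokens(seq, k))
    total = max(1, sum(counts.values()))
    return [counts[word] / total for word in vocab]


def model(x, weights, bias):
    """Toy logistic model; a stand-in for the deep network, not the real one."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x, weights, bias):
    """Analytic gradient of the logistic output with respect to the input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    grad_sum = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, weights, bias)
        grad_sum = [acc + g for acc, g in zip(grad_sum, grad)]
    return [(xi - b) * acc / steps for xi, b, acc in zip(x, baseline, grad_sum)]


if __name__ == "__main__":
    # Hypothetical vocabulary and parameters, chosen only for illustration.
    vocab = ["AUG", "UGG", "GGC"]
    weights, bias = [2.0, -1.0, 0.5], -0.1
    x = featurize("ATGGCATGG", vocab)
    baseline = [0.0] * len(vocab)
    for word, score in zip(vocab, integrated_gradients(x, baseline, weights, bias)):
        print(word, round(score, 4))
```

A useful sanity check for any Integrated Gradients implementation is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline.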
lncLocator 2.0 is available at www.csbio.sjtu.edu.cn/bioinf/lncLocator2 and the source code can be found at https://github.com/Yang-J-LIN/lncLocator2.