Yi Hai-Cheng, You Zhu-Hong, Cheng Li, Zhou Xi, Jiang Tong-Hai, Li Xiao, Wang Yan-Bin
The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Comput Struct Biotechnol J. 2019 Nov 30;18:20-26. doi: 10.1016/j.csbj.2019.11.004. eCollection 2020.
Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method for predicting potential lncRNA-protein interactions, named LPI-Pred, which is based on distributed representation learning of sequences and is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segments, which can be regarded as "words" in natural language processing. We then trained the RNA2vec and Pro2vec models using word2vec on genome-wide lncRNA and protein sequences to learn distributed representations of RNA and protein. Next, the dimension of the resulting complex features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
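The k-mer "word" segmentation and the Gini impurity measure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmer_words` and `gini_impurity`, the overlapping sliding window, the `step` parameter and the values of k are all assumptions made for illustration, since the abstract does not state them.

```python
from collections import Counter


def kmer_words(sequence, k=3, step=1):
    """Split a sequence into k-mer "words" with a sliding window.

    Assumption: windows overlap (step=1); the abstract does not say
    whether LPI-Pred uses overlapping or non-overlapping k-mers.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_c ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical toy sequences, not taken from any benchmark dataset.
rna_words = kmer_words("AUGGCUA", k=4)    # RNA "sentence" of 4-mers
protein_words = kmer_words("MKVLA", k=3)  # protein "sentence" of 3-mers
```

Lists of words such as `rna_words` are the kind of "sentences" a word2vec-style model takes as input. The Gini impurity is the quantity that tree-based models use to score how well a feature separates interacting from non-interacting pairs.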