Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions.

Authors

Yi Hai-Cheng, You Zhu-Hong, Cheng Li, Zhou Xi, Jiang Tong-Hai, Li Xiao, Wang Yan-Bin

Affiliations

The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.

University of Chinese Academy of Sciences, Beijing 100049, China.

Publication information

Comput Struct Biotechnol J. 2019 Nov 30;18:20-26. doi: 10.1016/j.csbj.2019.11.004. eCollection 2020.

DOI:10.1016/j.csbj.2019.11.004

PMID:31890140

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6926125/

Abstract

Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method for predicting potential lncRNA-protein interactions based on learning distributed representations of sequences, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were divided into k-mer segments, which can be regarded as "words" in natural language processing. We then trained the RNA2vec and Pro2vec models using word2vec on genome-wide lncRNA and protein sequences to mine distributed representations of RNA and protein. Next, the dimension of the resulting complex features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
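The two most mechanical steps named in the abstract, k-mer segmentation and the Gini impurity measure, can be illustrated with a minimal sketch. This is not the authors' code: the function names `kmer_words` and `gini_impurity`, the choice k = 3, and the overlapping stride of 1 are assumptions made for illustration only, since the abstract does not state these settings. The sketch uses only the Python standard library; in a fuller pipeline the resulting word lists would be passed to a word2vec implementation and the selected features to a Random Forest.

```python
from collections import Counter
from typing import List, Sequence


def kmer_words(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a biological sequence into k-mer "words".

    k and stride are illustrative assumptions, not values from the paper.
    With stride=1 the k-mers overlap; a sequence shorter than k gives [].
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum_c p_c^2 of a collection of class labels.

    It is 0 for a pure collection and 0.5 for a balanced binary one.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Toy sequences (hypothetical, not taken from any benchmark dataset).
rna_words = kmer_words("AUGGCUA", k=3)    # RNA "sentence" of 3-mers
protein_words = kmer_words("MKVLA", k=3)  # protein "sentence" of 3-mers
balanced = gini_impurity([0, 0, 1, 1])    # interacting vs non-interacting
```

Treating each sequence as a sentence of k-mer words is what lets a word-embedding model be applied to RNA and protein data. The impurity function is the quantity that decision trees reduce when they split on a feature, which is the usual basis for Gini-based feature ranking.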

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1700/6926125/da8b444b94d4/ga1.jpg
