College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.
College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai 519041, China.
Math Biosci. 2019 Sep;315:108229. doi: 10.1016/j.mbs.2019.108229. Epub 2019 Jul 16.
Long noncoding RNAs (lncRNAs), noncoding RNAs longer than 200 nucleotides, have gained considerable attention in recent decades. Many studies have confirmed that the human genome contains many thousands of lncRNAs. LncRNAs play significant roles in many important biological processes and are relevant to the diagnosis, prognosis, prevention and treatment of complex diseases. For some important diseases such as cancer, lncRNAs have become novel candidate biomarkers. However, research on the role of lncRNAs in human diseases is still in its infancy, and only a small fraction of lncRNA-disease associations have been experimentally verified. Predicting lncRNA-disease associations is an important way to understand the mechanisms and functions of lncRNAs involved in diseases and to enrich lncRNA annotations. Therefore, it is urgent to prioritize lncRNAs potentially associated with diseases. A biological system is a highly complex heterogeneous network involving different molecules. Algorithms based on network methods, which can provide a quantifiable characterization of the networks describing diverse biological systems, have therefore been extensively applied in the information field. However, similarity-based methods for predicting lncRNA-disease associations, which rely on an array of varying features of lncRNAs and diseases, rarely exploit the topology of a heterogeneous network with abundant interactions between biomedical entities. DeepWalk, which encodes the relations between nodes in a continuous vector space, extends language modeling and unsupervised learning from sequences of words to networks. In this article, we present a novel lncRNA-disease association prediction method based on DeepWalk, which enhances existing association discovery methods through a topology-based similarity measure.
We integrate heterogeneous data to construct a Linked Tripartite Network, a heterogeneous network containing three types of nodes generated from bioinformatics linked datasets, and use DeepWalk to extract topological structure features of its nodes for calculating similarities. The proposed method proceeds in four steps. First, we integrate heterogeneous data to construct the Linked Tripartite Network, which contains the known lncRNA-disease, lncRNA-microRNA and microRNA-disease interactions. Second, the topological structure features of the nodes are extracted with DeepWalk. Third, similarity scores of disease-disease pairs and lncRNA-lncRNA pairs are computed from the topology of this network. Finally, new lncRNA-disease associations are discovered by a rule-based inference method using the lncRNA-lncRNA similarities. The proposed method shows superior predictive performance for lncRNA-disease association prediction based on topological similarity in a heterogeneous network, with performance measured by the AUC value. The similarity measure using network topology based on DeepWalk provides a novel perspective that differs from similarity derived from sequence or structure information. Availability: All data and code are freely available at: https://github.com/Pengeace/lncRNA-disease-link.
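To make the four steps concrete, the following is a minimal Python sketch of the pipeline on a toy network. It is not the authors' implementation (see the repository above for that). Several things here are assumptions made for illustration: the node identifiers and edges in `EDGES` are invented, the walk and window parameters are arbitrary, and the inference rule (score a candidate pair by the maximum similarity between the lncRNA and any lncRNA already linked to the disease) is one plausible rule, since the abstract does not specify it. DeepWalk proper trains a skip-gram model on the random walks; to stay dependency-free, this sketch substitutes windowed co-occurrence counts for the learned embeddings.

```python
import math
import random
from collections import defaultdict

# Toy tripartite edges. All identifiers are hypothetical, not real associations.
EDGES = [
    ("lnc:A", "dis:X"), ("lnc:A", "mir:1"), ("lnc:B", "mir:1"),
    ("lnc:B", "mir:2"), ("mir:1", "dis:X"), ("mir:2", "dis:Y"),
    ("lnc:C", "mir:2"), ("lnc:C", "dis:Y"),
]

def build_graph(edges):
    """Step 1: undirected adjacency lists for the linked tripartite network."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return {node: sorted(nbrs) for node, nbrs in graph.items()}

def random_walks(graph, walks_per_node=20, walk_length=10, seed=0):
    """Step 2a: truncated uniform random walks, as in DeepWalk."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

def cooccurrence_vectors(walks, window=2):
    """Step 2b (stand-in for skip-gram): sparse context-count vectors."""
    vecs = defaultdict(lambda: defaultdict(float))
    for walk in walks:
        for i, node in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[node][walk[j]] += 1.0
    return {node: dict(ctx) for node, ctx in vecs.items()}

def cosine(a, b):
    """Step 3: cosine similarity between two sparse vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_candidates(graph, vecs):
    """Step 4 (assumed rule): score an unlinked (lncRNA, disease) pair by the
    maximum similarity to any lncRNA already linked to that disease."""
    lncs = [n for n in graph if n.startswith("lnc:")]
    diseases = [n for n in graph if n.startswith("dis:")]
    scores = {}
    for disease in diseases:
        known = [l for l in lncs if disease in graph[l]]
        for lnc in lncs:
            if known and lnc not in known:
                scores[(lnc, disease)] = max(
                    cosine(vecs[lnc], vecs[k]) for k in known
                )
    return scores

if __name__ == "__main__":
    graph = build_graph(EDGES)
    vecs = cooccurrence_vectors(random_walks(graph))
    ranked = sorted(score_candidates(graph, vecs).items(), key=lambda kv: -kv[1])
    for (lnc, disease), score in ranked:
        print(f"{lnc} -> {disease}: {score:.3f}")
```

In a fuller version, the co-occurrence step would be replaced by a trained skip-gram model over the same walks, and the ranked candidate scores would be compared against held-out known associations to compute the AUC mentioned above.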