College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin 150040, China.
Methods. 2023 Apr;212:21-30. doi: 10.1016/j.ymeth.2023.02.009. Epub 2023 Feb 20.
Long non-coding RNAs (lncRNAs) are a class of essential non-coding RNAs longer than 200 nucleotides (nt). Recent studies indicate that lncRNAs have diverse and complex regulatory functions that strongly affect many fundamental biological processes. However, measuring the functional similarity between lncRNAs with traditional wet-lab experiments is time-consuming and labor-intensive, so computational approaches have become an effective alternative. Most sequence-based computational methods measure the functional similarity of lncRNAs using fixed-length vector representations, which cannot capture the features of larger k-mers. It is therefore urgent to improve the prediction of the potential regulatory functions of lncRNAs. In this study, we propose a novel approach called MFSLNC, which comprehensively measures the functional similarity of lncRNAs based on variable k-mer profiles of their nucleotide sequences. MFSLNC stores these profiles in a dictionary tree (trie), which allows lncRNAs to be represented comprehensively with long k-mers. The functional similarity between lncRNAs is evaluated with the Jaccard similarity. MFSLNC was validated on lncRNAs that share the same mechanism, detecting homologous sequence pairs between human and mouse. MFSLNC was also applied to lncRNA-disease association prediction in combination with the association prediction model WKNKN. Moreover, using lncRNA-mRNA association data, we showed that our method calculates lncRNA similarity more effectively than classical methods. The prediction achieves an AUC of 0.867, which compares well with similar models.
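The core idea described above (variable k-mer profiles held in a dictionary tree and compared with the Jaccard similarity) can be illustrated with a minimal sketch. This is not the MFSLNC implementation: the class and function names, the k-mer range `k_min`..`k_max`, and the use of plain presence/absence sets rather than weighted counts are all assumptions made for illustration.

```python
from __future__ import annotations

END = "$"  # hypothetical marker: the path from the root to this node is a stored k-mer


class KmerTrie:
    """Dictionary tree holding every k-mer of a sequence for k_min <= k <= k_max."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, seq: str, k_min: int, k_max: int) -> None:
        # Walk a window of at most k_max bases from every start position.
        # Shorter k-mers are prefixes of longer ones, so they share trie nodes.
        for start in range(len(seq)):
            node = self.root
            for depth, base in enumerate(seq[start:start + k_max], start=1):
                node = node.setdefault(base, {})
                if depth >= k_min:
                    node[END] = True

    def kmers(self) -> set[str]:
        # Depth-first traversal collecting every marked path as a k-mer string.
        found = set()
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            for key, child in node.items():
                if key == END:
                    found.add(prefix)
                else:
                    stack.append((child, prefix + key))
        return found


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A and B| / |A or B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def kmer_similarity(seq_a: str, seq_b: str, k_min: int = 2, k_max: int = 4) -> float:
    """Assumed similarity: Jaccard index over the two variable k-mer sets."""
    profiles = []
    for seq in (seq_a, seq_b):
        trie = KmerTrie()
        trie.insert(seq.upper(), k_min, k_max)
        profiles.append(trie.kmers())
    return jaccard(profiles[0], profiles[1])
```

Calling `kmer_similarity("ACGU", "ACGA", k_min=2, k_max=3)` compares the two sets of 2-mers and 3-mers. Because shorter k-mers are prefixes of longer ones, the trie stores them along shared paths, which is what makes long and variable k feasible compared with a fixed-length vector of size 4^k.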