Ramakrishnaiah Yashpal, Morris Adam P, Dhaliwal Jasbir, Philip Melcy, Kuhlmann Levin, Tyagi Sonika
Central Clinical School, Monash University, Melbourne, VIC 3000, Australia.
School of Computing Technologies, Royal Melbourne Institute of Technology University, Melbourne, VIC 3000, Australia.
Epigenomes. 2023 Sep 15;7(3):22. doi: 10.3390/epigenomes7030022.
Long non-coding RNAs (lncRNAs), which comprise a significant portion of the human transcriptome, serve as vital regulators of cellular processes and as potential disease biomarkers. However, the function of most lncRNAs remains unknown, and existing approaches have focused on gene-level investigation. Our work emphasizes the importance of transcript-level annotation for uncovering the roles of specific transcript isoforms. We propose that understanding the mechanisms of lncRNAs in pathological processes requires resolving their structural motifs and interactomes. A complete lncRNA annotation first involves discriminating lncRNAs from their coding counterparts and then predicting their functional motifs and target biomolecules. Current in silico methods mainly perform primary-sequence-based discrimination using a reference model, which limits their comprehensiveness and generalizability. We demonstrate that integrating secondary structure and interactome information, in addition to the transcript sequence, enables a comprehensive functional annotation. Annotating lncRNAs for newly sequenced species is challenging because of inconsistencies in functional annotations, the need for specialized computational techniques, limited access to source code, and the shortcomings of reference-based methods for cross-species prediction. To address these challenges, we developed a pipeline for identifying and annotating transcript sequences at the isoform level. We demonstrate the effectiveness of the pipeline by comprehensively annotating the lncRNAs associated with two specific disease groups. The source code of our pipeline is available under the MIT license for local use by researchers, who can make new predictions using the pre-trained models or re-train the models on new sequence datasets. Non-technical users can access the pipeline through a web server.
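As an illustration of what primary-sequence-based discrimination typically relies on, the following minimal Python sketch computes a few generic sequence features (longest open reading frame, GC content, k-mer frequencies) that could feed a coding versus non-coding classifier. The abstract does not specify the features or models used by the pipeline, so the function names, the feature set, and the choice of k here are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def normalize(seq: str) -> str:
    """Upper-case the sequence and map RNA 'U' to DNA 'T'."""
    return seq.upper().replace("U", "T")


def longest_orf_length(seq: str) -> int:
    """Length (nt, stop codon included) of the longest ATG..stop ORF
    in the three forward reading frames. Hypothetical helper."""
    seq = normalize(seq)
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of every A/C/G/T k-mer (overlapping windows)."""
    seq = normalize(seq)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def sequence_features(seq: str, k: int = 3) -> dict:
    """Assumed, illustrative feature set for one transcript."""
    s = normalize(seq)
    length = len(s)
    gc = (s.count("G") + s.count("C")) / length if length else 0.0
    orf = longest_orf_length(s)
    features = {
        "length": length,
        "gc_content": gc,
        "longest_orf": orf,
        "orf_fraction": orf / length if length else 0.0,
    }
    features.update(kmer_frequencies(s, k))
    return features


if __name__ == "__main__":
    toy_transcript = "CCAUGAAAUAGCC"  # made-up sequence, not real data
    print(sequence_features(toy_transcript, k=2))
```

Features of this kind describe only the primary sequence. The structure and interactome information that the abstract argues for would have to be added as further feature groups, which this sketch does not attempt.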