Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
Interdiscip Sci. 2022 Dec;14(4):841-862. doi: 10.1007/s12539-022-00535-x. Epub 2022 Aug 10.
Interactions of long non-coding ribonucleic acids (lncRNAs) with micro-ribonucleic acids (miRNAs) play an essential role in gene regulation, cellular metabolism, and pathological processes. Existing purely sequence-based computational approaches lack robustness and efficiency, mainly because lncRNA sequence lengths vary so widely. Hence, the prime focus of the current study is to find optimal length trade-offs for lncRNA sequences of highly variable length.
This paper explores diverse copy padding and sequence truncation approaches in depth, and presents a novel idea of using only subregions of lncRNA sequences to generate fixed-length lncRNA sequences. Furthermore, it presents a novel bag-of-tricks-based deep learning approach, "BoT-Net", which leverages a single-layer long short-term memory network regularized through DropConnect to capture higher-order residue dependencies, pooling to retain the most salient features, normalization to prevent exploding and vanishing gradients, learning rate decay, and dropout to regularize a precise neural network for lncRNA-miRNA interaction prediction.
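The fixed-length generation strategies named above can be illustrated with a minimal sketch. The function names (`copy_pad`, `truncate`, `fixed_length`) and the choice to keep either the start or the end of a sequence are assumptions made for illustration; the abstract does not specify how BoT-Net implements these steps.

```python
def copy_pad(seq: str, target_len: int) -> str:
    """Copy padding: repeat the sequence until it reaches target_len."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    repeats = -(-target_len // len(seq))  # ceiling division
    return (seq * repeats)[:target_len]


def truncate(seq: str, target_len: int, keep: str = "start") -> str:
    """Truncation: keep only a subregion of target_len residues.

    keep="start" retains the first residues, keep="end" the last ones
    (hypothetical options; other subregions are equally possible).
    """
    if keep == "start":
        return seq[:target_len]
    if keep == "end":
        return seq[-target_len:] if target_len > 0 else ""
    raise ValueError(f"unknown subregion: {keep!r}")


def fixed_length(seq: str, target_len: int, keep: str = "start") -> str:
    """Truncate long sequences and copy-pad short ones to one common length."""
    if len(seq) >= target_len:
        return truncate(seq, target_len, keep)
    return copy_pad(seq, target_len)
```

For example, `fixed_length("ACG", 5)` copy-pads the short sequence to `"ACGAC"`, while a sequence longer than the target is cut down to its chosen subregion.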
BoT-Net outperforms the state-of-the-art lncRNA-miRNA interaction prediction approach by 2%, 8%, and 4% in terms of accuracy, specificity, and Matthews correlation coefficient, respectively. Furthermore, a case study analysis indicates that BoT-Net also outperforms the state-of-the-art lncRNA-protein interaction predictor on a benchmark dataset by 10% in accuracy, 19% in sensitivity, 6% in specificity, 14% in precision, and 26% in Matthews correlation coefficient.
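For reference, the evaluation measures reported above have standard definitions in terms of a binary confusion matrix. The sketch below computes them from counts of true/false positives and negatives; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's results.

```python
import math


def _safe_div(num: float, den: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification measures from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": _safe_div(tp, tp + fn),   # true positive rate
        "specificity": _safe_div(tn, tn + fp),   # true negative rate
        "precision": _safe_div(tp, tp + fp),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for interaction prediction tasks.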
In the benchmark lncRNA-miRNA interaction prediction dataset, lncRNA sequence length varies from 213 residues to 22,743 residues, and in the benchmark lncRNA-protein interaction prediction dataset, it varies from 15 residues to 1,504 residues. For sequences of such highly variable length, generating fixed-length sequences through copy padding introduces a significant level of bias: it makes a large number of lncRNA sequences nearly identical to each other and eventually derails classifier generalizability. Empirical evaluation reveals that the first 50 residues of long lncRNA sequences alone contain a highly informative distribution for lncRNA-miRNA interaction prediction, a crucial finding that the proposed BoT-Net approach exploits to optimize the lncRNA fixed-length generation process.
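A short worked calculation shows the scale of the copy padding problem for the lncRNA-miRNA dataset lengths quoted above. It assumes, purely for illustration, that the shortest sequence is padded all the way to the longest sequence length; the helper name `copied_fraction` is hypothetical.

```python
def copied_fraction(original_len: int, target_len: int) -> float:
    """Fraction of a copy-padded sequence that is repeated content."""
    if original_len >= target_len:
        return 0.0  # nothing needs to be copied
    return 1 - original_len / target_len


shortest, longest = 213, 22_743  # lengths reported for the lncRNA-miRNA dataset

# How many whole copies of the shortest sequence fit, and what is left over.
full_copies, remainder = divmod(longest, shortest)

# Share of the padded sequence that is duplicated material.
fraction = copied_fraction(shortest, longest)

print(full_copies, remainder, round(fraction, 4))
```

Under this assumption the padded sequence consists of 106 full copies plus 165 further residues, so about 99% of it is repeated material. In contrast, keeping only the first 50 residues requires no padding at all in this dataset, because even the shortest sequence (213 residues) is longer than 50.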
The BoT-Net web server can be accessed at https://sds_genetic_analysis.opendfki.de/lncmiRNA/.