Asim Muhammad Nabeel, Ibrahim Muhammad Ali, Malik Muhammad Imran, Zehe Christoph, Cloarec Olivier, Trygg Johan, Dengel Andreas, Ahmed Sheraz
Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.
German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany.
Comput Struct Biotechnol J. 2022 Jul 26;20:3986-4002. doi: 10.1016/j.csbj.2022.07.031. eCollection 2022.
Subcellular localization of ribonucleic acid (RNA) molecules provides significant insight into RNA function and helps to explore the association of RNAs with various diseases. Most existing predictors are single-compartment localization predictors (SCLPs), which cannot explain RNA associations with diverse biochemical and pathological processes that arise mainly through RNA co-localization in multiple compartments. The few available multi-compartment localization predictors (MCLPs) achieve decent performance only for a particular target RNA class. Further, existing computational approaches have limited practical significance and limited potential to optimize therapeutics because their models are poorly explainable. This paper presents an explainable Long Short-Term Memory (LSTM) network, "EL-RMLocNet", whose predictive performance and interpretability are optimized using a novel GeneticSeq2Vec statistical representation learning scheme and an attention mechanism, for accurate multi-compartment localization prediction of different RNAs using raw RNA sequences alone. GeneticSeq2Vec generates optimized statistical vectors of raw RNA sequences by capturing short- and long-range relations between nucleotide k-mers. From the sequence vectors generated by GeneticSeq2Vec, LSTM layers extract the most informative features, and an attention layer weights these features according to their discriminative potential for multi-compartment localization prediction. Through reverse engineering, the weights of the statistical feature space are mapped back to nucleotide k-mer patterns, making multi-compartment localization decisions transparent and explainable across RNA classes and species. Empirical evaluation indicates that EL-RMLocNet outperforms the state-of-the-art predictor for subcellular localization prediction of 4 different RNA classes by an average of 8% in accuracy for Homo sapiens and 6% for Mus musculus. 
EL-RMLocNet is freely available as a web server at (https://sds_genetic_analysis.opendfki.de/subcellular_loc/).
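The abstract does not specify how GeneticSeq2Vec or the attention layer are implemented, so the following is only a minimal sketch of two generic ideas it mentions: splitting a raw RNA sequence into overlapping nucleotide k-mers, and mapping per-position attention scores back to the k-mers they belong to. The function names (`kmers`, `softmax`, `top_kmers`), the stride of 1, the T-to-U normalization, and the example scores are all assumptions made for illustration; they are not taken from EL-RMLocNet.

```python
import math
from typing import List, Sequence, Tuple


def kmers(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1.

    Assumption: input is upper-cased and T is mapped to U so that
    cDNA-style input is treated as RNA.
    """
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax turning raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_kmers(seq: str, scores: Sequence[float], k: int = 3,
              n: int = 2) -> List[Tuple[str, float]]:
    """Pair each k-mer with its normalized attention weight and
    return the n highest-weighted k-mers.

    `scores` is a hypothetical list of raw attention scores,
    one per k-mer position.
    """
    tokens = kmers(seq, k)
    if len(tokens) != len(scores):
        raise ValueError("need exactly one score per k-mer")
    weights = softmax(scores)
    ranked = sorted(zip(tokens, weights), key=lambda p: p[1], reverse=True)
    return ranked[:n]
```

For instance, `kmers("ACGUA", 3)` yields the three tokens `ACG`, `CGU`, and `GUA`, and `top_kmers("ACGUA", [0.0, 2.0, 1.0], n=1)` ranks `CGU` first because it carries the largest score. In a trained model the scores would come from the attention layer rather than being supplied by hand.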