Biological Sciences Department, Purdue University, West Lafayette, IN, USA.
BMC Bioinformatics. 2019 Jul 30;20(1):409. doi: 10.1186/s12859-019-2999-7.
Internal ribosome entry sites (IRES) are segments of mRNA found in untranslated regions that can recruit the ribosome and initiate translation independently of the 5' cap-dependent translation initiation mechanism. IRES usually function when 5' cap-dependent translation initiation has been blocked or repressed. They are widely reported to play important roles in viral infections and cellular processes. However, only a limited number of confirmed IRES have been reported, because confirmation requires labor-intensive, slow, and low-efficiency laboratory experiments. Bioinformatics tools have been developed, but there is no reliable online tool.
This paper systematically examines the features that can distinguish IRES from non-IRES sequences. Sequence features such as kmer words, structural features such as Q, and sequence/structure hybrid features are evaluated as possible discriminators. These features are incorporated into an IRES classifier based on XGBoost. The XGBoost model performs better than previous classifiers, with higher accuracy and much shorter computational time. Compared to previous predictors, the number of features in the model has been greatly reduced by including global kmer and structural features. The contributions of model features are well explained by LIME and SHapley Additive exPlanations (SHAP). The trained XGBoost model has been implemented as a bioinformatics tool for IRES prediction, IRESpy (https://irespy.shinyapps.io/IRESpy/), which has been applied to scan human 5' UTRs and find novel IRES segments.
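As an illustration of what a global kmer feature might look like, the following is a minimal sketch and not the IRESpy implementation. The function name `kmer_frequencies`, the choice of relative frequencies, the default `k=2`, and the RNA alphabet `ACGU` are all assumptions made for this example. The abstract does not specify how the kmer features are computed.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer in seq.

    Hypothetical helper for illustration only; it is not taken from IRESpy.
    DNA input is converted to RNA (T -> U). Windows containing characters
    outside the alphabet (for example N) are ignored.
    """
    seq = seq.upper().replace("T", "U")
    # Enumerate all 4**k possible k-mers so the feature vector has a fixed layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize only over windows that are valid k-mers.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


if __name__ == "__main__":
    features = kmer_frequencies("GGGACUUUAGC", k=2)
    # Print the non-zero entries of the 16-dimensional dinucleotide vector.
    print({m: round(f, 3) for m, f in features.items() if f > 0})
```

A fixed-length vector of this kind could be passed, together with structural features, to a gradient-boosted tree classifier such as XGBoost. The feature set and training procedure used by the authors are described in the paper itself.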
IRESpy is a fast, reliable, high-throughput online tool for IRES prediction. It is publicly available to all IRES researchers and can be used in other genomics applications such as gene annotation and analysis of differential gene expression.