Department of Informatics, Technical University of Munich, Garching, Germany.
Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany.
PLoS Comput Biol. 2021 May 10;17(5):e1008982. doi: 10.1371/journal.pcbi.1008982. eCollection 2021 May.
The 5' untranslated region (5'UTR) plays a key role in regulating mRNA translation and, consequently, protein abundance. Accurate modeling of 5'UTR regulatory sequences should therefore provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to the sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduce frame pooling, a novel neural network operation that enables the development of an MRL prediction model for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed-length randomized sequences, while generalizing better to longer sequences and to a variety of translation-related genome-wide datasets. We demonstrate variant interpretation on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released as open source, could help pinpoint pathogenic genetic variants.
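The abstract does not spell out how frame pooling works. One plausible reading, offered here only as a minimal sketch and not as the released implementation, is that per-position features (for example, the output of convolutional layers) are split into the three reading frames, counted from the 3' end of the 5'UTR so that frame 0 stays in frame with the downstream start codon, and each frame is then summarized by a global maximum and a global mean. Because the output size depends only on the number of channels, sequences of any length map to a vector of the same size. The function name `frame_pool`, the list-of-lists input layout, and the max-then-mean output ordering are assumptions made for illustration.

```python
from typing import List


def frame_pool(features: List[List[float]]) -> List[float]:
    """Pool per-position features separately for each of the 3 reading frames.

    features: one row per sequence position (5' to 3'), one column per channel.
    Returns a flat list of length 6 * channels: for each frame, the
    per-channel maximum followed by the per-channel mean.
    Hypothetical helper; layout and ordering are illustrative assumptions.
    """
    if not features:
        raise ValueError("features must contain at least one position")

    length = len(features)
    channels = len(features[0])
    pooled: List[float] = []

    for frame in range(3):
        # Walk backwards from the 3' end in steps of 3, so frame 0 always
        # contains the last position regardless of sequence length.
        rows = [features[i] for i in range(length - 1 - frame, -1, -3)]

        if not rows:
            # Sequence too short to populate this frame: pad with zeros
            # (an assumption of this sketch).
            pooled.extend([0.0] * (2 * channels))
            continue

        # Global max pooling over the positions of this frame.
        pooled.extend(max(row[c] for row in rows) for c in range(channels))
        # Global average pooling over the positions of this frame.
        pooled.extend(
            sum(row[c] for row in rows) / len(rows) for c in range(channels)
        )

    return pooled


if __name__ == "__main__":
    short_utr = [[0.1, 0.2, 0.3, 0.4]] * 50
    long_utr = [[0.1, 0.2, 0.3, 0.4]] * 120
    # Both inputs yield vectors of the same size despite different lengths.
    print(len(frame_pool(short_utr)), len(frame_pool(long_utr)))
```

The fixed-size output is what would let dense layers follow the pooling step without truncating or padding the input sequence to the reporter-assay length.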