Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task.

Affiliations

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America.

Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, United States of America.

Publication information

PLoS Comput Biol. 2023 Oct 12;19(10):e1011526. doi: 10.1371/journal.pcbi.1011526. eCollection 2023 Oct.

DOI:10.1371/journal.pcbi.1011526

PMID:37824580

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10597526/

Abstract

Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
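The three-nucleotide periodicity mentioned above can be illustrated with the classical signal-processing approach that the abstract cites as inspiration. The following is a minimal sketch, not LFNet and not the authors' code: it assumes a simple one-indicator-per-base encoding and compares the spectral power at period 3 with the average power at the other nonzero frequencies. The function name `period3_power_ratio` and the `eps` parameter are hypothetical and introduced only for this illustration.

```python
import numpy as np


def period3_power_ratio(seq: str, eps: float = 1e-12) -> float:
    """Power at period 3 divided by mean power at the other nonzero frequencies.

    Illustrative only: a classical DFT-based periodicity score, not LFNet.
    """
    seq = seq.upper().replace("T", "U")
    # Trim so the length is a multiple of 3 and period 3 falls on an exact DFT bin.
    n = len(seq) - len(seq) % 3
    if n < 6:
        raise ValueError("need at least 6 nucleotides")
    seq = seq[:n]

    # Sum the power spectra of the four binary indicator sequences (A, C, G, U).
    power = np.zeros(n // 2 + 1)
    for base in "ACGU":
        indicator = np.array([c == base for c in seq], dtype=float)
        power += np.abs(np.fft.rfft(indicator)) ** 2

    k = n // 3  # DFT bin corresponding to a period of 3 nucleotides
    # Background: every bin except the zero-frequency bin and the period-3 bin.
    background = np.delete(power[1:], k - 1)
    return float(power[k] / (background.mean() + eps))


# A toy comparison: a strictly periodic sequence versus a shuffled copy of it.
rng = np.random.default_rng(0)
periodic = "AUG" * 40
shuffled = "".join(rng.permutation(list(periodic)))
print(period3_power_ratio(periodic), period3_power_ratio(shuffled))
```

In this toy comparison the periodic sequence should score far higher than its shuffled copy, which has the same base composition but no codon structure. Real coding sequences show a much weaker but still detectable peak at this frequency, and that regularity is what an architecture with a Fourier-domain inductive bias is positioned to exploit.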
