Department of Computer Science, University of Toronto, Toronto, Canada.
Department of Cell and Systems Biology, University of Toronto, Toronto, Canada.
PLoS Comput Biol. 2022 Jun 29;18(6):e1010238. doi: 10.1371/journal.pcbi.1010238. eCollection 2022 Jun.
A major challenge in characterizing intrinsically disordered regions (IDRs), which are widespread in the proteome but relatively poorly understood, is identifying the molecular features that mediate their functions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call "reverse homology", exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homolog from another set of IDRs sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques, and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of which residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.
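The contrastive task described above can be illustrated with a minimal sketch. It is not the network used in this work: the encoder here is a toy amino acid frequency vector standing in for a learned neural network, the dot-product scoring and softmax cross-entropy loss are assumed as a common contrastive formulation, and the function names (`embed`, `reverse_homology_loss`) and example sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Toy stand-in encoder (assumption): amino acid frequency vector.
    A real model would use a trained neural network here."""
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reverse_homology_loss(query_set, candidates, target_index):
    """Score each candidate against the pooled homolog set and return
    (cross-entropy loss, index of the highest-scoring candidate).
    candidates holds one held-out homolog (at target_index) plus
    IDRs sampled at random from the proteome."""
    query = mean_vector([embed(s) for s in query_set])
    scores = [dot(query, embed(c)) for c in candidates]
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    loss = log_z - scores[target_index]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return loss, predicted

# Made-up sequences for illustration only: an RS-rich "homolog" family
# and three unrelated low-complexity sequences as random negatives.
query_set = ["SSRSRSRSPSS", "RSRSSSPRSRS", "SRSRSRSGSRS"]
candidates = ["DEDEEDDEEDE", "RSSRSRSPSRS", "QQQNQQPQQNQ", "GAGAGLGAGVG"]
loss, predicted = reverse_homology_loss(query_set, candidates, target_index=1)
print(f"loss={loss:.3f}, predicted candidate index={predicted}")
```

In training, a loss of this kind would be minimized over many homolog sets, so the encoder is rewarded for representing whatever sequence features distinguish true homologs from random IDRs.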