Machine Learning Department, NEC Labs America, Princeton, New Jersey, United States of America.
PLoS One. 2012;7(3):e32235. doi: 10.1371/journal.pone.0032235. Epub 2012 Mar 26.
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, predicting such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches has been developed for this purpose; however, most of them focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance relative to a single-task neural network approach, and that the resulting model achieves state-of-the-art performance.
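The architecture described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a sliding window over the sequence, a shared embedding and hidden layer, and one linear output head per task. The task names, label counts, layer sizes, and the way synthetic windows are produced (replacing the centre residue of a natural window) are all hypothetical choices made for illustration. Only the forward pass is shown and the weights are random, so the outputs are not meaningful predictions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = len(AMINO_ACIDS)  # index of the padding symbol used at sequence ends
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Hypothetical task names and label counts, chosen only for illustration.
TASKS = {"secondary_structure": 3, "solvent_accessibility": 2, "transmembrane": 3}


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


class MultitaskTagger:
    """Shared embedding and hidden layer, with one output head per task."""

    def __init__(self, window=7, embed_dim=8, hidden_dim=16, seed=0):
        # Assumption: window is odd, so every window has a centre residue.
        rng = np.random.default_rng(seed)
        self.window = window
        self.embed = rng.normal(0, 0.1, (len(AMINO_ACIDS) + 1, embed_dim))
        self.w_hidden = rng.normal(0, 0.1, (window * embed_dim, hidden_dim))
        self.b_hidden = np.zeros(hidden_dim)
        # Task-specific heads all read the same shared representation.
        self.heads = {name: rng.normal(0, 0.1, (hidden_dim, n))
                      for name, n in TASKS.items()}
        # Extra head for the natural-versus-synthetic auxiliary task.
        self.natural_head = rng.normal(0, 0.1, (hidden_dim, 1))

    def windows(self, sequence):
        """Return one padded window of residue indices per position."""
        half = self.window // 2
        idx = [PAD] * half + [AA_INDEX[aa] for aa in sequence] + [PAD] * half
        return np.array([idx[i:i + self.window] for i in range(len(sequence))])

    def shared(self, win_idx):
        """Embed each window and apply the shared hidden layer."""
        x = self.embed[win_idx].reshape(len(win_idx), -1)
        return np.tanh(x @ self.w_hidden + self.b_hidden)

    def predict(self, sequence):
        """Per-residue label probabilities for every task."""
        h = self.shared(self.windows(sequence))
        return {name: softmax(h @ w) for name, w in self.heads.items()}

    def natural_score(self, win_idx):
        """Score in (0, 1) that each window comes from a natural sequence."""
        h = self.shared(win_idx)
        return 1.0 / (1.0 + np.exp(-(h @ self.natural_head)[:, 0]))


def corrupt_center(win_idx, rng):
    """Build synthetic windows by changing the centre residue (assumed scheme)."""
    synthetic = win_idx.copy()
    center = win_idx.shape[1] // 2
    # A shift of 1..19 modulo 20 always yields a different amino acid.
    shift = rng.integers(1, len(AMINO_ACIDS), size=len(win_idx))
    synthetic[:, center] = (synthetic[:, center] + shift) % len(AMINO_ACIDS)
    return synthetic


model = MultitaskTagger()
probs = model.predict("MKTAYIAKQR")
natural = model.windows("MKTAYIAKQR")
synthetic = corrupt_center(natural, np.random.default_rng(1))
```

In a training loop, the supervised losses from each task head and the natural-versus-synthetic loss would all update the shared embedding and hidden layer, which is how information could flow between tasks in a design of this kind.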