Rao Roshan, Bhattacharya Nicholas, Thomas Neil, Duan Yan, Chen Xi, Canny John, Abbeel Pieter, Song Yun S
UC Berkeley.
covariant.ai.
Adv Neural Inf Process Syst. 2019 Dec;32:9689-9701.
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.