Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework.

Affiliations

Department of Computer Science, University College London, London WC1E 6BT, UK.

Biomedical Data Science Laboratory, The Francis Crick Institute, London NW1 1AT, UK.

Publication information

Bioinformatics. 2021 Nov 5;37(21):3744-3751. doi: 10.1093/bioinformatics/btab491.

DOI:10.1093/bioinformatics/btab491

PMID:34213528

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8570780/

Abstract

MOTIVATION

Over the past 50 years, our ability to model protein sequences with evolutionary information has progressed in leaps and bounds. However, even with the latest deep learning methods, the modelling of a critically important class of proteins, single orphan sequences, remains unsolved.

RESULTS

By taking a bioinformatics approach to semi-supervised machine learning, we develop Profile Augmentation of Single Sequences (PASS), a simple but powerful framework for building accurate single-sequence methods. To demonstrate the effectiveness of PASS we apply it to the mature field of secondary structure prediction. In doing so we develop S4PRED, the successor to the open-source PSIPRED-Single method, which achieves an unprecedented Q3 score of 75.3% on the standard CB513 test. PASS provides a blueprint for the development of a new generation of predictive methods, advancing our ability to model individual protein sequences.
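For readers unfamiliar with the metric, Q3 is the percentage of residues whose predicted three-state secondary structure label (helix H, strand E, coil C) matches the observed label. The sketch below illustrates that definition only. The function name `q3_score` and the toy label strings are hypothetical, and it assumes the score is computed over all residues of a single aligned pair of label strings; the exact evaluation protocol used for CB513 in the paper is not described in this abstract.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of positions where the three-state labels agree."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must be non-empty")
    # Count residues whose predicted state (H, E or C) equals the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical toy labels, not data from the paper.
observed = "CHHHHEEECC"
predicted = "CHHHCEEEEC"
print(f"Q3 = {q3_score(predicted, observed):.1f}%")  # prints "Q3 = 80.0%"
```

In this toy case 8 of 10 residues agree, so the score is 80.0%. A reported figure such as 75.3% would, under the same definition, mean that roughly three in four residues are assigned the correct state.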

AVAILABILITY AND IMPLEMENTATION

The S4PRED model is available as open source software on the PSIPRED GitHub repository (https://github.com/psipred/s4pred), along with documentation. It will also be provided as a part of the PSIPRED web service (http://bioinf.cs.ucl.ac.uk/psipred/).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
