Department of Computer Science, University College London, London WC1E 6BT, UK.
Biomedical Data Science Laboratory, The Francis Crick Institute, London NW1 1AT, UK.
Bioinformatics. 2021 Nov 5;37(21):3744-3751. doi: 10.1093/bioinformatics/btab491.
Over the past 50 years, our ability to model protein sequences with evolutionary information has progressed in leaps and bounds. However, even with the latest deep learning methods, the modelling of a critically important class of proteins, single orphan sequences, remains unsolved.
By taking a bioinformatics approach to semi-supervised machine learning, we develop Profile Augmentation of Single Sequences (PASS), a simple but powerful framework for building accurate single-sequence methods. To demonstrate the effectiveness of PASS, we apply it to the mature field of secondary structure prediction. In doing so, we develop S4PRED, the successor to the open-source PSIPRED-Single method, which achieves an unprecedented Q3 score of 75.3% on the standard CB513 test. PASS provides a blueprint for the development of a new generation of predictive methods, advancing our ability to model individual protein sequences.
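The Q3 score quoted above is conventionally the percentage of residues assigned the correct one of three secondary structure states (helix H, strand E, coil C). The abstract does not define it, so the following is only a minimal sketch of that conventional per-residue calculation, not the S4PRED evaluation code. The function name `q3_score` and the toy label strings are hypothetical, and the sketch assumes both strings are already reduced to three states and aligned residue by residue.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose 3-state label matches.

    Both arguments are strings over the alphabet H (helix), E (strand)
    and C (coil), one character per residue. This is an illustrative
    helper, not part of the S4PRED code base.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the predicted state equals the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical toy example: 15 residues, one of them mispredicted.
observed = "CCHHHHHCCEEEECC"
predicted = "CCHHHHCCCEEEECC"
print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

A benchmark-level figure such as the one reported on CB513 could be obtained either by pooling residues across all proteins or by averaging per-protein scores; which of the two the authors use is not stated in the abstract.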
The S4PRED model is available as open source software on the PSIPRED GitHub repository (https://github.com/psipred/s4pred), along with documentation. It will also be provided as a part of the PSIPRED web service (http://bioinf.cs.ucl.ac.uk/psipred/).
Supplementary data are available at Bioinformatics online.