Monzon Vivian, Haft Daniel H, Bateman Alex
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK.
National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA.
Bioinform Adv. 2022 Jan 9;2(1):vbab043. doi: 10.1093/bioadv/vbab043. eCollection 2022.
The release of AlphaFold 2.0 has revolutionized our ability to determine protein structures from sequences. The tool also opens up many unanticipated opportunities. In this article, we investigate the AntiFam resource, which contains 250 protein sequence families that we believe to be spurious protein translations. We would not expect proteins belonging to these families to fold into well-ordered globular structures. To test this hypothesis, we have attempted to computationally determine the structure of a representative sequence from all AntiFam 6.0 families.
Although the large majority of families showed no evidence of globular structure, we have identified one example for which a globular structure is predicted. Based on additional considerations, the sequences in this AntiFam entry indeed seem likely to be genuine proteins, and thus AlphaFold provides a useful quality control for the AntiFam database. Conversely, known spurious proteins offer a useful set of quality controls for AlphaFold. We have identified a trend that the mean structure prediction confidence score, pLDDT, is higher for shorter sequences. Of the 131 AntiFam representative sequences <100 amino acids in length, AlphaFold predicts a mean pLDDT of 80 or greater for six of them. Thus, particular care should be taken when applying AlphaFold to short protein sequences.
The AlphaFold predictions for representative sequences can be found at the following URL: https://drive.google.com/drive/folders/1u9OocRIAabGQn56GljoG1JTDAxjkY1ro.
Supplementary data are available at Bioinformatics Advances online.
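The screening criterion described above (a mean pLDDT of 80 or greater for a sequence of fewer than 100 amino acids) can be illustrated with a minimal sketch. It assumes the predictions are PDB-format files in which AlphaFold has written the per-residue pLDDT into the B-factor column, as AlphaFold does for its PDB output. The function names `mean_plddt` and `flag_short_confident` are hypothetical and chosen for illustration; this is not the authors' pipeline, which is not described here.

```python
def mean_plddt(pdb_text: str) -> float:
    """Return the mean per-residue pLDDT from PDB-format text.

    Assumption: the B-factor field (columns 61-66) holds the pLDDT and is
    identical for every atom of a residue, so the first atom seen for each
    residue is enough.
    """
    per_residue = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Chain ID is column 22; residue number plus insertion code are columns 23-27.
        key = (line[21], line[22:27])
        per_residue.setdefault(key, float(line[60:66]))
    if not per_residue:
        raise ValueError("no ATOM records found")
    return sum(per_residue.values()) / len(per_residue)


def flag_short_confident(length: int, mean: float,
                         max_length: int = 100, min_plddt: float = 80.0) -> bool:
    """True for a short sequence with a high mean pLDDT.

    The default thresholds mirror the abstract: fewer than 100 amino acids
    and a mean pLDDT of 80 or greater.
    """
    return length < max_length and mean >= min_plddt
```

A spurious sequence flagged by `flag_short_confident` would be a candidate for closer inspection, since a high mean pLDDT on a short sequence is not on its own evidence of a genuine protein.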