Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA.
Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA.
Structure. 2024 Aug 8;32(8):1260-1268.e3. doi: 10.1016/j.str.2024.04.010. Epub 2024 May 2.
Despite their lack of a rigid structure, intrinsically disordered regions (IDRs) in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate IDRs with high accuracy. In this study, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a compact protein language model. Unlike most popular tools, DR-BERT is pretrained on unannotated proteins and trained to predict IDRs without relying on explicit evolutionary or biophysical data. Despite this, DR-BERT demonstrates significant improvement over existing methods on the Critical Assessment of protein Intrinsic Disorder (CAID) evaluation dataset and outperforms competitors on two out of four test cases in the CAID 2 dataset, while maintaining competitiveness in the others. This performance is due to the information learned during pretraining and DR-BERT's ability to use contextual information.