Center for Computational Biology, University of California Berkeley, Berkeley, California, United States of America.
Department of Mathematics, University of California Berkeley, Berkeley, California, United States of America.
PLoS Comput Biol. 2024 Nov 5;20(11):e1012526. doi: 10.1371/journal.pcbi.1012526. eCollection 2024 Nov.
Protein domain annotation is typically done by predictive models, such as hidden Markov models (HMMs), trained on sequence motifs. However, sequence-based annotation methods are prone to error, particularly in calling domain boundaries and the motifs within them. These methods are limited by the lack of structural information available to the model. With the advent of deep learning-based protein structure prediction, existing sequence-based domain annotation methods can be improved by taking into account the geometry of protein structures. We develop dimensionality reduction methods to annotate repeat units of the leucine-rich repeat (LRR) solenoid domain. The methods are able to correct mistakes made by existing machine learning-based annotation tools and enable the automated detection of hairpin loops and structural anomalies in the solenoid. The methods are applied to 127 predicted structures of LRR-containing intracellular innate immune proteins in the model plant Arabidopsis thaliana and validated against a benchmark dataset of 172 manually annotated LRR domains.
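The abstract does not spell out how the dimensionality reduction works, so the following is only a generic sketch of one way geometry could delimit repeat units, not the authors' implementation. It assumes a roughly straight solenoid whose axis is the direction of greatest variance (real LRR domains are curved, so this would need a local axis estimate), uses a synthetic helix in place of predicted C-alpha coordinates, and all function names and parameter values are hypothetical. The idea: project the backbone onto the plane perpendicular to the axis with PCA, then count turns around the origin and start a new repeat unit at each completed turn.

```python
import numpy as np

def synthetic_solenoid(n_coils=20, residues_per_coil=24, radius=10.0, rise=4.8):
    """Toy C-alpha trace: an ideal helix standing in for a solenoid backbone.
    All parameter values are illustrative assumptions, not measured data."""
    n = n_coils * residues_per_coil
    theta = 2 * np.pi * np.arange(n) / residues_per_coil
    z = rise * np.arange(n) / residues_per_coil
    return np.column_stack([radius * np.cos(theta), radius * np.sin(theta), z])

def project_to_coil_plane(coords):
    """Reduce 3D coordinates to 2D by dropping the first principal component,
    which is assumed here to point along the solenoid axis."""
    centered = coords - coords.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[1:].T

def repeat_boundaries(projected):
    """Return indices of the first residue of each new coil, found where the
    cumulative winding around the origin passes a whole number of turns."""
    angles = np.unwrap(np.arctan2(projected[:, 1], projected[:, 0]))
    turns = (angles - angles[0]) / (2 * np.pi)
    if turns[-1] < 0:
        # The sign of the PCA basis is arbitrary, so fix the winding direction.
        turns = -turns
    # Force the turn count to be non-decreasing so that small backtracks
    # do not create spurious boundaries.
    turns = np.maximum.accumulate(turns)
    coil_index = np.floor(turns).astype(int)
    return np.flatnonzero(np.diff(coil_index) > 0) + 1

coords = synthetic_solenoid()
boundaries = repeat_boundaries(project_to_coil_plane(coords))
print(len(boundaries), "repeat boundaries found")
```

On the toy helix the boundaries fall about one coil (24 residues) apart. On a predicted structure, the same winding count could be compared with sequence-based repeat calls to flag disagreements, although loops and irregular coils would need more careful handling than this sketch provides.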