Long Xiyao, Jeliazkov Jeliazko R, Gray Jeffrey J
Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, United States of America.
Program in Molecular Biophysics, Johns Hopkins University, Baltimore, MD, United States of America.
PeerJ. 2019 Jan 11;7:e6179. doi: 10.7717/peerj.6179. eCollection 2019.
Antibodies are proteins generated by the adaptive immune system to recognize and counteract a plethora of pathogens through specific binding. This adaptive binding is mediated by structural diversity in the six complementarity-determining region (CDR) loops (H1, H2, H3, L1, L2 and L3), which also makes accurate structural modeling of CDRs challenging. Both homology and de novo modeling approaches have been used; to date, the former has achieved greater accuracy for the non-H3 loops. Homology modeling of non-H3 CDRs is more accurate because non-H3 CDR loops of the same length and type can be grouped into a few structural clusters. Most antibody-modeling suites use homology modeling for the non-H3 CDRs, differing only in the alignment algorithm and in whether and how they use structural clusters. While RosettaAntibody and SAbPred do not explicitly assign query CDR sequences to clusters, two other approaches, PIGS and Kotai Antibody Builder, use sequence-based rules to assign CDR sequences to clusters. Although manually curated sequence rules can identify better structural templates, their curation requires extensive literature search and human effort, so they lag behind the deposition of new antibody structures and are infrequently updated. In this study, we propose a machine learning approach (Gradient Boosting Machine [GBM]) to learn the structural clusters of non-H3 CDRs from sequence alone. Compared to manual sequence rule curation, the GBM method simplifies feature selection and can easily integrate new data. We compare the classification results of the GBM method to those of RosettaAntibody in a 3-repeat 10-fold cross-validation (CV) scheme on the cluster-annotated antibody database PyIgClassify, and we observe an improvement in the classification accuracy of the concerned loops from 84.5% ± 0.24% to 88.16% ± 0.056%. The GBM models reduce the errors in specific cluster membership misclassifications when the involved clusters have relatively abundant data. 
Based on the factors identified, we suggest methods that can enrich structural classes with sparse data to further improve prediction accuracy in future studies.
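
The 3-repeat 10-fold cross-validation scheme mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sequences and cluster names are invented toy data (not taken from PyIgClassify), the helper names are hypothetical, and a simple nearest-neighbor classifier based on Hamming distance stands in for the classifier under evaluation; it is neither the GBM nor the RosettaAntibody procedure. Reporting the mean and standard deviation of per-repeat accuracy is likewise one plausible reading of the "±" figures, not a confirmed detail of the study.

```python
import random
import statistics


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=3, seed=0):
    """Yield (repeat, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        for fold in range(n_splits):
            test_idx = order[fold::n_splits]  # every n_splits-th shuffled sample
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, train_idx, test_idx


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train_seqs, train_labels, query):
    """Stand-in classifier: return the label of the closest training sequence."""
    best = min(range(len(train_seqs)), key=lambda i: hamming(train_seqs[i], query))
    return train_labels[best]


def cross_validate(seqs, labels, n_splits=10, n_repeats=3, seed=0):
    """Return (mean, standard deviation) of the per-repeat accuracy."""
    correct = [0] * n_repeats
    folds = repeated_kfold_indices(len(seqs), n_splits, n_repeats, seed)
    for repeat, train_idx, test_idx in folds:
        train_seqs = [seqs[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        for i in test_idx:
            predicted = nearest_neighbor_predict(train_seqs, train_labels, seqs[i])
            if predicted == labels[i]:
                correct[repeat] += 1
    # Within one repeat every sample is held out exactly once.
    accuracies = [c / len(seqs) for c in correct]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def make_toy_data(n_per_cluster=30, seed=1):
    """Invented data: two hypothetical clusters, one random substitution each."""
    rng = random.Random(seed)
    motifs = {"toy-cluster-A": "GFTFSSYA", "toy-cluster-B": "GYSITSGY"}
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    seqs, labels = [], []
    for label, motif in motifs.items():
        for _ in range(n_per_cluster):
            residues = list(motif)
            residues[rng.randrange(len(residues))] = rng.choice(alphabet)
            seqs.append("".join(residues))
            labels.append(label)
    return seqs, labels


if __name__ == "__main__":
    toy_seqs, toy_labels = make_toy_data()
    mean_acc, sd_acc = cross_validate(toy_seqs, toy_labels)
    print(f"accuracy: {mean_acc:.4f} ± {sd_acc:.4f}")
```

To compare two classifiers under this scheme, one would run both on the same folds (same `seed`) and compare the resulting mean accuracies; swapping in a gradient boosting model would only require replacing `nearest_neighbor_predict`.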