Keresztes László, Szögi Evelin, Varga Bálint, Farkas Viktor, Perczel András, Grolmusz Vince
PIT Bioinformatics Group, Eötvös University, Budapest H-1117, Hungary.
MTA-ELTE Protein Modeling Research Group, Budapest H-1117, Hungary.
ACS Omega. 2022 Sep 27;7(40):35532-35537. doi: 10.1021/acsomega.2c02513. eCollection 2022 Oct 11.
Hexapeptides are widely used as a model system for studying the amyloid-forming properties of polypeptides, including proteins. Recently, large experimental databases with amyloidogenic labels have become publicly available. Using these data sets for training and testing, one may build artificial intelligence (AI)-based classifiers for predicting the amyloid state of peptides. In our previous work ( , , 500), we described the Support Vector Machine (SVM)-based Budapest Amyloid Predictor (https://pitgroup.org/bap). Here, we apply the Budapest Amyloid Predictor to discover numerous amyloidogenic and nonamyloidogenic hexapeptide patterns, with accuracy between 80% and 84%, as surprising and succinct novel rules for further understanding the amyloid state of peptides. For example, we show that when each position marked by "x" is independently replaced by any residue, the patterns CxFLWx, FxFLFx, and xxIVIV are predicted to be amyloidogenic, while PxDxxx, xxKxEx, and xxPQxx are predicted to be nonamyloidogenic. We note that each amyloidogenic pattern with two x's (e.g., CxFLWx) succinctly describes 20² = 400 hexapeptides, while each nonamyloidogenic pattern with four x's (e.g., PxDxxx) covers 20⁴ = 160 000 hexapeptides. We also examine substitutions at the "x" positions restricted to subclasses of the proteinogenic amino acid residues; for example, if "x" is substituted only with hydrophobic amino acids, then there exist patterns containing three x's, such as MxVVxx, that are predicted to be amyloidogenic. If the x positions may hold any hydrophobic amino acid except the "structure breaker" proline, then we obtain amyloid patterns with five x positions, for example, xxxFxx, each corresponding to 8⁵ = 32 768 hexapeptides. To our knowledge, no similar applications of artificial intelligence tools, and no similarly succinct amyloid patterns, have been described before the present work.
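The pattern counts quoted above follow from simple combinatorics, which the short Python sketch below makes concrete. It only enumerates and counts the hexapeptides a pattern stands for; it does not call or reproduce the Budapest Amyloid Predictor. The function names `count_matches` and `expand` are illustrative, not part of any published tool. The abstract implies that eight hydrophobic residues remain once proline is excluded (8⁵ = 32 768) but does not list them, so the set `HYDROPHOBIC_NO_PRO` is an assumption chosen only to have the right size.

```python
from itertools import product

# The 20 proteinogenic amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: an illustrative 8-residue hydrophobic set without proline.
# The abstract implies 8 residues but does not name them.
HYDROPHOBIC_NO_PRO = "AVLIMFWY"


def count_matches(pattern, alphabet=AMINO_ACIDS):
    """Number of peptides a pattern describes: |alphabet| ** (number of x's)."""
    return len(alphabet) ** pattern.count("x")


def expand(pattern, alphabet=AMINO_ACIDS):
    """Yield every peptide obtained by filling each 'x' from the alphabet."""
    # Fixed positions contribute a single choice; 'x' positions contribute
    # every letter of the alphabet.
    slots = [alphabet if ch == "x" else ch for ch in pattern]
    for combo in product(*slots):
        yield "".join(combo)


if __name__ == "__main__":
    print(count_matches("CxFLWx"))                      # two free positions
    print(count_matches("PxDxxx"))                      # four free positions
    print(count_matches("xxxFxx", HYDROPHOBIC_NO_PRO))  # five restricted positions
    print(sum(1 for _ in expand("CxFLWx")))             # explicit enumeration
```

Each peptide produced by `expand` could then be passed to a classifier of one's choice to check how often a pattern's prediction holds across its members.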