Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland.
J Chem Inf Model. 2018 Feb 26;58(2):472-479. doi: 10.1021/acs.jcim.7b00414. Epub 2018 Jan 22.
We present a generative long short-term memory (LSTM) recurrent neural network (RNN) for combinatorial de novo peptide design. RNN models capture patterns in sequential data and generate new data instances from the learned context. Amino acid sequences represent a suitable input for these machine-learning models. Generative models trained on peptide sequences could therefore facilitate the design of bespoke peptide libraries. We trained RNNs with LSTM units on pattern recognition of helical antimicrobial peptides and used the resulting model for de novo sequence generation. Of the generated sequences, 82% were predicted to be active antimicrobial peptides, compared to 65% of randomly sampled sequences with the same amino acid distribution as the training set. The generated sequences also lie closer to the training data than manually designed amphipathic helices. The results of this study showcase the ability of LSTM RNNs to construct new amino acid sequences within the applicability domain of the model and motivate their prospective application to peptide and protein design without the need for the exhaustive enumeration of sequence libraries.
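To make the approach concrete, the sketch below shows a character-level LSTM trained to predict the next residue of a peptide and then sampled autoregressively to generate new sequences. This is a minimal illustration of the general technique, not the authors' implementation: the network sizes, the two toy training peptides, and all names (`PeptideLSTM`, `encode`, `sample`) are assumptions made for this example.

```python
# Minimal sketch of a character-level LSTM peptide generator (PyTorch).
# Hyperparameters, the toy training set, and all names are illustrative
# assumptions, not the model described in the paper.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = ["<s>"] + list(AMINO_ACIDS) + ["</s>"]  # start/end tokens + 20 residues
TO_IDX = {c: i for i, c in enumerate(VOCAB)}

class PeptideLSTM(nn.Module):
    """Predicts the next residue from the running sequence context."""
    def __init__(self, vocab_size=len(VOCAB), embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

def encode(seq):
    """Map a peptide string to input/target index tensors for teacher forcing."""
    idx = [TO_IDX["<s>"]] + [TO_IDX[a] for a in seq] + [TO_IDX["</s>"]]
    return torch.tensor(idx[:-1]), torch.tensor(idx[1:])

def sample(model, max_len=40, temperature=1.0):
    """Autoregressively sample one de novo peptide from the trained model."""
    model.eval()
    x = torch.tensor([[TO_IDX["<s>"]]])
    state, residues = None, []
    with torch.no_grad():
        for _ in range(max_len):
            logits, state = model(x, state)
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            nxt = torch.multinomial(probs, 1).item()
            if nxt == TO_IDX["</s>"]:
                break
            residues.append(VOCAB[nxt])
            x = torch.tensor([[nxt]])
    return "".join(residues)

if __name__ == "__main__":
    # Toy stand-ins for a library of helical antimicrobial peptide sequences.
    train_peptides = ["GIGKFLKKAKKFGKAFVKILKK", "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK"]
    model = PeptideLSTM()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(200):          # next-residue prediction with teacher forcing
        for seq in train_peptides:
            inp, tgt = encode(seq)
            logits, _ = model(inp.unsqueeze(0))
            loss = loss_fn(logits.squeeze(0), tgt)
            optim.zero_grad()
            loss.backward()
            optim.step()
    print(sample(model))              # one generated sequence
```

In practice, a model of this kind would be trained on a curated set of helical antimicrobial peptides and the sampled sequences filtered with an activity predictor, mirroring the workflow summarized in the abstract.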