Yang Zhenze, Yorke Sarah K, Knowles Tuomas P J, Buehler Markus J
Laboratory for Atomistic and Molecular Mechanics, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Room 1-165, Cambridge, MA 02139, USA.
Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA.
Sci Adv. 2025 Mar 28;11(13):eadv1971. doi: 10.1126/sciadv.adv1971. Epub 2025 Mar 26.
Peptides are ubiquitous and important biomolecules that self-assemble into diverse structures. Although extensive research has explored how chemical composition and external conditions affect self-assembly, a systematic study that consolidates these data to uncover global rules is lacking. In this work, we curate a peptide assembly database by combining manual processing by human experts with large language model-assisted literature mining. The resulting database contains over 1000 experimental entries, each recording the peptide sequence, the experimental conditions, and the corresponding self-assembly phase. Machine learning models trained on these data classify the assembly phase with accuracy above 80%. Moreover, we fine-tune a GPT model for peptide literature mining with the developed dataset; it markedly outperforms the pretrained model in extracting information from academic publications. This workflow can make the search for potential self-assembling peptide candidates more efficient by guiding experimental work, while also deepening our understanding of the governing mechanisms.
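As an illustration of how a database entry of this kind (sequence plus experimental conditions) might be turned into numeric input for a phase classifier, the following is a minimal sketch only. The abstract does not state which features or models were used; the amino acid composition encoding, the function name `featurize`, and the choice of pH and concentration as the recorded conditions are all assumptions made for this example, and the example values are placeholders rather than entries from the database.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurize(sequence, ph, concentration_mg_ml):
    """Encode one entry as a flat feature vector (hypothetical scheme).

    Layout: 20 residue fractions, then sequence length, pH, concentration.
    pH and concentration are assumed stand-ins for "experimental conditions".
    """
    seq = sequence.upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if not seq or unknown:
        raise ValueError(f"empty or non-standard sequence: {sequence!r}")
    counts = Counter(seq)
    # Fraction of each residue type; fractions sum to 1 for a valid sequence.
    composition = [counts[aa] / len(seq) for aa in AMINO_ACIDS]
    return composition + [float(len(seq)), float(ph), float(concentration_mg_ml)]


# Placeholder entry: the condition values are illustrative, not measured data.
example_features = featurize("KLVFF", ph=7.0, concentration_mg_ml=1.0)
```

A vector like `example_features` could then be passed, together with the recorded assembly phase as the label, to any standard classifier. Composition features discard residue order, so a sequence-aware encoding may be preferable in practice.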