Learning the rules of peptide self-assembly through data mining with large language models.

Authors

Yang Zhenze, Yorke Sarah K, Knowles Tuomas P J, Buehler Markus J

Affiliations

Laboratory for Atomistic and Molecular Mechanics, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Room 1-165, Cambridge, MA 02139, USA.

Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA.

Publication information

Sci Adv. 2025 Mar 28;11(13):eadv1971. doi: 10.1126/sciadv.adv1971. Epub 2025 Mar 26.

Abstract

Peptides are ubiquitous and important biomolecules that self-assemble into diverse structures. Although extensive research has explored the effects of chemical composition and exterior conditions on self-assembly, a systematic study consolidating these data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and large language model-assisted literature mining. As a result, we collect over 1000 experimental data entries with information about peptide sequence, experimental conditions, and corresponding self-assembly phases. Using the data, machine learning models are developed, demonstrating excellent accuracy (>80%) in assembly phase classification. Moreover, we fine-tune a GPT model for peptide literature mining with the developed dataset, which markedly outperforms the pretrained model in extracting information from academic publications. This workflow can improve efficiency when exploring potential self-assembling peptide candidates, through guiding experimental work, while also deepening our understanding of the governing mechanisms.
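
To make the kind of database entry described above concrete, the following is a minimal sketch of how one record (peptide sequence, experimental conditions, assembly phase) might be turned into a numeric feature vector for a phase classifier. It is not the authors' pipeline: the field names, the choice of amino acid composition as the sequence encoding, and the example values are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter residue codes


@dataclass
class AssemblyEntry:
    """One hypothetical database row; every field name here is an assumption."""
    sequence: str               # peptide sequence in one-letter code
    ph: float                   # experimental condition (assumed field)
    concentration_mg_ml: float  # experimental condition (assumed field)
    phase: str                  # reported self-assembly phase label


def composition_features(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def featurize(entry: AssemblyEntry) -> list[float]:
    """Concatenate sequence composition with the numeric conditions."""
    return composition_features(entry.sequence) + [entry.ph, entry.concentration_mg_ml]


# Placeholder values for illustration only, not taken from the paper's dataset.
entry = AssemblyEntry(sequence="FF", ph=7.0, concentration_mg_ml=2.0, phase="nanotube")
features = featurize(entry)
```

A vector like `features`, paired with the `phase` label, is the sort of input a standard classifier could be trained on. The abstract does not state which sequence encoding or model family was used, so composition features stand in here only as a simple, common choice.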


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38e7/11939049/24994870f81b/sciadv.adv1971-f1.jpg
