Learning the rules of peptide self-assembly through data mining with large language models.

Authors

Yang Zhenze, Yorke Sarah K, Knowles Tuomas P J, Buehler Markus J

Affiliations

Laboratory for Atomistic and Molecular Mechanics, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Room 1-165, Cambridge, MA 02139, USA.

Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA.

Publication information

Sci Adv. 2025 Mar 28;11(13):eadv1971. doi: 10.1126/sciadv.adv1971. Epub 2025 Mar 26.

DOI: 10.1126/sciadv.adv1971
PMID: 40138415
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11939049/
Abstract

Peptides are ubiquitous and important biomolecules that self-assemble into diverse structures. Although extensive research has explored the effects of chemical composition and exterior conditions on self-assembly, a systematic study consolidating these data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and large language model-assisted literature mining. As a result, we collect over 1000 experimental data entries with information about peptide sequence, experimental conditions, and corresponding self-assembly phases. Using the data, machine learning models are developed, demonstrating excellent accuracy (>80%) in assembly phase classification. Moreover, we fine-tune a GPT model for peptide literature mining with the developed dataset, which markedly outperforms the pretrained model in extracting information from academic publications. This workflow can improve efficiency when exploring potential self-assembling peptide candidates, through guiding experimental work, while also deepening our understanding of the governing mechanisms.
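As a rough illustration of how a database entry of the kind described above (peptide sequence, experimental conditions, assembly phase) might be turned into numeric inputs for a phase classifier, here is a minimal Python sketch. The field names, the two condition variables, and the use of amino-acid composition as features are assumptions made for this example; the abstract does not specify the authors' schema or featurization.

```python
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


@dataclass
class AssemblyEntry:
    """One hypothetical database row; field names are illustrative only."""
    sequence: str               # one-letter peptide sequence
    ph: float                   # experimental condition (assumed field)
    concentration_mg_ml: float  # experimental condition (assumed field)
    phase: str                  # reported self-assembly phase label


def composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    if not seq or any(res not in AMINO_ACIDS for res in seq):
        raise ValueError(f"unsupported sequence: {sequence!r}")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def featurize(entry: AssemblyEntry) -> list[float]:
    """Concatenate sequence composition with the condition values."""
    return composition(entry.sequence) + [entry.ph, entry.concentration_mg_ml]
```

Calling `featurize(AssemblyEntry("FF", 7.0, 1.0, "fiber"))` yields a 22-element vector (20 composition fractions plus the two condition values); the values in that call are placeholders, not data from the paper. A vector of this shape could then be passed to any standard classifier.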

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38e7/11939049/24994870f81b/sciadv.adv1971-f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38e7/11939049/3faaa3862e5a/sciadv.adv1971-f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38e7/11939049/28b0a51f3c1e/sciadv.adv1971-f3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38e7/11939049/979fc29294b7/sciadv.adv1971-f4.jpg
