Chen Zhuojian, Sivaraman J
Department of Biological Sciences, National University of Singapore, 14 Science Drive 4, Singapore, 117543, Singapore.
Adv Sci (Weinh). 2025 Apr;12(15):e2413689. doi: 10.1002/advs.202413689. Epub 2025 Feb 20.
Obtaining pure and homogeneous protein samples is vital for protein biology studies, yet optimizing protein expression and purification methods can be time-consuming because factors such as expression conditions, buffer components, and fusion tags vary widely. With over 81,000 Protein Data Bank (PDB)-associated articles as of October 2024, manual extraction of the relevant methods is impractical. To streamline this process, an automated tool is developed that incorporates a large language model (LLM) to extract and classify key data from these articles. Information extraction accuracy is enhanced by a 2-step LLM procedure and a 3-step prompt. The key findings include: 1) Tris buffer is used in 49.2% of cases, followed by 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES) and phosphate buffers. 2) Polyhistidine tags dominate at 82.5%, followed by glutathione S-transferase (GST) and maltose-binding protein (MBP) tags. 3) E. coli expression is most commonly performed at 16-20 °C, with induction periods of 12-16 h (69.0%) favored over 3-6 h (14.3%). The statistical analyses highlight the correlation between protein properties and purification strategies. The tool is validated through two case studies: method bias in membrane protein purification, and crosslinker/detergent preferences in cryo-electron microscopy (cryo-EM) sample preparation. These findings provide a valuable resource for designing protein expression and purification experiments.
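The percentages reported above (for example, the share of Tris among buffers) come from classifying extracted method text into categories and tallying them. The abstract does not describe how the tool does this, so the following is only a minimal sketch under stated assumptions: the keyword map `BUFFER_KEYWORDS` and the functions `classify_buffer` and `category_shares` are hypothetical names introduced here for illustration, and simple keyword matching stands in for the LLM-based classification the article reports.

```python
from collections import Counter

# Hypothetical keyword map (an assumption for illustration only; the
# article's actual classification scheme is not given in the abstract).
BUFFER_KEYWORDS = {
    "Tris": ("tris",),
    "HEPES": ("hepes",),
    "Phosphate": ("phosphate", "pbs"),
}


def classify_buffer(text):
    """Return the first buffer category whose keyword occurs in the text."""
    lowered = text.lower()
    for label, keywords in BUFFER_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "Other"


def category_shares(snippets):
    """Tally categories over all snippets and return percentage shares."""
    counts = Counter(classify_buffer(s) for s in snippets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    # Made-up example snippets, not data from the article.
    examples = [
        "50 mM Tris-HCl pH 8.0, 300 mM NaCl",
        "20 mM HEPES pH 7.5, 150 mM NaCl",
        "25 mM Tris, 150 mM NaCl",
        "PBS pH 7.4",
    ]
    print(category_shares(examples))
```

Substring matching is deliberately naive: it would, for instance, count a Bis-Tris buffer as Tris, which is one reason a real pipeline would need a more careful classifier than this sketch.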