Hu Junjie, Wu Peng, Li Yulin, Li Qi, Wang Shiyi, Liu Yang, Qian Kun, Yang Guang
Bioengineering Department and Imperial-X, Imperial College London, London W12 7SL, UK.
School of Chemistry and Chemical Engineering, Ningxia University, Yinchuan 750014, China.
Pharmaceuticals (Basel). 2024 Sep 30;17(10):1300. doi: 10.3390/ph17101300.
As large language models continue to grow in scale and diversity, their substantial potential and the breadth of their applications are increasingly recognized. The rapid advancement of these models also has profound implications for the long-term design of stimulus-responsive materials for drug delivery. To optimize large models for processing extensive datasets and for comprehensive learning akin to a chemist's intuition, the integration of deeper chemical insight is imperative. Our study first compared the performance of BigBird, Gemma, GPT NeoX, and other architectures, implemented with Hugging Face's Transformers package, focusing specifically on the design of photoresponsive drug delivery molecules. Pre-training used the PubChem dataset and fine-tuning used QM7b; chemist instruction training was based on Direct Preference Optimization. Candidate molecules were filtered by Drug-Likeness, Synthetic Accessibility, and PageRank scores, and all computational chemistry simulations were performed with ORCA using Time-Dependent Density-Functional Theory. We gathered excitation-energy data through these computational chemistry tools and further investigated light-driven isomerization reactions as a critical mechanism in drug delivery. We also explored the effectiveness of incorporating human feedback into reinforcement learning to imbue large models with chemical intuition, enhancing their understanding of the role of -N=N- groups in the photoisomerization transitions of photoresponsive molecules. Building on structural knowledge and data, we implemented an efficient design process driven by large language model technology to obtain a candidate dataset of specific photoswitchable molecules. However, the lack of specialized domain datasets remains a challenge for maximizing model performance.
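The abstract lists PageRank among the scores used to filter candidate molecules. As an illustration only (not the authors' pipeline), the sketch below shows power-iteration PageRank over a toy adjacency list, where nodes would stand for candidate molecules and edges for similarity or co-occurrence links; the graph, function name, and parameters are all assumptions for demonstration.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Plain power-iteration PageRank over an adjacency-list graph.

    adj maps each node to a list of neighbours it links to. In a
    molecule-filtering setting the nodes could be candidate SMILES
    strings and the edges similarity links (purely illustrative here).
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}          # uniform start
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if not out:
                # dangling node: redistribute its rank uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
        rank = new
    return rank


# Toy graph of three "molecules": B receives links from both A and C,
# so it ends up with the highest rank.
scores = pagerank({"A": ["B"], "B": ["A", "C"], "C": ["B"]})
```

High-scoring nodes would then be kept alongside molecules passing the Drug-Likeness and Synthetic Accessibility thresholds; the combination of structural-graph and property-based filters is what narrows the model's generated candidates.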