Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alán Aspuru-Guzik, Alex Zhavoronkov
NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA.
Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada.
Chem Sci. 2024 May 8;15(22):8380-8389. doi: 10.1039/d4sc00966e. eCollection 2024 Jun 5.
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attribute prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employ instruction tuning, in which task-specific instructions are used to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both the base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
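The instruction-tuning setup described above pairs each training example with a natural-language prompt that names the task. A minimal sketch of how such prompts might be assembled is below; the task names and templates are illustrative placeholders, not the actual prompts used to fine-tune nach0:

```python
# Hypothetical instruction templates covering a few of the task types
# listed in the abstract. The real nach0 prompts may be worded differently.
TEMPLATES = {
    "attribute_prediction": "Predict the properties of the molecule {input}",
    "molecular_generation": "Generate a molecule that satisfies: {input}",
    "question_answering": "Answer the biomedical question: {input}",
}

def make_prompt(task: str, text: str) -> str:
    """Wrap a raw input (free text or a SMILES string) in its task instruction."""
    return TEMPLATES[task].format(input=text)

# Example: a SMILES string wrapped in an attribute-prediction instruction.
prompt = make_prompt("attribute_prediction", "CCO")
# → "Predict the properties of the molecule CCO"
```

In an encoder-decoder model such as nach0, the prompt produced this way would be fed to the encoder, with the decoder trained to emit the task's target output (a label, an answer, or a generated molecule string).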