Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alán Aspuru-Guzik, Alex Zhavoronkov
NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA.
Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada.
Chem Sci. 2024 May 8;15(22):8380-8389. doi: 10.1039/d4sc00966e. eCollection 2024 Jun 5.
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attribute prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employ instruction tuning, in which task-specific instructions are used to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both the base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
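The instruction-tuning setup described above pairs each training example with a natural-language prompt that names the task. A minimal sketch of how such prompts might be assembled is below; the task names and templates are illustrative placeholders, not the actual prompts used to fine-tune nach0:

```python
# Hypothetical instruction templates covering a few of the task types
# listed in the abstract. The real nach0 prompts may be worded differently.
TEMPLATES = {
    "attribute_prediction": "Predict the properties of the molecule {input}",
    "molecular_generation": "Generate a molecule that satisfies: {input}",
    "question_answering": "Answer the biomedical question: {input}",
}

def make_prompt(task: str, text: str) -> str:
    """Wrap a raw input (free text or a SMILES string) in its task instruction."""
    return TEMPLATES[task].format(input=text)

# Example: a SMILES string wrapped in an attribute-prediction instruction.
prompt = make_prompt("attribute_prediction", "CCO")
# → "Predict the properties of the molecule CCO"
```

In an encoder-decoder model such as nach0, the prompt produced this way would be fed to the encoder, with the decoder trained to emit the task's target output (a label, an answer, or a generated molecule string).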