Unbiasing Retrosynthesis Language Models with Disconnection Prompts.

Author information

Thakkar Amol, Vaucher Alain C, Byekwaso Andrea, Schwaller Philippe, Toniato Alessandra, Laino Teodoro

Affiliations

IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.

National Center for Competence in Research-Catalysis (NCCR-Catalysis), 8093 Zürich, Switzerland.

Publication information

ACS Cent Sci. 2023 Jul 5;9(7):1488-1498. doi: 10.1021/acscentsci.3c00372. eCollection 2023 Jul 26.

Abstract

Data-driven approaches to retrosynthesis are limited in user interaction, diversity of their predictions, and recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in natural language processing to the task of chemical language modeling. We show that by using a prompt describing the disconnection site in a molecule we can steer the model to propose a broader set of precursors, thereby overcoming training data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a disconnection prompt empowers chemists by giving them greater control over the disconnection predictions, which results in more diverse and creative recommendations. In addition, in place of a human-in-the-loop strategy, we propose a two-stage schema consisting of automatic identification of disconnection sites, followed by prediction of reactant sets, thereby achieving a considerable improvement in class diversity compared with the baseline. The approach is effective in mitigating prediction biases derived from training data. This provides a wider variety of usable building blocks and improves the end user's digital experience. We demonstrate its application to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is critical.
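The two-stage schema described in the abstract (automatic identification of disconnection sites, then prediction of reactant sets for each site) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the callable interfaces, the tagging of prompted atoms with the atom-map label `:1`, and the hard-coded toy lookups standing in for trained models are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions for this sketch, not the authors' API):
#   SiteModel:  product SMILES -> list of prompts, i.e. the product SMILES with
#               the atoms at a proposed disconnection site tagged.
#   RetroModel: prompt -> (precursor SMILES, reaction class label).
SiteModel = Callable[[str], List[str]]
RetroModel = Callable[[str], Tuple[str, str]]


def two_stage_retrosynthesis(
    product: str,
    propose_sites: SiteModel,
    predict_precursors: RetroModel,
) -> List[Dict[str, str]]:
    """Stage 1 proposes disconnection prompts; stage 2 predicts precursors for each."""
    results = []
    for prompt in propose_sites(product):
        precursors, reaction_class = predict_precursors(prompt)
        results.append(
            {
                "prompt": prompt,
                "precursors": precursors,
                "reaction_class": reaction_class,
            }
        )
    return results


def class_diversity(results: List[Dict[str, str]]) -> int:
    """Count distinct reaction classes among the predictions."""
    return len({r["reaction_class"] for r in results})


# Toy stand-ins for the two trained models. The entries are hand-written
# placeholders for N-methylacetamide, NOT outputs of any real model.
TOY_SITES = {
    "CC(=O)NC": ["C[C:1](=O)[NH:1]C", "CC(=O)[NH:1][CH3:1]"],
}
TOY_PRECURSORS = {
    "C[C:1](=O)[NH:1]C": ("CC(=O)Cl.CN", "N-acylation"),
    "CC(=O)[NH:1][CH3:1]": ("CC(=O)N.CI", "N-alkylation"),
}


def toy_site_model(product: str) -> List[str]:
    return TOY_SITES.get(product, [])


def toy_retro_model(prompt: str) -> Tuple[str, str]:
    return TOY_PRECURSORS[prompt]


predictions = two_stage_retrosynthesis("CC(=O)NC", toy_site_model, toy_retro_model)
```

In a human-in-the-loop setting, the same `predict_precursors` step could be called directly with a prompt written by a chemist, skipping the site-proposal stage. Counting distinct reaction classes, as `class_diversity` does, is one simple way to compare prompted and unprompted predictions; the paper's own metric may be defined differently.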


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b654/10390024/108e7e38946c/oc3c00372_0001.jpg
