Toniato Alessandra, Vaucher Alain C, Schwaller Philippe, Laino Teodoro
IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland.
Digit Discov. 2023 Feb 16;2(2):489-501. doi: 10.1039/d2dd00110a. eCollection 2023 Apr 11.
Over the past four years, several research groups have demonstrated that combining domain-specific language representations with recent NLP architectures can accelerate innovation in a wide range of scientific fields. Chemistry is a prime example. Among the various chemical challenges addressed with language models, retrosynthesis shows some of the most distinctive successes and limitations. Single-step retrosynthesis, the task of identifying reactions that decompose a complex molecule into simpler structures, can be cast as a translation problem in which a text-based representation of the target molecule is converted into a sequence of possible precursors. A common issue is a lack of diversity in the proposed disconnection strategies: the suggested precursors typically fall in the same reaction family, which limits the exploration of chemical space. We present a retrosynthesis Transformer model that increases the diversity of the predictions by prepending a classification token to the language representation of the target molecule. At inference, these prompt tokens allow us to steer the model towards different kinds of disconnection strategies. We show that the diversity of the predictions improves consistently, which enables recursive synthesis tools to circumvent dead ends and, consequently, to suggest synthesis pathways for more complex molecules.
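The prompting idea in the abstract, prepending a classification token to the text representation of the target molecule, can be illustrated with a minimal sketch. It assumes the target is given as a SMILES string and is tokenized with a regular expression of the kind commonly used for reaction language models. The token format `[class_i]`, the function names, and the number of classes are hypothetical placeholders chosen for illustration, not the authors' actual vocabulary or code, and the call to the Transformer itself is left out.

```python
import re

# Regular expression of the kind commonly used to split SMILES strings into
# tokens for sequence models (assumption: the target is provided as SMILES).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens and check that nothing was dropped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


def build_prompted_inputs(smiles: str, num_classes: int) -> list[str]:
    """Return one model input per class token for the same target molecule.

    Each input is the space-separated token sequence of the target with a
    hypothetical class token "[class_i]" prepended. Feeding every variant to
    a retrosynthesis model would yield one set of precursors per prompt.
    """
    target = " ".join(tokenize_smiles(smiles))
    return [f"[class_{i}] {target}" for i in range(num_classes)]


if __name__ == "__main__":
    # Acetyl chloride as a toy target; three classes is an arbitrary choice.
    for model_input in build_prompted_inputs("CC(=O)Cl", num_classes=3):
        print(model_input)
```

Generating one input per class token and collecting the model's predictions for each is what lets the prompt steer the output towards different disconnection strategies, rather than relying on beam search alone to produce variety.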