Rajan Kohulan, Zielesny Achim, Steinbeck Christoph
Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743, Jena, Germany.
Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany.
J Cheminform. 2024 Dec 27;16(1):146. doi: 10.1186/s13321-024-00941-x.
Naming chemical compounds systematically is a complex task governed by a set of rules established by the International Union of Pure and Applied Chemistry (IUPAC). These rules are universal and widely accepted by chemists worldwide, but their complexity makes it challenging for individuals to consistently apply them accurately. A translation method can be employed to address this challenge. Accurate translation of chemical compounds from SMILES notation into their corresponding IUPAC names is crucial, as it can significantly streamline the laborious process of naming chemical structures. Here, we present STOUT (SMILES-TO-IUPAC-name translator) V2, which addresses this challenge by introducing a transformer-based model that translates string representations of chemical structures into IUPAC names. Trained on a dataset of nearly 1 billion SMILES strings and their corresponding IUPAC names, STOUT V2 demonstrates exceptional accuracy in generating IUPAC names, even for complex chemical structures. The model's ability to capture intricate patterns and relationships within chemical structures enables it to generate precise and standardised IUPAC names. While established deterministic algorithms remain the gold standard for systematic chemical naming, our work, enabled by access to OpenEye's Lexichem software through an academic license, demonstrates the potential of neural approaches to complement existing tools in chemical nomenclature.

Scientific contribution: STOUT V2, built upon transformer-based models, is a significant advancement from our previous work. The web application enhances its accessibility and utility. By making the model and source code fully open and well-documented, we aim to promote unrestricted use and encourage further development.
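As a rough illustration of what "translating string representations" involves on the input side, a sequence-to-sequence model first has to split a SMILES string into tokens. The sketch below uses a regular expression of the kind commonly seen in the SMILES translation literature. It is an assumption for illustration only: the abstract does not describe STOUT V2's tokenisation, and the names `SMILES_TOKEN_PATTERN` and `tokenize_smiles` are hypothetical, not part of the STOUT code base.

```python
import re
from typing import List

# Illustrative SMILES token pattern (an assumption, not STOUT V2's tokenizer).
# Bracket atoms such as [NH4+] are kept whole, and two-letter halogens
# (Br, Cl) are matched before single-letter atoms.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens suitable for a seq2seq model."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Acetic acid
    print(tokenize_smiles("CC(=O)O"))  # ['C', 'C', '(', '=', 'O', ')', 'O']
```

The round-trip check (joining the tokens must reproduce the input) is a simple design choice that makes unsupported characters fail loudly rather than yielding a truncated token sequence.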