Guo Minghao, Shou Wan, Makatura Liane, Erps Timothy, Foshey Michael, Matusik Wojciech
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
CUHK Multimedia Lab, The Chinese University of Hong Kong, Sha Tin, Hong Kong.
Adv Sci (Weinh). 2022 Aug;9(23):e2101864. doi: 10.1002/advs.202101864. Epub 2022 Jun 9.
Polymers are widely studied materials with diverse properties and applications determined by molecular structures. It is essential to represent these structures clearly and explore the full space of achievable chemical designs. However, existing approaches cannot offer comprehensive design models for polymers because of their inherent scale and structural complexity. Here, a parametric, context-sensitive grammar designed specifically for polymers (PolyGrammar) is proposed. Using a symbolic hypergraph representation and 14 simple production rules, PolyGrammar can represent and generate all valid polyurethane structures. An algorithm is presented to translate any polyurethane structure from the popular Simplified Molecular-Input Line-Entry System (SMILES) string format into the PolyGrammar representation. The representational power of PolyGrammar is tested by translating a dataset of over 600 polyurethane samples collected from the literature. Furthermore, it is shown that PolyGrammar can be easily extended to other copolymers and homopolymers. By offering a complete, explicit representation scheme and an explainable generative model with validity guarantees, PolyGrammar takes an essential step toward a more comprehensive and practical system for polymer discovery and exploration. As the first bridge between formal languages and chemistry, PolyGrammar also serves as a critical blueprint to inform the design of similar grammars for other chemistries, including organic and inorganic molecules.
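As a rough illustration of how production rules can generate chains that are valid by construction, the following Python sketch applies three toy rewriting rules to build an alternating diisocyanate/polyol backbone. It is not PolyGrammar itself: the symbols "S", "X", "I", and "P", the three rules, and the functions derive and is_valid are assumptions made for this example, and the toy grammar is a context-free string grammar, whereas the abstract describes a context-sensitive hypergraph grammar with 14 rules.

```python
from typing import List

# Hypothetical alphabet for this sketch (not the paper's symbols):
#   "S" = start symbol, "X" = open chain end (non-terminal),
#   "I" = diisocyanate unit, "P" = polyol unit (terminals).


def derive(n_repeat: int) -> List[str]:
    """Apply the toy rules and return every intermediate string.

    n_repeat plays the role of a grammar parameter: the number of
    times the propagation rule is applied.
    """
    if n_repeat < 1:
        raise ValueError("n_repeat must be at least 1")

    current = "S"
    steps = [current]

    # Rule 1 (initiation): S -> X
    current = current.replace("S", "X", 1)
    steps.append(current)

    # Rule 2 (propagation): X -> I P X, applied n_repeat times
    for _ in range(n_repeat):
        current = current.replace("X", "IPX", 1)
        steps.append(current)

    # Rule 3 (termination): X -> (empty)
    current = current.replace("X", "", 1)
    steps.append(current)
    return steps


def is_valid(chain: str) -> bool:
    """Check that a chain strictly alternates I, P, I, P, ..."""
    if not chain or len(chain) % 2 != 0:
        return False
    return all(unit == "IP"[i % 2] for i, unit in enumerate(chain))


if __name__ == "__main__":
    for step in derive(2):
        print(step)
    # prints: S, X, IPX, IPIPX, IPIP (one per line)
```

Because the only way to add units is the propagation rule, every fully derived string passes is_valid; this is the sense in which a grammar can carry a validity guarantee, although the real system works on molecular hypergraphs rather than on strings.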