Lin Tzyy-Shyang, Rebello Nathan J, Lee Guang-He, Morris Melody A, Olsen Bradley D
Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
ACS Polym Au. 2022 Dec 14;2(6):486-500. doi: 10.1021/acspolymersau.2c00009. Epub 2022 Oct 14.
BigSMILES, a line notation for encapsulating the molecular structure of stochastic molecules such as polymers, was recently proposed as a compact and readable solution for writing macromolecules. While BigSMILES strings serve as useful identifiers for reconstructing the molecular connectivity for polymers, in general, BigSMILES allows the same polymer to be codified into multiple equally valid representations. Having a canonicalization scheme that eliminates the multiplicity would be very useful in reducing time-intensive tasks like structural comparison and molecular search into simple string-matching tasks. Motivated by this, in this work, two strategies for deriving canonical representations for linear polymers are proposed. In the first approach, a canonicalization scheme is proposed to standardize the expression of BigSMILES stochastic objects, thereby standardizing the expression of overall BigSMILES strings. In the second approach, an analogy between formal language theory and the molecular ensemble of polymer molecules is drawn. Linear polymers can be converted into regular languages, and the minimal deterministic finite automaton uniquely associated with each prescribed language is used as the basis for constructing the unique text identifier associated with each distinct polymer. Overall, this work presents algorithms to convert linear polymers into unique structure-based text identifiers. The derived identifiers can be readily applied in chemical information systems for polymers and other polymer informatics applications.
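The second approach can be illustrated with a small sketch. The Python below is not the authors' algorithm. It is a toy illustration of the underlying idea, under several assumptions: each alphabet symbol stands for a whole repeat unit (not a SMILES atom or bond), the polymer ensemble is already given as a deterministic finite automaton, and the function name `canonical_id` and the output string format are hypothetical. The automaton is minimized by Moore-style partition refinement, its states are renumbered by breadth-first search in sorted symbol order, and the result is serialized to text, so that two automata accepting the same set of chains produce the same string.

```python
from collections import deque


def canonical_id(symbols, transitions, start, accepting):
    """Return a canonical string for the language of a (partial) DFA.

    transitions maps (state, symbol) -> state; a missing entry rejects.
    State names must not be None (reserved for the implicit dead state),
    and symbols should not contain '-', '>', ';' or '|' (toy format).
    """
    symbols = sorted(symbols)

    # 1. Keep only the states reachable from the start state.
    reachable, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for a in symbols:
            t = transitions.get((s, a))
            if t is not None and t not in reachable:
                reachable.add(t)
                queue.append(t)

    # 2. Moore partition refinement: split blocks until nothing changes.
    states = reachable | {None}
    block = {s: int(s in accepting) for s in states}
    while True:
        sig = {
            s: (block[s],) + tuple(block[transitions.get((s, a))] for a in symbols)
            for s in states
        }
        ids = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        stable = len(ids) == len(set(block.values()))
        block = {s: ids[sig[s]] for s in states}
        if stable:
            break

    # 3. Transition table on blocks; the block holding None is the dead block.
    delta = {
        (block[s], a): block[transitions.get((s, a))]
        for s in reachable
        for a in symbols
    }
    dead = block[None]

    # 4. Canonical numbering: breadth-first from the start block.
    order, queue, edges = {block[start]: 0}, deque([block[start]]), []
    while queue:
        b = queue.popleft()
        for a in symbols:
            t = delta[(b, a)]
            if t == dead:
                continue
            if t not in order:
                order[t] = len(order)
                queue.append(t)
            edges.append(f"{order[b]}-{a}->{order[t]}")

    final = sorted({order[block[s]] for s in reachable if s in accepting})
    return ";".join(edges) + "|accept=" + ",".join(map(str, final))


# Two different automata for the same ensemble: one or more "AB" repeats.
compact = canonical_id(
    {"A", "B"}, {(0, "A"): 1, (1, "B"): 2, (2, "A"): 1}, 0, {2}
)
redundant = canonical_id(
    {"A", "B"},
    {("p", "A"): "q", ("q", "B"): "r", ("r", "A"): "s",
     ("s", "B"): "t", ("t", "A"): "s"},
    "p", {"r", "t"},
)
# A different ensemble: one or more "A" repeats.
homopolymer = canonical_id({"A"}, {(0, "A"): 1, (1, "A"): 1}, 0, {1})

print(compact == redundant, compact == homopolymer)
```

With identifiers of this kind, checking whether two descriptions denote the same ensemble reduces to string comparison, which is the motivation given in the abstract. A real implementation would also need to translate BigSMILES stochastic objects into automata, a step this sketch does not attempt.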