Rebello Nathan J, Lin Tzyy-Shyang, Nazeer Heeba, Olsen Bradley D
Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
Department of Computer Science, Wellesley College, 106 Central Street, Wellesley, Massachusetts 02481, United States.
J Chem Inf Model. 2023 Nov 13;63(21):6555-6568. doi: 10.1021/acs.jcim.3c00978. Epub 2023 Oct 24.
Molecular search is important in chemistry, biology, and informatics for identifying molecular structures within large data sets, improving knowledge discovery and innovation, and making chemical data FAIR (findable, accessible, interoperable, reusable). Search algorithms for polymers are significantly less developed than those for small molecules because polymer search relies on searching by polymer name, which can be challenging because polymer naming is overly broad (e.g., polyethylene), complicated for complex chemical structures, and often does not correspond to official IUPAC conventions. Chemical structure search in polymers is limited to substructures, such as monomers, without awareness of connectivity or topology. This work introduces a novel query language and graph traversal search algorithm for polymers that provides the first search method able to fully capture all of the chemical structures present in polymers. The BigSMARTS query language, an extension of the small-molecule SMARTS language, allows users to write queries that localize monomer and functional group searches to different parts of the polymer, such as the middle block of a triblock, the side chain of a graft, and the backbone of a repeat unit. The substructure search algorithm is based on the traversal of graph representations of the generating functions for the stochastic graphs of polymers. Operationally, the algorithm first identifies cycles representing the monomers, then the end groups, and finally performs a depth-first search to match entire subgraphs. To validate the algorithm, hundreds of queries were searched against hundreds of target chemistries and topologies from the literature, for a total of approximately 440,000 query-target pairs. This tool provides a detailed algorithm that can be implemented in search engines to provide search results with full matching of the monomer connectivity and polymer topology.
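The final step named in the abstract, a depth-first search that matches an entire subgraph, can be illustrated with a minimal sketch. This is not the authors' implementation and does not parse BigSMARTS: the toy graph encoding (a repeat unit drawn as a cycle with one end group attached), the atom labels, and the names `make_graph`, `dfs_order`, and `dfs_match` are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Adjacency = Dict[int, Set[int]]
Labels = Dict[int, str]


def make_graph(edges: Iterable[Tuple[int, int]]) -> Adjacency:
    """Build an undirected adjacency map from a list of edges."""
    adj: Adjacency = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def dfs_order(adj: Adjacency, start: int) -> List[int]:
    """Visit the nodes of a connected graph in depth-first order."""
    seen: Set[int] = set()
    order: List[int] = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(sorted(adj[node] - seen, reverse=True))
    return order


def dfs_match(q_adj: Adjacency, q_lab: Labels,
              t_adj: Adjacency, t_lab: Labels) -> Optional[Dict[int, int]]:
    """Return one injective query->target node mapping, or None.

    Backtracking search: query nodes are placed in depth-first order, and a
    placement is kept only if labels agree and every already-mapped query
    neighbor lands on a target neighbor.
    """
    order = dfs_order(q_adj, min(q_adj))
    mapping: Dict[int, int] = {}
    used: Set[int] = set()

    def extend(i: int) -> bool:
        if i == len(order):
            return True
        q = order[i]
        for t in sorted(t_adj):
            if t in used or t_lab[t] != q_lab[q]:
                continue
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                used.add(t)
                if extend(i + 1):
                    return True
                del mapping[q]  # backtrack
                used.discard(t)
        return False

    return dict(mapping) if extend(0) else None


# Hypothetical target: a C-C-O repeat unit closed into a cycle (0-1-2-0),
# with a Cl end group attached to atom 0.
target_adj = make_graph([(0, 1), (1, 2), (2, 0), (3, 0)])
target_lab = {0: "C", 1: "C", 2: "O", 3: "Cl"}

# Query C-O-C: an ether linkage that spans the junction between repeat units.
ether_adj = make_graph([(0, 1), (1, 2)])
ether_lab = {0: "C", 1: "O", 2: "C"}

# Query C-C-C: three carbons in a row, absent from this repeat unit.
propyl_adj = make_graph([(0, 1), (1, 2)])
propyl_lab = {0: "C", 1: "C", 2: "C"}

ether_hit = dfs_match(ether_adj, ether_lab, target_adj, target_lab)
propyl_hit = dfs_match(propyl_adj, propyl_lab, target_adj, target_lab)
```

In this toy encoding, the closing edge of the cycle is what lets the C-O-C query match across the boundary between two repeat units. Because the mapping is injective, the sketch cannot match a query longer than a single repeat unit (for example O-C-C-O), which a full polymer search would need to handle.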