Quirós Miguel, Gražulis Saulius, Girdzijauskaitė Saulė, Merkys Andrius, Vaitkus Antanas
Departamento de Química Inorgánica, Universidad de Granada, 18071, Granada, Spain.
Institute of Biotechnology, Vilnius University, Saulėtekio al. 7, 10257, Vilnius, Lithuania.
J Cheminform. 2018 May 18;10(1):23. doi: 10.1186/s13321-018-0279-6.
Computer descriptions of chemical molecular connectivity are necessary for searching chemical databases and for predicting chemical properties from molecular structure. This article reports ongoing work to describe, in SMILES format, the chemical connectivity of the entries in the Crystallography Open Database (COD). Like the COD itself, this collection of SMILES is publicly available on an open-access basis for chemical (substructure) search or any other purpose. The conventions followed for representing compounds that do not fit valence bond theory are outlined for the most frequently encountered cases. The procedure for deriving SMILES from the CIF files starts by checking whether the atoms in the asymmetric unit form a chemically acceptable image of the compound. When they do not (a molecule lying on a symmetry element, disorder, polymeric species, etc.), the previously published cif_molecule program is used to obtain such an image in many cases. The Open Babel program package is then applied to generate SMILES strings from the CIF files (either those taken directly from the COD or those produced by cif_molecule, when applicable). The results are then checked and, where necessary, fixed by a human editor in a computer-aided task that at present still consumes a great deal of human time. Although the procedure still needs to be improved to make it more automatic (and hence faster), it has already yielded more than 160,000 curated chemical structures. The purpose of this article is to announce the existence of this work to the chemical community and to promote the use of its results.
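The automated part of the procedure described above (choose the right CIF, then pass it to Open Babel) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual tooling: the function names, the `asymmetric_unit_ok` flag and the file paths are hypothetical, the acceptability check and the cif_molecule run are assumed to have happened elsewhere, and the human curation step is not modelled. The only external call is the Open Babel command line `obabel -icif <file> -osmi`, which reads a CIF and writes SMILES to standard output.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def pick_input_cif(cod_cif: Path, asymmetric_unit_ok: bool,
                   cif_molecule_output: Optional[Path] = None) -> Path:
    """Choose which CIF is handed to Open Babel (hypothetical helper).

    If the asymmetric unit is already a chemically acceptable image of the
    compound, the COD file is used directly; otherwise a file previously
    produced by cif_molecule is required.
    """
    if asymmetric_unit_ok:
        return cod_cif
    if cif_molecule_output is None:
        raise ValueError("asymmetric unit not acceptable: cif_molecule output needed")
    return cif_molecule_output


def obabel_command(cif_path: Path) -> List[str]:
    """Build the Open Babel command that converts a CIF to SMILES on stdout."""
    return ["obabel", "-icif", str(cif_path), "-osmi"]


def cif_to_smiles(cif_path: Path) -> Optional[str]:
    """Run Open Babel and return the first SMILES string, or None.

    Returns None when the obabel executable is not installed or when it
    produces no output. The result is a candidate only and would still
    need checking by a human editor.
    """
    if shutil.which("obabel") is None:
        return None
    result = subprocess.run(obabel_command(cif_path),
                            capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        fields = line.split()
        if fields:
            return fields[0]  # SMILES comes first, followed by an optional title
    return None
```

A caller would typically chain the two steps, for example `cif_to_smiles(pick_input_cif(Path("entry.cif"), asymmetric_unit_ok=True))`, where `entry.cif` stands for any file downloaded from the COD.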