Bruno Ian J, Cole Jason C, Kessler Magnus, Luo Jie, Motherwell W D Sam, Purkis Lucy H, Smith Barry R, Taylor Robin, Cooper Richard I, Harris Stephanie E, Orpen A Guy
Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, England.
J Chem Inf Comput Sci. 2004 Nov-Dec;44(6):2133-44. doi: 10.1021/ci049780b.
The crystallographically determined bond length, valence angle, and torsion angle information in the Cambridge Structural Database (CSD) has many uses. However, accessing it by means of conventional substructure searching requires nontrivial user intervention. In consequence, these valuable data have been underutilized and have not been directly accessible to client applications. The situation has been remedied by development of a new program (Mogul) for automated retrieval of molecular geometry data from the CSD. The program uses a system of keys to encode the chemical environments of fragments (bonds, valence angles, and acyclic torsions) from CSD structures. Fragments with identical keys are deemed to be chemically identical and are grouped together, and the distribution of the appropriate geometrical parameter (bond length, valence angle, or torsion angle) is computed and stored. Use of a search tree indexed on key values, together with a novel similarity calculation, then enables the distribution matching any given query fragment (or the distributions most closely matching, if an adequate exact match is unavailable) to be found easily and with no user intervention. Validation experiments indicate that, with rare exceptions, search results afford precise and unbiased estimates of molecular geometrical preferences. Such estimates may be used, for example, to validate the geometries of libraries of modeled molecules or of newly determined crystal structures or to assist structure solution from low-resolution (e.g. powder diffraction) X-ray data.
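The key-based grouping and fallback matching described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four-field keys, the bond-length values, the `min_hits` threshold, and the position-matching `similarity` function are hypothetical stand-ins, and a plain dictionary takes the place of the search tree. It is not Mogul's actual key scheme, data, or similarity calculation.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical fragment records: (key, bond length in angstroms).
# Each key is a toy description of a bond's chemical environment:
# (element 1, element 2, bond type, substituent). Real keys are far richer.
FRAGMENTS = [
    (("C", "C", "single", "methyl"), 1.53),
    (("C", "C", "single", "methyl"), 1.52),
    (("C", "C", "single", "methyl"), 1.54),
    (("C", "C", "double", "methyl"), 1.33),
    (("C", "C", "double", "methyl"), 1.34),
    (("C", "C", "double", "methyl"), 1.32),
    (("C", "O", "single", "methyl"), 1.43),
    (("C", "O", "single", "methyl"), 1.42),
    (("C", "O", "single", "methyl"), 1.44),
    (("C", "C", "single", "propyl"), 1.525),
]


def build_index(fragments):
    """Group geometric values by key; identical keys share one distribution."""
    index = defaultdict(list)
    for key, value in fragments:
        index[key].append(value)
    return dict(index)


def similarity(key_a, key_b):
    """Fraction of key positions that agree (an illustrative stand-in only)."""
    matches = sum(a == b for a, b in zip(key_a, key_b))
    return matches / max(len(key_a), len(key_b))


def lookup(index, query_key, min_hits=3):
    """Return (matched key, values, is_exact) for a query fragment key.

    An exact match is used when it holds at least min_hits observations;
    otherwise the most similar adequately populated key is returned.
    """
    exact = index.get(query_key, [])
    if len(exact) >= min_hits:
        return query_key, exact, True
    candidates = [
        (similarity(query_key, key), key)
        for key, values in index.items()
        if len(values) >= min_hits
    ]
    if not candidates:
        return None, [], False
    _, best_key = max(candidates, key=lambda pair: pair[0])
    return best_key, index[best_key], False


def summarize(values):
    """Mean and sample standard deviation of a distribution."""
    return mean(values), stdev(values)


index = build_index(FRAGMENTS)

# Exact match: this key has three observations.
print(lookup(index, ("C", "C", "double", "methyl")))

# Sparse key: only one observation, so the closest populated key is used.
matched_key, matched_values, is_exact = lookup(index, ("C", "C", "single", "propyl"))
print(matched_key, is_exact, summarize(matched_values))
```

In this sketch the sparsely populated query falls back to the key that shares the most fields with it. The abstract does not specify how Mogul decides that an exact match is inadequate or how its similarity measure is defined, so the threshold and scoring here should be read only as placeholders.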