Akhondi Saber A, Muresan Sorel, Williams Antony J, Kors Jan A
Department of Medical Informatics, Erasmus University Medical Centre, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands.
Food Control Department, Banat University of Agricultural Sciences and Veterinary Medicine, Calea Aradului 119, 300645 Timisoara, Romania.
J Cheminform. 2015 Nov 16;7:54. doi: 10.1186/s13321-015-0102-6. eCollection 2015.
A wide range of chemical compound databases are currently available for pharmaceutical research. To retrieve compound information, including structures, researchers can query these chemical databases using non-systematic identifiers. These are source-dependent identifiers (e.g., brand names, generic names), which are usually assigned to the compound at the point of registration. The correctness of non-systematic identifiers (i.e., whether an identifier matches the associated structure) can only be assessed manually, which is cumbersome, but it is possible to automatically check their ambiguity (i.e., whether an identifier matches more than one structure). In this study we have quantified the ambiguity of non-systematic identifiers within and between eight widely used chemical databases. We also studied the effect of chemical structure standardization on reducing the ambiguity of non-systematic identifiers.
The ambiguity of non-systematic identifiers within databases varied from 0.1 to 15.2 % (median 2.5 %). Standardization reduced the ambiguity only to a small extent for most databases. A wide range of ambiguity existed for non-systematic identifiers that are shared between databases (17.7-60.2 %, median of 40.3 %). Removing stereochemistry information provided the largest reduction in ambiguity across databases (median reduction 13.7 percentage points).
Ambiguity of non-systematic identifiers within chemical databases is generally low, but ambiguity of non-systematic identifiers that are shared between databases is high. Chemical structure standardization reduces the ambiguity only to a limited extent. Our findings can help to improve database integration, curation, and maintenance.
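The ambiguity measure described above (the share of identifiers that match more than one structure) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `ambiguity` and `shared_ambiguity`, the toy records, and the `SKELETON/STEREO` structure keys are all hypothetical, and the `drop_stereo` helper only stands in for a real structure standardization step. Identifiers are compared by exact string match, and a shared identifier is counted as ambiguous when it maps to more than one structure in the pooled records of both databases; both choices are assumptions.

```python
from collections import defaultdict


def ambiguity(records, normalize=lambda s: s):
    """Fraction of identifiers that map to more than one distinct structure.

    records   -- iterable of (identifier, structure_key) pairs
    normalize -- hypothetical standardization applied to each structure key
    """
    structures = defaultdict(set)
    for identifier, structure in records:
        structures[identifier].add(normalize(structure))
    if not structures:
        return 0.0
    ambiguous = sum(1 for keys in structures.values() if len(keys) > 1)
    return ambiguous / len(structures)


def shared_ambiguity(db_a, db_b, normalize=lambda s: s):
    """Ambiguity restricted to identifiers that occur in both databases."""
    shared = {i for i, _ in db_a} & {i for i, _ in db_b}
    pooled = [(i, s) for i, s in list(db_a) + list(db_b) if i in shared]
    return ambiguity(pooled, normalize)


def drop_stereo(structure_key):
    """Toy stand-in for removing stereochemistry: keep only the skeleton part."""
    return structure_key.split("/")[0]


# Toy data; structure keys are made-up labels of the form SKELETON/STEREO.
db_a = [
    ("aspirin", "ASA/none"),
    ("glucose", "HEX/D"),
    ("vitamin e", "TOC/RRR"),
    ("vitamin e", "TOC/mix"),
]
db_b = [
    ("aspirin", "ASA/none"),
    ("glucose", "HEX/L"),
    ("caffeine", "CAF/none"),
]

within_a = ambiguity(db_a)                              # 1 of 3 identifiers
between = shared_ambiguity(db_a, db_b)                  # 1 of 2 shared identifiers
between_flat = shared_ambiguity(db_a, db_b, drop_stereo)  # 0 of 2 after dropping stereo
print(within_a, between, between_flat)
```

In this toy case, dropping the stereo part of the key removes all ambiguity; on real databases the abstract reports only a partial reduction, so the example shows the mechanics of the measure and not its expected magnitude.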