Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts 02215, USA.
National Magnetic Resonance Facility at Madison and BioMagResBank, Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA.
Sci Data. 2019 Feb 19;6:190023. doi: 10.1038/sdata.2019.23.
Identification of discrepant data in aggregated databases is a key step in data curation and remediation. We have applied the ALATIS approach, which is based on the International Chemical Identifier (InChI) model, to the full PubChem Compound database to generate unique and reproducible compound and atom identifiers for all entries for which three-dimensional structures were available. This exercise also identified entries with discrepancies between their structures and their chemical formulas or InChI strings. The use of unique compound identifiers and atom nomenclature should support more rigorous links between small-molecule databases, including those containing atom-specific information of the type available from crystallography and spectroscopy. The comprehensive results from this analysis are publicly available through our webserver [http://alatis.nmrfam.wisc.edu/].
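As an illustration of the kind of formula/InChI discrepancy check described in the abstract, the following minimal sketch compares a declared molecular formula with the formula layer of a standard InChI string. It is not the ALATIS implementation: the function names (`parse_formula`, `inchi_formula`, `is_discrepant`) are hypothetical, only the formula layer is inspected, and charges, isotopes, and stereochemistry are ignored.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return element counts for a formula such as 'C2H6O' or 'H2O.2ClH'.

    Assumption: components are separated by '.', and a leading integer on a
    component is a multiplier. Charge symbols are ignored.
    """
    counts = Counter()
    for part in formula.split("."):
        match = re.match(r"(\d*)(.*)", part)
        multiplier = int(match.group(1) or 1)
        for element, number in _TOKEN.findall(match.group(2)):
            counts[element] += multiplier * int(number or 1)
    return counts


def inchi_formula(inchi):
    """Extract the formula layer, i.e. the first layer after the version prefix."""
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    return layers[1]


def is_discrepant(declared_formula, inchi):
    """True if the declared formula and the InChI formula layer disagree."""
    return parse_formula(declared_formula) != parse_formula(inchi_formula(inchi))


# Ethanol: the declared formula matches the InChI formula layer.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(is_discrepant("C2H6O", ethanol))
print(is_discrepant("C2H4O", ethanol))
```

Comparing element counts rather than raw strings keeps the check independent of element ordering, so a formula written in a different order than the InChI formula layer is not flagged as a mismatch.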