The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets.

Authors

Karapetyan Karen, Batchelor Colin, Sharpe David, Tkachenko Valery, Williams Antony J

Affiliations

Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587 USA.

Thomas Graham House, Science Park, 290 Milton Road, Cambridge, UK.

Publication information

J Cheminform. 2015 Jun 19;7:30. doi: 10.1186/s13321-015-0072-8. eCollection 2015.

Abstract

BACKGROUND

Hundreds of online databases currently host millions of chemical compounds and associated data. Because many cheminformatics software tools can be used to produce these data, because the various cheminformatics platforms differ in subtle ways, and because software users are often inexperienced, chemical structure representations online can suffer from a myriad of issues. To help facilitate validation and standardization of chemical structure datasets from various sources, we have delivered a freely available internet-based platform to the community for processing chemical compound datasets.

RESULTS

The Chemical Validation and Standardization Platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or may require manual review. Each identified issue is assigned one of three levels of severity (Information, Warning, or Error) so that users can readily see which subsets of their data need to be browsed and reviewed. The validation process covers atoms and bonds (e.g., flagging query atoms and bonds), valences, and stereochemistry. The standard form of submission for collections of data, the SDF file, allows the user to map data fields to predefined CVSP fields so that associated SMILES and InChIs can be cross-validated against the connection tables contained within the SDF file. The platform has been applied to the analysis of a large number of datasets prepared for deposition to our ChemSpider database and to the preparation of data for the Open PHACTS project. In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database used by the community, and of the ChEMBL 17 dataset. The CVSP website is located at http://cvsp.chemspider.com/.
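The rule-driven validation with three severity levels described above can be illustrated with a minimal sketch. This is not CVSP's implementation: the `Molecule` type, the `MAX_VALENCE` and `QUERY_ATOMS` tables, and the three rule functions are all hypothetical stand-ins chosen for illustration, and the valence check ignores formal charges and implicit hydrogens.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass
class Molecule:
    """Toy connection table (hypothetical, not the CVSP data model)."""
    atoms: list  # element symbols, e.g. ["C", "O"]
    bonds: list  # (atom_index_a, atom_index_b, bond_order) tuples


# Hypothetical lookup tables; a real rule set would be far larger
# and would account for formal charges.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}
QUERY_ATOMS = {"A", "Q", "*", "R"}


def explicit_valences(mol):
    """Sum the bond orders attached to each atom."""
    totals = [0] * len(mol.atoms)
    for a, b, order in mol.bonds:
        totals[a] += order
        totals[b] += order
    return totals


def check_valence(mol):
    """Report atoms whose explicit valence exceeds the table limit."""
    messages = []
    for i, (symbol, valence) in enumerate(zip(mol.atoms, explicit_valences(mol))):
        limit = MAX_VALENCE.get(symbol)
        if limit is not None and valence > limit:
            messages.append(f"atom {i} ({symbol}) has valence {valence}, limit {limit}")
    return messages


def check_query_atoms(mol):
    """Report query atom symbols, which need manual review."""
    return [f"query atom '{s}' at index {i}"
            for i, s in enumerate(mol.atoms) if s in QUERY_ATOMS]


def check_no_bonds(mol):
    """Note records that contain no bonds at all."""
    return [] if mol.bonds else ["record contains no bonds"]


# Each rule is paired with the severity assigned to the issues it finds.
RULES = [
    (Severity.ERROR, check_valence),
    (Severity.WARNING, check_query_atoms),
    (Severity.INFORMATION, check_no_bonds),
]


def validate(mol):
    """Run every rule and collect (severity, message) pairs."""
    issues = []
    for severity, rule in RULES:
        for message in rule(mol):
            issues.append((severity, message))
    return issues
```

Keeping the rules in a plain list of (severity, function) pairs mirrors the idea of pre-defined or user-defined rule sets: a user could add or remove entries without changing `validate` itself.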

CONCLUSION

A platform for the validation and standardization of chemical structure representations in various formats has been developed and made available to the community to assist and encourage the processing of chemical structure files, producing more homogeneous compound representations for exchange between online databases. While the CVSP platform is designed so that the rules used for processing the data are flexible, we have produced a recommended rule set based on our own experience with large datasets such as DrugBank, ChEMBL, and datasets from ChemSpider.
