
The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets.

Authors

Karapetyan Karen, Batchelor Colin, Sharpe David, Tkachenko Valery, Williams Antony J

Affiliations

Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587 USA.

Thomas Graham House, Science Park, 290 Milton Road, Cambridge, UK.

Publication information

J Cheminform. 2015 Jun 19;7:30. doi: 10.1186/s13321-015-0072-8. eCollection 2015.

DOI:10.1186/s13321-015-0072-8
PMID:26155308
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4494041/
Abstract

BACKGROUND

There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, as well as the naivety of the software users, there are a myriad of issues that can exist with chemical structure representations online. In order to help facilitate validation and standardization of chemical structure datasets from various sources we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets.

RESULTS

The chemical validation and standardization platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or potentially require manual review. Each identified issue is assigned one of three levels of severity (Information, Warning, or Error) so that the user can see which subsets of their data need to be browsed and reviewed. The validation process covers atoms and bonds (e.g., flagging query atoms and bonds), valences, and stereo. The standard form of submission of collections of data, the SDF file, allows the user to map the data fields to predefined CVSP fields for the purpose of cross-validating associated SMILES and InChIs with the connection tables contained within the SDF file. This platform has been applied to the analysis of a large number of data sets prepared for deposition to our ChemSpider database and in preparation of data for the Open PHACTS project. In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and of the ChEMBL 17 data set. The CVSP web site is located at http://cvsp.chemspider.com/.
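The rule-based validation described above can be illustrated with a minimal sketch. This is not CVSP's code or its rule set: the `Severity`, `Issue`, and `validate` names, the small valence table, the list of query-atom symbols, and the severity assigned to each check are all assumptions made for illustration. A molecule is reduced to a list of element symbols plus a list of bonds, which stands in for the connection table of an SDF record.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """The three severity levels named in the abstract."""
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    severity: Severity
    message: str


# Hypothetical maximum valences for a few neutral elements (illustration only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}

# Symbols assumed here to denote query atoms (illustration only).
QUERY_ATOMS = {"A", "Q", "*"}


def validate(atoms, bonds):
    """Return a list of Issue objects for a toy connection table.

    atoms: list of element symbols, e.g. ["C", "O"]
    bonds: list of (atom_index, atom_index, bond_order) tuples
    """
    # Sum the bond orders around each atom.
    bond_order_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_order_sum[i] += order
        bond_order_sum[j] += order

    issues = []
    for idx, symbol in enumerate(atoms):
        if symbol in QUERY_ATOMS:
            issues.append(Issue(Severity.WARNING,
                                f"atom {idx}: query atom '{symbol}'"))
        elif symbol not in MAX_VALENCE:
            issues.append(Issue(Severity.INFORMATION,
                                f"atom {idx}: '{symbol}' not in valence table"))
        elif bond_order_sum[idx] > MAX_VALENCE[symbol]:
            issues.append(Issue(Severity.ERROR,
                                f"atom {idx}: {symbol} has valence "
                                f"{bond_order_sum[idx]}"))
    return issues


def worst_severity(issues):
    """Return the highest severity among the issues, or None if there are none."""
    return max((issue.severity for issue in issues),
               key=lambda s: s.value, default=None)
```

For example, `validate(["C", "H", "H", "H", "H", "H"], [(0, k, 1) for k in range(1, 6)])` reports the five-bonded carbon as an Error, while a record containing the symbol `"A"` is reported as a Warning. Grouping records by `worst_severity` is one way to give the user the review subsets the abstract mentions.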

CONCLUSION

A platform for the validation and standardization of chemical structure representations of various formats has been developed and made available to the community to assist and encourage the processing of chemical structure files to produce more homogeneous compound representations for exchange and interchange between online databases. While the CVSP platform is designed with flexibility inherent to the rules that can be used for processing the data, we have produced a recommended rule set based on our own experiences with large data sets such as DrugBank, ChEMBL, and data sets from ChemSpider.
