An open source chemical structure curation pipeline using RDKit.

Authors

Bento A Patrícia, Hersey Anne, Félix Eloy, Landrum Greg, Gaulton Anna, Atkinson Francis, Bellis Louisa J, De Veij Marleen, Leach Andrew R

Affiliations

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK.

T5 Informatics GmbH, Basel, 4055, Switzerland.

Publication information

J Cheminform. 2020 Sep 1;12(1):51. doi: 10.1186/s13321-020-00456-1.

DOI:10.1186/s13321-020-00456-1
PMID:33431044
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7458899/
Abstract

BACKGROUND

The ChEMBL database is one of several public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. To maintain the quality of the final database, and to make it easy to compare and integrate data on the same compound from different sources, the chemical structures in the database must be appropriately standardised.

RESULTS

A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker, which tests the validity of chemical structures and flags any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component, which removes any salts and solvents from the compound to create its parent. The pipeline has been applied to the latest version of the ChEMBL database, as well as to uncurated datasets from other sources, to test the robustness of the process and to identify common issues in database molecular structures.
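The three-stage flow described above can be illustrated with a toy sketch. This is not the authors' code and it does not use RDKit: plain string operations on SMILES stand in for real molecule parsing, and every name here (`check`, `standardize`, `get_parent`, `curate`, `SALTS_AND_SOLVENTS`) is hypothetical. Stopping at the first serious problem is also an assumption of the sketch, not something the abstract states.

```python
# Toy sketch only: string handling stands in for real molecule parsing.

# Hypothetical, deliberately tiny set of salt and solvent fragments (SMILES).
SALTS_AND_SOLVENTS = {"Cl", "Br", "O", "[Na+]", "[K+]", "[Cl-]"}


def check(smiles):
    """Checker: return a list of messages describing serious problems."""
    problems = []
    if not smiles.strip():
        problems.append("empty structure")
    if smiles.count("(") != smiles.count(")"):
        problems.append("unbalanced branch parentheses")
    if smiles.count("[") != smiles.count("]"):
        problems.append("unbalanced atom brackets")
    return problems


def standardize(smiles):
    """Standardizer: apply fixed formatting rules (trim, drop empty fragments, sort)."""
    fragments = [f for f in smiles.strip().split(".") if f]
    return ".".join(sorted(fragments))


def get_parent(smiles):
    """GetParent: drop fragments that appear in the salt/solvent set."""
    kept = [f for f in smiles.split(".") if f not in SALTS_AND_SOLVENTS]
    # If every fragment is a salt or solvent, keep the input unchanged.
    return ".".join(kept) if kept else smiles


def curate(smiles):
    """Run check, then standardize, then get_parent; stop if the check fails."""
    problems = check(smiles)
    if problems:
        return {"problems": problems, "standard": None, "parent": None}
    standard = standardize(smiles)
    return {"problems": [], "standard": standard, "parent": get_parent(standard)}


result = curate("Cl.CN")  # methylamine hydrochloride, written salt-first
print(result)
```

Keeping the stages as separate functions mirrors the component split in the abstract: the standardised form (`CN.Cl` here) still contains the counter-ion, while the parent (`CN`) does not. A real implementation would work on parsed molecules and would need far richer rules for validity, normalisation, and salt stripping.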

CONCLUSION

All components of the structure pipeline have been made freely available for other researchers to use and adapt. The code is available in a GitHub repository and can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify the compounds with the most serious issues so that they can be prioritised for manual curation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39d0/7460742/56f59102b686/13321_2020_456_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39d0/7460742/b2027a8fe848/13321_2020_456_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39d0/7460742/c8adb588bf40/13321_2020_456_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39d0/7460742/bb1522e51c9d/13321_2020_456_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39d0/7460742/e3af649c9cef/13321_2020_456_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/39d0/7460742/e78fc388e045/13321_2020_456_Fig6_HTML.jpg
