Dolciami Daniela, Villasclaras-Fernandez Eloy, Kannas Christos, Meniconi Mirco, Al-Lazikani Bissan, Antolin Albert A
Department of Data Science, The Institute of Cancer Research, London, SM2 5NG, UK.
Cancer Research UK Cancer Therapeutics Unit, The Institute of Cancer Research, London, SM2 5NG, UK.
J Cheminform. 2022 May 28;14(1):28. doi: 10.1186/s13321-022-00606-7.
Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it brings a wealth of knowledge about compounds together in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates), highlighting the need to organise related compounds in hierarchies that alert the user to bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline that is part of the canSAR public knowledgebase. canSARchem builds on the previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.
We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps that register the compounds and create the compound hierarchy: (1) structure checker, (2) standardization, (3) generation of canonical tautomers and representative structures, (4) salt stripping, and (5) generation of abstract structures to build the compound hierarchy. Unlike ChEMBL's RDKit pipeline, and similarly to PubChem's OpenEye pipeline, we canonicalize compounds before deriving the parent structure. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping compounds in hierarchies on bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared with the majority of bioactive compounds, which demonstrates the importance of this step.
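The order of the five steps can be illustrated with a minimal Python sketch. This is not the canSARchem KNIME workflow: every function name here is hypothetical, steps 2 and 3 are left as identity placeholders, and the other steps are naive string operations on SMILES (a real pipeline would use a cheminformatics toolkit). The sketch only shows that canonicalization runs before salt stripping and that compounds sharing an abstract structure end up in one hierarchy group.

```python
from collections import defaultdict


def check_structure(smiles):
    """Step 1 (toy check): reject empty input and unbalanced brackets."""
    s = smiles.strip()
    if not s:
        return None
    if s.count("(") != s.count(")") or s.count("[") != s.count("]"):
        return None
    return s


def standardize(smiles):
    """Step 2 (placeholder): a real pipeline would normalise groups and charges."""
    return smiles


def canonical_tautomer(smiles):
    """Step 3 (placeholder): a real pipeline would pick one canonical tautomer."""
    return smiles


def strip_salts(smiles):
    """Step 4 (toy): keep the longest dot-separated fragment as the parent."""
    return max(smiles.split("."), key=len)


def abstract_structure(smiles):
    """Step 5 (toy): drop SMILES stereo markers so stereoisomers coincide."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


def build_hierarchy(records):
    """Run the five steps in order; group parent structures by abstract structure."""
    hierarchy = defaultdict(set)
    rejected = []
    for raw in records:
        checked = check_structure(raw)
        if checked is None:
            rejected.append(raw)
            continue
        # Canonicalization (step 3) happens before salt stripping (step 4).
        parent = strip_salts(canonical_tautomer(standardize(checked)))
        hierarchy[abstract_structure(parent)].add(parent)
    return dict(hierarchy), rejected


if __name__ == "__main__":
    groups, bad = build_hierarchy(
        ["C[C@H](N)C(=O)O.Cl", "C[C@@H](N)C(=O)O", "C(C"]
    )
    print(groups)
    print(bad)
```

In this toy run, the hydrochloride salt of one alanine enantiomer and the free form of the other fall into the same group, while the malformed input is rejected. The string-based abstraction is deliberately crude: it would not, for example, match a SMILES written without stereo annotation.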
We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline.