Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

Affiliation

Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.

Publication information

J Cheminform. 2010 Jun 30;2(1):5. doi: 10.1186/1758-2946-2-5.

DOI:10.1186/1758-2946-2-5

PMID:20591161

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2909924/

Abstract

BACKGROUND

QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaboration and re-use of data.

RESULTS

We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services.
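To make the idea of a reproducible dataset setup concrete, the following is a minimal sketch in Python. It is not the QSAR-ML schema: the element and attribute names (`dataset`, `structure`, `descriptor`, `ontologyId`, `provider`, `version`) and the example identifiers are hypothetical, chosen only to illustrate recording, for each descriptor, which ontology entry it refers to and which versioned implementation computed it.

```python
import xml.etree.ElementTree as ET


def build_dataset(structures, descriptors):
    """Serialize a dataset setup to an XML string.

    The layout is illustrative only and is NOT the actual QSAR-ML schema.
    """
    root = ET.Element("dataset")

    # Record the chemical structures that make up the dataset.
    structs = ET.SubElement(root, "structures")
    for s in structures:
        ET.SubElement(structs, "structure", id=s["id"], inchi=s["inchi"])

    # Record each descriptor by ontology identifier, together with the
    # software implementation (provider) and its version.
    descs = ET.SubElement(root, "descriptors")
    for d in descriptors:
        ET.SubElement(
            descs,
            "descriptor",
            ontologyId=d["ontology_id"],
            provider=d["provider"],
            version=d["version"],
        )
    return ET.tostring(root, encoding="unicode")


def read_setup(xml_text):
    """Return (ontology id, provider, version) for every descriptor."""
    root = ET.fromstring(xml_text)
    return [
        (d.get("ontologyId"), d.get("provider"), d.get("version"))
        for d in root.iter("descriptor")
    ]


# Hypothetical example: one structure (ethanol) and the same descriptor
# computed by two different versioned implementations.
example_structures = [
    {"id": "mol1", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
]
example_descriptors = [
    {"ontology_id": "example:logP", "provider": "toolkit-a", "version": "1.0.0"},
    {"ontology_id": "example:logP", "provider": "toolkit-b", "version": "2.3.1"},
]
example_xml = build_dataset(example_structures, example_descriptors)
```

Reading the file back with `read_setup(example_xml)` recovers the descriptor, implementation, and version triples, which is the information a collaborator would need to recompute the same dataset.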

CONCLUSIONS

Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, and it also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/afa7/2909924/a183951bdca2/1758-2946-2-5-1.jpg
