Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden.
J Cheminform. 2010 Jun 30;2(1):5. doi: 10.1186/1758-2946-2-5.
QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, which makes it virtually impossible to reproduce and validate analyses and drastically constrains collaboration and re-use of data.
We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services.
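As a rough illustration of what recording a dataset setup involves, the sketch below serializes a list of structures and descriptor references to XML. It is a minimal sketch under stated assumptions, not the QSAR-ML schema: the element and attribute names (`dataset`, `descriptor`, `ontologyId`, `provider`, `version`), the `ex:xlogp` identifier, and the provider names are all hypothetical and chosen only for this example. The point it shows is that each descriptor entry pairs an ontology identifier with a specific, versioned implementation, so two implementations of the same descriptor remain distinguishable.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptorRef:
    """One descriptor as used in a dataset (hypothetical model, not the QSAR-ML schema)."""
    ontology_id: str  # identifier of the descriptor in an ontology (assumed format)
    provider: str     # software that implements the descriptor
    version: str      # version of that implementation


def build_dataset(structures, descriptors):
    """Build an XML tree recording the structures and descriptor implementations used."""
    root = ET.Element("dataset")
    structures_el = ET.SubElement(root, "structures")
    for resource in structures:
        ET.SubElement(structures_el, "structure", {"resource": resource})
    descriptors_el = ET.SubElement(root, "descriptors")
    for d in descriptors:
        ET.SubElement(
            descriptors_el,
            "descriptor",
            {"ontologyId": d.ontology_id, "provider": d.provider, "version": d.version},
        )
    return root


def to_xml(root):
    """Serialize the tree to a Unicode string."""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Same descriptor, two different implementations and versions (made-up values).
    dataset = build_dataset(
        ["mol1.mol", "mol2.mol"],
        [
            DescriptorRef("ex:xlogp", "providerA", "1.0"),
            DescriptorRef("ex:xlogp", "providerB", "2.3"),
        ],
    )
    print(to_xml(dataset))
```

Keeping the ontology identifier separate from the provider and version is the design choice that matters here: the identifier says what is computed, while the provider and version say how it was computed.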
Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problem of specifying which software components were used and in which versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, and also allows for analyzing the effect that descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible to the community.