An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling.

Author information

Mansouri K, Grulke C M, Richard A M, Judson R S, Williams A J

Affiliations

a Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN, USA.

b US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA.

Publication information

SAR QSAR Environ Res. 2016 Nov;27(11):939-965. doi: 10.1080/1062936X.2016.1253611.

DOI:10.1080/1062936X.2016.1253611

PMID:27885862

Abstract

The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
