Grulke Christopher M, Williams Antony J, Thillanadarajah Inthirany, Richard Ann M
National Center for Computational Toxicology, Office of Research & Development, US Environmental Protection Agency, Mail Drop D143-02, Research Triangle Park, NC 27711, USA.
Senior Environmental Employment Program, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA.
Comput Toxicol. 2019 Nov 1;12. doi: 10.1016/j.comtox.2019.100096.
The US Environmental Protection Agency's (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database, launched publicly in 2004, currently contains more than 875,000 substances spanning hundreds of lists of interest to EPA and environmental researchers. From its inception, DSSTox has focused curation efforts on resolving chemical identifier errors and conflicts in the public domain, with the goal of assigning accurate chemical structures to data and lists of importance to the environmental research and regulatory community. Accurate structure-data associations, in turn, are necessary inputs to structure-based predictive models supporting hazard and risk assessments. In 2014, the legacy, manually curated DSSTox_V1 content was migrated to a MySQL data model, with modern cheminformatics tools supporting both manual and automated curation processes to increase efficiency. This was followed by sequential auto-loads of filtered portions of three public datasets: EPA's Substance Registry Services (SRS), the National Library of Medicine's ChemID, and PubChem. This process was constrained by a key requirement of uniquely mapped identifiers (i.e., CAS RN, name and structure) for each substance, rejecting content where any two identifiers were in conflict either within or across datasets. This rejected content highlighted the degree of conflicting, inaccurate substance-structure ID mappings in the public domain, ranging from 12% (within EPA SRS) to 49% (across ChemID and PubChem). Substances successfully added to DSSTox from each auto-load were assigned to one of five categories, conveying curator confidence in each dataset. This process enabled a significant expansion of DSSTox content to provide better coverage of the chemical landscape of interest to environmental scientists, while retaining focus on the accuracy of substance-structure-data associations.
Currently, DSSTox serves as the core foundation of EPA's CompTox Chemicals Dashboard [https://comptox.epa.gov/dashboard], which provides public access to DSSTox content in support of a broad range of modeling and research activities within EPA and, increasingly, across the field of computational toxicology.
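The unique-mapping requirement described in the abstract (each CAS RN, name and structure must map to exactly one of each other identifier) can be illustrated with a small sketch. This is not the authors' MySQL loading pipeline: the `Record` type, the function name `split_by_identifier_consistency`, the use of SMILES strings to stand in for structures, and the toy records are all assumptions made for illustration, and the sketch simply pools all records rather than distinguishing within-dataset from cross-dataset conflicts.

```python
from collections import defaultdict
from itertools import permutations
from typing import NamedTuple


class Record(NamedTuple):
    """One substance record (hypothetical layout, not the DSSTox schema)."""
    casrn: str
    name: str
    structure: str  # assumed here to be a SMILES string


def split_by_identifier_consistency(records):
    """Split records into (accepted, rejected).

    A record is rejected if any of its identifiers is paired, anywhere in
    the pooled input, with more than one value of another identifier type.
    """
    # For every ordered pair of identifier types (a, b), collect the set of
    # b-values observed alongside each a-value.
    partners = {pair: defaultdict(set) for pair in permutations(Record._fields, 2)}
    for rec in records:
        for a, b in partners:
            partners[(a, b)][getattr(rec, a)].add(getattr(rec, b))

    accepted, rejected = [], []
    for rec in records:
        conflicted = any(
            len(partners[(a, b)][getattr(rec, a)]) > 1 for a, b in partners
        )
        (rejected if conflicted else accepted).append(rec)
    return accepted, rejected


if __name__ == "__main__":
    # Toy input: the third record deliberately reuses benzene's CAS RN with a
    # different name and structure, so both records sharing that CAS RN conflict.
    toy = [
        Record("50-00-0", "formaldehyde", "C=O"),
        Record("71-43-2", "benzene", "c1ccccc1"),
        Record("71-43-2", "toluene", "Cc1ccccc1"),
        Record("64-17-5", "ethanol", "CCO"),
    ]
    ok, bad = split_by_identifier_consistency(toy)
    print("accepted:", [r.name for r in ok])
    print("rejected:", [r.name for r in bad])
    print("rejection rate:", len(bad) / len(toy))
```

Note that the sketch rejects every record involved in a conflict rather than trying to decide which mapping is correct; resolving such conflicts is the kind of work the abstract attributes to curation.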