PubChem synonym filtering process using crowdsourcing.

Authors

Kim Sunghwan, Yu Bo, Li Qingliang, Bolton Evan E

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

Publication information

J Cheminform. 2024 Jun 16;16(1):69. doi: 10.1186/s13321-024-00868-3.

Abstract

PubChem ( https://pubchem.ncbi.nlm.nih.gov ) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a "chemical synonym"). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem's crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts. 
Based on the results of this study, Strategy I was implemented in PubChem's filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.
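
The crowd-voting idea behind Strategy I (one vote per depositor, 60% consistency threshold) can be illustrated with a minimal sketch. This is not PubChem's implementation: the function name `filter_synonyms`, the `(depositor, synonym, structure_id)` record layout, and the rule that a depositor's single vote goes to the structure it most often pairs with a synonym (abstaining on a tie) are all assumptions made for illustration. The sketch also ignores the six levels of chemical equivalency and treats structure identifiers as opaque values.

```python
from collections import Counter, defaultdict


def filter_synonyms(records, threshold=0.6):
    """Return {synonym: structure_id} for synonyms that reach consensus.

    records: iterable of (depositor, synonym, structure_id) tuples.
    threshold: minimum share of depositor votes the leading structure needs.
    """
    # Step 1 (intra-depositor): count how often each depositor pairs a
    # synonym with each structure.
    per_depositor = defaultdict(Counter)
    for depositor, synonym, structure in records:
        per_depositor[(synonym, depositor)][structure] += 1

    # Collapse each depositor to a single vote per synonym.
    # Assumption: the vote goes to the depositor's most frequent structure,
    # and the depositor abstains when its top two structures are tied.
    votes = defaultdict(Counter)
    for (synonym, _depositor), counts in per_depositor.items():
        ranked = counts.most_common(2)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue
        votes[synonym][ranked[0][0]] += 1

    # Step 2 (inter-depositor): keep a synonym only if the leading structure
    # holds at least `threshold` of all depositor votes for that synonym.
    accepted = {}
    for synonym, counts in votes.items():
        structure, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= threshold:
            accepted[synonym] = structure
    return accepted


# Hypothetical records: depositor names and structure IDs are made up.
records = [
    ("dep_A", "name-1", 101), ("dep_B", "name-1", 101), ("dep_C", "name-1", 202),
    ("dep_A", "name-2", 301), ("dep_B", "name-2", 302),
    ("dep_D", "name-3", 401), ("dep_D", "name-3", 401), ("dep_D", "name-3", 402),
    ("dep_E", "name-3", 401),
]
result_60 = filter_synonyms(records, threshold=0.6)
result_70 = filter_synonyms(records, threshold=0.7)
```

In this toy data, "name-1" has two of three depositor votes for one structure, so it passes a 60% threshold but not a 70% one, while "name-2" is split evenly and is filtered out under both. A synonym supplied by a single depositor would trivially reach 100% agreement in this sketch, which mirrors the limitation noted above: consistency is not the same as correctness.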
