Wiegers Thomas C, Davis Allan Peter, Wiegers Jolene, Sciaky Daniela, Barkalow Fern, Wyatt Brent, Strong Melissa, McMorran Roy, Abrar Sakib, Mattingly Carolyn J
Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA.
Center for Human Health and the Environment, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA.
Database (Oxford). 2025 Feb 21;2025. doi: 10.1093/database/baaf013.
The Comparative Toxicogenomics Database (CTD) is a manually curated knowledge- and discovery-base that seeks to advance understanding of the relationship between environmental exposures and human health. CTD's manual curation process extracts molecular relationships between chemicals/drugs, genes/proteins, phenotypes, diseases, anatomical terms, and species from the biomedical literature. These relationships are organized in a highly systematic way so that they are not only informative but also scientifically computable, enabling inferential hypotheses to be formed to address gaps in understanding. Integral to CTD's functionality is the use of structured, hierarchical ontologies and controlled vocabularies to describe these molecular relationships. Normalizing text (i.e. translating raw text from the literature into these controlled vocabularies) can be a time-consuming process for biocurators. To facilitate normalization and improve the efficiency with which our scientists curate the literature, CTD evaluated PubTator 3.0 and integrated it into the curation process. PubTator 3.0 is a state-of-the-art, AI-powered resource that extracts and normalizes from the literature many of the key biomedical concepts CTD curates. Here, we describe CTD's long-standing history with Natural Language Processing (NLP), how this history helped form our objectives for NLP integration, the evaluation of PubTator against our objectives, and the integration of PubTator into CTD's curation workflow. Database URL: https://ctdbase.org.