Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database.

Affiliation

Department of Biology, North Carolina State University, Raleigh, North Carolina, United States of America.

Publication information

PLoS One. 2013 Apr 17;8(4):e58201. doi: 10.1371/journal.pone.0058201. Print 2013.

DOI:10.1371/journal.pone.0058201

PMID:23613709

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3629079/

Abstract

The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from this corpus and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. We also demonstrate how fully incorporating the text mining-based DRS into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
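The abstract names mean average precision as one of the measures used to judge how well the DRS ranks articles. The following is a minimal sketch of that standard metric, not the authors' implementation: articles are sorted by descending score, and precision is averaged over the ranks at which relevant articles appear. The function names, the toy scores, and the grouping of rankings by metal are all assumptions made for illustration; the abstract does not say how the DRS is computed or how the rankings were grouped.

```python
from typing import List, Tuple


def rank_by_drs(articles: List[Tuple[str, float, bool]]) -> List[bool]:
    """Sort (article_id, drs, is_relevant) tuples by descending DRS
    and return the relevance flags in ranked order."""
    ordered = sorted(articles, key=lambda a: a[1], reverse=True)
    return [is_relevant for _, _, is_relevant in ordered]


def average_precision(ranked_relevance: List[bool]) -> float:
    """Average of precision@k over every rank k that holds a relevant article."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """Mean of the average precision over several ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical toy data: (article id, DRS, curator judged it relevant).
cadmium = [("a1", 90.0, True), ("a2", 10.0, False), ("a3", 55.0, True)]
lead = [("b1", 20.0, True), ("b2", 80.0, False)]

rankings = [rank_by_drs(cadmium), rank_by_drs(lead)]
print(mean_average_precision(rankings))
```

In the toy data, the cadmium ranking places both relevant articles first, while the lead ranking places an irrelevant article above the relevant one, so the second list pulls the mean down. A scoring scheme that ranks well pushes this value toward 1.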

Similar articles

Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database.

PLoS One. 2013 Apr 17;8(4):e58201. doi: 10.1371/journal.pone.0058201. Print 2013.

Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database.

Database (Oxford). 2012 Dec 6;2012:bas051. doi: 10.1093/database/bas051. Print 2012.

Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD).

BMC Bioinformatics. 2009 Oct 8;10:326. doi: 10.1186/1471-2105-10-326.

A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions.

Database (Oxford). 2013 Nov 28;2013:bat080. doi: 10.1093/database/bat080. Print 2013.

The curation paradigm and application tool used for manual curation of the scientific literature at the Comparative Toxicogenomics Database.

Database (Oxford). 2011 Sep 20;2011:bar034. doi: 10.1093/database/bar034. Print 2011.

Web services-based text-mining demonstrates broad impacts for interoperability and process simplification.

Database (Oxford). 2014 Jun 10;2014. doi: 10.1093/database/bau050. Print 2014.

Collaborative biocuration--text-mining development task for document prioritization for curation.

Database (Oxford). 2012 Nov 22;2012:bas037. doi: 10.1093/database/bas037. Print 2012.

Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks.

Nucleic Acids Res. 2009 Jan;37(Database issue):D786-92. doi: 10.1093/nar/gkn580. Epub 2008 Sep 9.

Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database.

Database (Oxford). 2012 Dec 5;2012:bas050. doi: 10.1093/database/bas050. Print 2012.

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information.

Database (Oxford). 2012 Nov 17;2012:bas042. doi: 10.1093/database/bas042. Print 2012.

Cited by

Integrating AI-powered text mining from PubTator into the manual curation workflow at the Comparative Toxicogenomics Database.

Database (Oxford). 2025 Feb 21;2025. doi: 10.1093/database/baaf013.

Comparative Toxicogenomics Database's 20th anniversary: update 2025.

Nucleic Acids Res. 2025 Jan 6;53(D1):D1328-D1334. doi: 10.1093/nar/gkae883.

Supporting the working life exposome: Annotating occupational exposure for enhanced literature search.

PLoS One. 2024 Aug 15;19(8):e0307844. doi: 10.1371/journal.pone.0307844. eCollection 2024.

Identification of potential molecular mechanisms and therapeutic targets for recurrent pelvic organ prolapse.

Heliyon. 2023 Aug 27;9(9):e19440. doi: 10.1016/j.heliyon.2023.e19440. eCollection 2023 Sep.

A lncRNA-disease association prediction tool development based on bridge heterogeneous information network via graph representation learning for family medicine and primary care.

Front Genet. 2023 May 18;14:1084482. doi: 10.3389/fgene.2023.1084482. eCollection 2023.

Endocrine Disrupting Chemicals Influence Hub Genes Associated with Aggressive Prostate Cancer.

Int J Mol Sci. 2023 Feb 6;24(4):3191. doi: 10.3390/ijms24043191.

Comparative Toxicogenomics Database (CTD): update 2023.

Nucleic Acids Res. 2023 Jan 6;51(D1):D1257-D1262. doi: 10.1093/nar/gkac833.

In silico functional and pathway analysis of risk genes and SNPs for type 2 diabetes in Asian population.

PLoS One. 2022 Aug 29;17(8):e0268826. doi: 10.1371/journal.pone.0268826. eCollection 2022.

A Narrative Literature Review of Natural Language Processing Applied to the Occupational Exposome.

Int J Environ Res Public Health. 2022 Jul 13;19(14):8544. doi: 10.3390/ijerph19148544.

The expression patterns and prognostic significance of pleckstrin homology-like domain family A (PHLDA) in lung cancer and malignant mesothelioma.

J Thorac Dis. 2021 Feb;13(2):689-707. doi: 10.21037/jtd-20-2909.
