Extraction of Data on Parent Compounds and Their Metabolites from Texts of Scientific Abstracts.

Affiliation

Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow 119121, Russia.

Publication information

J Chem Inf Model. 2021 Apr 26;61(4):1683-1690. doi: 10.1021/acs.jcim.0c01054. Epub 2021 Mar 16.

Abstract

The growing amount of experimental data on chemical objects includes properties of small molecules, results of studies of their interaction with human and animal proteins, and methods of synthesis of organic compounds (OCs). The data obtained can be used to identify the names of OCs automatically, including all possible synonyms and relevant data on the molecular properties and biological activity. Using different synonymic names of chemical compounds allows researchers to increase the completeness of data on their properties available from publications. Enriching the data on the names of chemical compounds with information about their possible metabolites can help estimate the biological effects of parent compounds and their metabolites more thoroughly. Therefore, automated extraction of the names of parent compounds and their metabolites from texts is an important task. In our study, we aimed to develop a method for extracting the named entities (NEs) of parent compounds and their metabolites from abstracts of scientific publications. Using the conditional random fields algorithm, we extracted the NEs of chemical compounds. We developed a set of rules allowing identification of parent compound NEs and their metabolites in the texts. We evaluated the possibility of extracting the names of potential metabolites based on cosine similarity between strings representing names of parent compounds and all other chemical NEs found in the text. Additionally, we used conditional random fields to extract the names of parent compounds and their metabolites from the texts based on a manually labeled corpus of texts. Our computational experiments showed that the use of rules in combination with cosine similarity could increase the accuracy of recognition of the names of metabolites compared to the rule-based algorithm alone and to the application of a machine-learning algorithm (conditional random fields).
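The cosine-similarity step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how name strings are turned into vectors, so the character-trigram representation, the 0.3 threshold, and the function names `char_ngrams`, `cosine_similarity`, and `rank_candidates` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(name: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased, space-padded name.

    Assumption: character trigrams stand in for whatever string
    representation the original study used.
    """
    padded = f" {name.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine of the angle between the n-gram count vectors of two names."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(parent: str, entities: list[str], threshold: float = 0.3):
    """Return (name, score) pairs for entities similar to the parent name.

    The threshold is a hypothetical cut-off; the parent itself is skipped.
    """
    scored = [
        (entity, cosine_similarity(parent, entity))
        for entity in entities
        if entity.lower() != parent.lower()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Chemical NEs as they might come out of a tagger for one abstract.
    found = ["morphine", "morphine-6-glucuronide", "normorphine", "aspirin"]
    for name, score in rank_candidates("morphine", found):
        print(f"{name}\t{score:.3f}")
```

In this toy input, names that share a long substring with the parent score well above the threshold, while an unrelated drug name shares no trigrams and is dropped. A real pipeline would first need the chemical NEs from a tagger and, as the abstract indicates, would combine such scores with rules.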

Similar articles

Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies.

Front Genet. 2020 Dec 22;11:618862. doi: 10.3389/fgene.2020.618862. eCollection 2020.

Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach.

J Cheminform. 2022 Aug 13;14(1):55. doi: 10.1186/s13321-022-00633-4.

A hybrid named entity tagger for tagging human proteins/genes.

Int J Data Min Bioinform. 2014;10(3):315-28. doi: 10.1504/ijdmb.2014.064545.

A document processing pipeline for annotating chemical entities in scientific documents.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S7. doi: 10.1186/1758-2946-7-S1-S7. eCollection 2015.

Text mining in livestock animal science: introducing the potential of text mining to animal sciences.

J Anim Sci. 2012 Oct;90(10):3666-76. doi: 10.2527/jas.2011-4841. Epub 2012 Jun 4.

PDF text classification to leverage information extraction from publication reports.

J Biomed Inform. 2016 Jun;61:141-8. doi: 10.1016/j.jbi.2016.03.026. Epub 2016 Apr 1.

NetiNeti: discovery of scientific names from text using machine learning methods.

BMC Bioinformatics. 2012 Aug 22;13:211. doi: 10.1186/1471-2105-13-211.

CHEMDNER system with mixed conditional random fields and multi-scale word clustering.

J Cheminform. 2015 Jan 19;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S4. doi: 10.1186/1758-2946-7-S1-S4. eCollection 2015.

Comparative experiments on learning information extractors for proteins and their interactions.

Artif Intell Med. 2005 Feb;33(2):139-55. doi: 10.1016/j.artmed.2004.07.016.

Cited by

XenoMet: A Corpus of Texts to Extract Data on Metabolites of Xenobiotics.

ACS Omega. 2025 Jan 12;10(3):2459-2471. doi: 10.1021/acsomega.4c05723. eCollection 2025 Jan 28.

Identification of Molecular Mechanisms Involved in Viral Infection Progression Based on Text Mining: Case Study for HIV Infection.

Int J Mol Sci. 2023 Jan 11;24(2):1465. doi: 10.3390/ijms24021465.

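Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach.

J Cheminform. 2022 Aug 13;14(1):55. doi: 10.1186/s13321-022-00633-4.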
