Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies.

Author information

Biziukova Nadezhda, Tarasova Olga, Ivanov Sergey, Poroikov Vladimir

Affiliations

Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia.

Department of Bioinformatics, Faculty of Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia.

Publication information

Front Genet. 2020 Dec 22;11:618862. doi: 10.3389/fgene.2020.618862. eCollection 2020.

DOI:10.3389/fgene.2020.618862

PMID:33414815

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7783389/

Abstract

Text analysis can help identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for analyzing the molecular mechanisms of disease progression and for developing new strategies to treat various diseases and pathological conditions. The texts of publications are a primary source of information and, unlike databases, make new data available immediately, which is especially important for collecting data of the highest quality. In our study, we aimed to develop and test an approach to named entity recognition in the abstracts of publications. More specifically, we developed and tested an algorithm based on conditional random fields that recognizes NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest makes it possible to extract NEs strongly associated with that subject. To test the applicability of our approach, we applied it to the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments provide estimates of the accuracy of recognizing chemical NEs and protein (gene) NEs. For chemical NEs, precision is over 0.91, recall is 0.86, and the F1-score (harmonic mean of precision and recall) is 0.89; for protein and gene names, precision is over 0.86, recall is 0.83, and the F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms that it is possible to extract NEs strongly relevant to (i) HIV inhibitors and (ii) a specific group of patients, namely HIV-positive individuals able to maintain an undetectable HIV-1 viral load over time in the absence of antiretroviral therapy. Analysis of the results provides insights into the functions of proteins that may be responsible for viremic control. Our study demonstrates the applicability of the developed approach to extracting useful data on HIV treatment.
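
The precision, recall, and F1-score quoted in the abstract can be illustrated with a minimal sketch. It assumes exact-match scoring over sets of entity spans, which is a common convention for NER evaluation but is not confirmed by the abstract as the authors' procedure. The function names, span offsets, and labels below are hypothetical and chosen only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: set, gold: set) -> tuple:
    """Exact-match precision, recall, and F1 over sets of (start, end, label) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical toy spans: offsets and labels are invented, not taken from the paper.
gold = {(0, 10, "CHEM"), (25, 33, "GENE"), (40, 48, "CHEM")}
predicted = {(0, 10, "CHEM"), (25, 33, "GENE"), (50, 55, "GENE")}
p, r, f1 = evaluate(predicted, gold)

# Rounded values from the abstract for chemical NEs (precision is given only as "over 0.91").
chem_f1 = f1_score(0.91, 0.86)
```

Plugging in the rounded chemical NE values (0.91 and 0.86) gives an F1-score of roughly 0.88, close to the reported 0.89. The small gap is plausibly due to rounding, since the abstract states precision only as "over 0.91".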

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fb35/7783389/2b5c7022aa58/fgene-11-618862-g0001.jpg
