Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies.

Authors

Biziukova Nadezhda, Tarasova Olga, Ivanov Sergey, Poroikov Vladimir

Affiliations

Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia.

Department of Bioinformatics, Faculty of Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia.

Publication information

Front Genet. 2020 Dec 22;11:618862. doi: 10.3389/fgene.2020.618862. eCollection 2020.

Abstract

Text analysis can help identify named entities (NEs) of small molecules, proteins, and genes. Such data are important for analyzing the molecular mechanisms of disease progression and for developing new treatment strategies for various diseases and pathological conditions. The texts of publications are a primary source of information; compared with databases, they give immediate access to the data, which is especially important for collecting data of the highest quality. In our study, we aimed to develop and test an approach to named entity recognition in the abstracts of publications. More specifically, we developed and tested an algorithm based on conditional random fields that recognizes NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest makes it possible to extract NEs strongly associated with that subject. To test the applicability of our approach, we applied it to the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments provide estimates of the accuracy of recognition of chemical NEs and of protein (gene) NEs. For chemical NEs, precision is over 0.91, recall is 0.86, and the F1-score (the harmonic mean of precision and recall) is 0.89; for protein and gene names, precision is over 0.86, recall is 0.83, and the F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms that it is possible to extract NEs strongly relevant to (i) HIV inhibitors and (ii) a specific group of patients, namely HIV-positive individuals able to maintain an undetectable HIV-1 viral load over time in the absence of antiretroviral therapy. Analysis of the results provides insights into the functions of proteins that may be responsible for viremic control. Our study demonstrates the applicability of the developed approach to extracting useful data on HIV treatment.
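
The abstract defines the F1-score as the harmonic mean of precision and recall. The short Python sketch below illustrates that definition only; it is not the authors' evaluation code. The function names and the toy entity spans are hypothetical, and the inputs 0.91, 0.86, and 0.83 are the rounded figures quoted in the abstract. Because the abstract gives precision as "over" those figures, recomputing from the rounded inputs yields values slightly below the reported F1-scores (about 0.88 for chemicals and about 0.84 for proteins and genes).

```python
from typing import Set, Tuple

# A span is (start offset, end offset, entity label); this layout is an assumption.
Span = Tuple[int, int, str]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Entity-level precision and recall with exact span matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Rounded precision and recall values quoted in the abstract.
chemical_f1 = f1_score(0.91, 0.86)
protein_f1 = f1_score(0.86, 0.83)
print(f"chemical NEs:     F1 = {chemical_f1:.3f}")
print(f"protein/gene NEs: F1 = {protein_f1:.3f}")

# Hypothetical toy data: three gold entities, three predictions, two matches.
gold = {(0, 10, "CHEM"), (15, 18, "GENE"), (30, 36, "GENE")}
predicted = {(0, 10, "CHEM"), (15, 18, "GENE"), (40, 45, "CHEM")}
p, r = precision_recall(gold, predicted)
toy_f1 = f1_score(p, r)
print(f"toy example: precision = {p:.3f}, recall = {r:.3f}, F1 = {toy_f1:.3f}")
```

Exact span matching is the strictest common choice for scoring named entity recognition; whether the study used exact or partial matching is not stated in the abstract.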

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fb35/7783389/2b5c7022aa58/fgene-11-618862-g0001.jpg
