PubExN: An Automated PubMed Bulk Article Extractor with Affiliation Normalization Package.

Author information

Kumar Ashutosh, Sharaff Aakanksha

Affiliation

Department of Computer Science and Engineering, National Institute of Technology Raipur, G. E. Road, Raipur 492001, Chhattisgarh, India.

Publication information

SN Comput Sci. 2023;4(4):353. doi: 10.1007/s42979-023-01687-3. Epub 2023 Apr 26.

DOI:10.1007/s42979-023-01687-3

PMID:37128512

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10132428/

Abstract

Biomedical article extraction is a preliminary step for biomedical text mining applications. These applications help identify entities such as genes, diseases, chemicals, drugs, and proteins, and the relations among them, such as gene-gene relations, drug-disease interactions, and chemical-protein relations. PubExN can support these kinds of applications. In most cases, domain experts perform the extraction manually. Manual intervention makes the process time-consuming, and there is a high probability that documents will be missed along the way. To avoid these complications, a Python package is introduced that automates bulk extraction from the PubMed database. The extraction covers all citation information together with the associated abstract, and a batch approach is used to retrieve articles in bulk. The motivation for developing PubExN was to make the extraction of biomedical article text from NCBI's PubMed database more flexible. Each article record in PubMed contains an article identifier, the PubMed ID (PMID), along with the article title, abstract, author information, and other fields. The package can benefit many areas of biomedical text mining research, including biomedical named entity recognition, biomedical relation extraction, literature discovery, knowledge base creation, and various biomedical Natural Language Processing (NLP) tasks. In addition, it could be used for author name disambiguation and new drug discovery. The package helps save the time and effort required to extract and normalize PubMed articles.
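
The batch approach mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe PubExN's own interface, so this is not the package's API: the function names `batched` and `efetch_url` are hypothetical, and the sketch assumes records are requested through the public NCBI E-utilities EFetch endpoint. It only builds the request URLs and performs no network calls.

```python
from typing import Iterator, List, Sequence
from urllib.parse import urlencode

# Public NCBI E-utilities EFetch endpoint (an assumption for this sketch;
# the abstract does not say which interface PubExN uses internally).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batched(pmids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size PMIDs (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(pmids), batch_size):
        yield list(pmids[start:start + batch_size])


def efetch_url(batch: Sequence[str]) -> str:
    """Build one EFetch request URL asking for a batch of PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(batch), "retmode": "xml"}
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # The first PMID is this article's; the other two are placeholders.
    pmids = ["37128512", "11111111", "22222222"]
    for batch in batched(pmids, batch_size=2):
        print(efetch_url(batch))
```

Grouping identifiers this way keeps the number of requests small and each request bounded in size; a real client would also need to respect NCBI's usage limits and parse the returned XML for the PMID, title, abstract, and author fields.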
