On the Construction of Multilingual Corpora for Clinical Text Mining.

Authors

Villena Fabián, Eisenmann Urs, Knaup Petra, Dunstan Jocelyn, Ganzinger Matthias

Affiliations

Institute of Medical Biometry and Informatics, Heidelberg University, Germany.

Center of Medical Informatics and Telemedicine, University of Chile, Chile.

Publication information

Stud Health Technol Inform. 2020 Jun 16;270:347-351. doi: 10.3233/SHTI200180.

Abstract

The amount of digital data derived from healthcare processes has increased tremendously in recent years. This applies especially to unstructured data, which are often hard to analyze because few tools are available to process them and extract information. Natural language processing is often used in medicine, but the majority of tools used by researchers are developed primarily for the English language. For developing and testing natural language processing methods, it is important to have a suitable corpus that is specific to the medical domain and covers the intended target language. To improve the potential of natural language processing research, we developed tools to derive language-specific medical corpora from publicly available text sources. In order to extract medicine-specific unstructured text data, openly available publications from biomedical journals were used in a four-step process: (1) medical journal databases were scraped to download the articles, (2) the articles were parsed and consolidated into a single repository, (3) the content of the repository was described, and (4) the text data and the code were released. In total, 93 969 articles were retrieved, with a word count of 83 868 501, in three different languages (German, English, and Spanish) from two medical journal databases. Our results show that unstructured text data extraction from openly available medical journal databases for the construction of unified corpora of medical text data can be achieved through web scraping techniques.
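
Step (3) of the process, describing the content of the consolidated repository, can be illustrated with a minimal sketch. This is not the authors' released code: the `Article` record, its field names, the `describe_corpus` helper, and the whitespace-based word count are all assumptions made for illustration, and the sample sentences are invented placeholders rather than corpus data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Article:
    """One parsed article. Hypothetical record; field names are assumptions."""
    language: str
    text: str


def describe_corpus(articles):
    """Return per-language article and word counts.

    Words are counted by splitting on whitespace, which is an assumption;
    the abstract does not say how the reported word count was computed.
    """
    stats = defaultdict(lambda: {"articles": 0, "words": 0})
    for article in articles:
        entry = stats[article.language]
        entry["articles"] += 1
        entry["words"] += len(article.text.split())
    return dict(stats)


# Invented placeholder texts standing in for parsed journal articles.
corpus = [
    Article("en", "The patient was discharged after two days."),
    Article("de", "Der Patient wurde nach zwei Tagen entlassen."),
    Article("es", "El paciente fue dado de alta."),
    Article("en", "No adverse events were reported."),
]

summary = describe_corpus(corpus)
for language, counts in sorted(summary.items()):
    print(language, counts["articles"], counts["words"])
```

Keeping the language as a field on each record, rather than in separate per-language stores, is one way to keep the repository unified while still allowing per-language statistics like those reported in the abstract.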
