Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports.

Author information

Kim Taehyeong, Han Sung Won, Kang Minji, Lee Se Ha, Kim Jong-Ho, Joo Hyung Joon, Sohn Jang Wook

Affiliations

Division of Industrial Management Engineering, Korea University, Seoul, Republic of Korea.

Information Computing Office, Korea University Anam Hospital, Seoul, Republic of Korea.

Publication information

JMIR Med Inform. 2021 Feb 22;9(2):e25530. doi: 10.2196/25530.

Abstract

BACKGROUND

Existing bacterial culture test results for infectious diseases are recorded as unrefined text, which introduces many problems, including typographical errors and stop words. Effective spelling correction is needed to ensure the accuracy and reliability of data for infectious disease research, including medical terminology extraction. When a dictionary is available, spelling correction algorithms based on edit distance are efficient. Without a dictionary, however, traditional algorithms that rely on edit distance alone have limitations.
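To make the dictionary-based baseline concrete, the following is a minimal sketch of edit-distance correction against a known word list. The function names, the `max_distance` threshold, and the four-entry dictionary are illustrative assumptions, not taken from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_with_dictionary(word, dictionary, max_distance=2):
    """Return the closest dictionary term, or the word unchanged if none is close."""
    best = min(dictionary, key=lambda term: levenshtein(word, term))
    return best if levenshtein(word, best) <= max_distance else word


# Hypothetical dictionary, for illustration only.
DICTIONARY = ["staphylococcus", "streptococcus", "escherichia", "klebsiella"]

print(correct_with_dictionary("stapylococcus", DICTIONARY))
```

The sketch only works because `DICTIONARY` exists. With no trusted word list there is nothing to measure distance against, which is the gap the abstract describes.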

OBJECTIVE

In this research, we propose a similarity-based spelling correction algorithm that uses pretrained word embeddings from BioWordVec. Rather than the existing rule-based approach, the method uses a distributed representation based on character-level n-grams, obtained through unsupervised learning. In other words, we propose a framework that detects and corrects typographical errors when no dictionary is available.
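As a rough illustration of why character-level n-grams help with misspellings, the sketch below extracts fastText-style n-grams (BioWordVec builds on fastText subword embeddings) and measures how many a typo shares with the intended word. The 3 to 6 n-gram range, the boundary markers, and the Jaccard overlap are assumptions chosen for illustration. A trained model compares learned vectors, not raw n-gram sets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A misspelling shares many subword units with the intended word,
# while an unrelated word shares none.
print(ngram_overlap("aureus", "aureas"))
print(ngram_overlap("aureus", "coli"))
```

Because a subword model composes a word vector from its n-gram vectors, an out-of-vocabulary misspelling still receives a vector, and shared n-grams tend to pull it toward the correctly spelled word.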

METHODS

For detected typographical errors that could not be mapped to Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), a group of correction candidates with high similarity, taking edit distance into account, was generated using pretrained word embeddings from the clinical database. The embedding matrix lists the vocabulary in descending order of frequency, and a grid search over this matrix was used to find candidate groups of similar words. The correction candidates were then ranked by word frequency, and each typographical error was corrected according to that ranking.
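The candidate generation and ranking steps described above can be sketched as follows. This is a simplified reading of the abstract, not the authors' implementation: the toy three-dimensional vectors, the `min_sim` and `max_edits` thresholds, and the function names are all hypothetical. The study selects its search settings by grid search, whereas fixed values are used here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def rank_candidates(typo, typo_vec, vocab, min_sim=0.8, max_edits=2):
    """Return correction candidates, most frequent first.

    `vocab` is a list of (word, vector) pairs already sorted by descending
    frequency, so scanning it in order yields candidates in frequency rank.
    """
    return [
        word
        for word, vec in vocab
        if word != typo
        and cosine(typo_vec, vec) >= min_sim
        and edit_distance(typo, word) <= max_edits
    ]


# Toy vocabulary in descending frequency order (vectors are made up).
VOCAB = [
    ("coli", [0.9, 0.1, 0.0]),
    ("aureus", [0.1, 0.9, 0.1]),
    ("aurus", [0.1, 0.9, 0.12]),
]

candidates = rank_candidates("aurues", [0.12, 0.88, 0.12], VOCAB)
print(candidates[0] if candidates else "aurues")
```

Requiring both a high embedding similarity and a small edit distance filters out words that are semantically related but spelled differently. Frequency ranking then favors the common, presumably correct, spelling over rarer variants.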

RESULTS

Bacterial identification words were extracted from 27,544 bacterial culture and antimicrobial susceptibility reports; 16 types of spelling errors and 914 misspelled words were found. The proposed similarity-based spelling correction algorithm using BioWordVec corrected 12 types of typographical errors and reached an F1 score of 97.48% in correcting all spelling errors.
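For reference, the F1 score is conventionally the harmonic mean of precision and recall. The abstract does not state how true positives, false positives, and false negatives were counted for the correction task, so the definition below is the standard one and not necessarily the exact evaluation protocol of the study.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```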

CONCLUSIONS

This tool corrected spelling errors effectively in the absence of a dictionary, using bacterial identification words from bacterial culture and antimicrobial susceptibility reports. The method will help build a high-quality, refined database from the vast text data in electronic health records.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/87cf/7939936/b859d32fd12f/medinform_v9i2e25530_fig1.jpg
