LIM & Bio EA 3969, Université Paris XIII, Sorbonne Paris Cité, 93017 Bobigny, France.
BMC Bioinformatics. 2012;13 Suppl 14(Suppl 14):S11. doi: 10.1186/1471-2105-13-S14-S11. Epub 2012 Sep 7.
The Internet is a major source of health information, but most health information seekers are unfamiliar with medical vocabularies, so their searches often fail because of poorly formulated queries. Several methods have been proposed to improve information retrieval: query expansion, syntactic and semantic techniques, and knowledge-based methods. However, it would also be useful to correct queries that are misspelled. In this paper, we propose a simple yet efficient method for correcting misspellings in queries submitted by health information seekers to a medical online search tool.
In addition to query normalizations and exact phonetic term matching, we tested two approximate string comparators: the similarity score function of Stoilos and the normalized Levenshtein edit distance. We propose combining them to increase the number of matched medical terms in French. We first took a sample of query logs to determine the thresholds and processing times. In a second, larger-scale run, we tested different combinations of query normalizations applied before or after misspelling correction, using the thresholds retained from the first run.
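To make the thresholding idea concrete, here is a minimal Python sketch of a normalized Levenshtein comparator used to filter candidate terms. It is not the authors' implementation: the function names, the toy vocabulary, and the choice to normalize by the length of the longer string are assumptions made for illustration. The default cut-off of 0.3 is the single-comparator threshold reported in the results, and the Stoilos function is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means identical strings.

    Assumption: normalization divides by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def suggest(query: str, vocabulary: list[str], max_distance: float = 0.3) -> list[str]:
    """Return vocabulary terms within max_distance of the query, closest first."""
    scored = sorted((normalized_levenshtein(query, term), term) for term in vocabulary)
    return [term for distance, term in scored if distance <= max_distance]


# Hypothetical toy vocabulary (unaccented French terms) and a misspelled query.
print(suggest("astme", ["asthme", "anemie", "arthrose"]))
```

In a combined setting, a candidate would additionally have to pass a similarity test from a second comparator (such as the Stoilos function at its own threshold) before being suggested.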
Based on the total number of suggestions (around 163, the number of queries in the first sample), the normalized Levenshtein edit distance gave its highest F-Measure (88.15%) at a comparator score threshold of 0.3, and the Stoilos function gave its highest F-Measure (84.31%) at a threshold of 0.7. When Levenshtein and Stoilos are combined, the highest F-Measure (80.28%) is obtained with thresholds of 0.2 and 0.7, respectively. However, queries are composed of several terms that may be combinations of medical terms, so a process of query normalization and segmentation is required. The highest F-Measure (64.18%) is obtained when this process is carried out before spelling correction.
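For readers less familiar with the metric, the F-Measure is commonly computed as the harmonic mean of precision and recall. The sketch below assumes that standard balanced definition (the abstract does not state which variant was used), and the counts in the example are hypothetical, not taken from the study.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-Measure (F1): harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # avoids division by zero when nothing was correctly suggested
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct suggestions, 2 wrong ones, 2 misspellings missed.
print(round(f_measure(8, 2, 2), 4))
```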
Despite the widely known high performance of the normalized Levenshtein edit distance, we show in this paper that combining it with the Stoilos algorithm improves the results for misspelling correction of user queries. Accuracy is improved by combining spelling, phoneme-based information, and string normalizations and segmentations into medical terms. These encouraging results have enabled the integration of this method into two projects funded by the French National Research Agency-Technologies for Health Care. The first aims to facilitate the coding of clinical free text contained in Electronic Health Records and discharge summaries, whereas the second aims to improve information retrieval through Electronic Health Records.