Marshfield Clinic Research Foundation, Biomedical Informatics Research Center, Marshfield, WI, United States.
JMIR Med Inform. 2014 Nov 4;2(2):e30. doi: 10.2196/medinform.3463.
A search engine to find physicians' information is a basic but crucial function of a health care provider's website. Inefficient search engines, which return no results or incorrect results, can lead to patient frustration and potential customer loss. A search engine that can handle misspellings and spelling variations of names is needed, as the United States (US) has culturally, racially, and ethnically diverse names.
The Marshfield Clinic website provides a search engine for users to search for physicians' names. The current search engine provides an auto-completion function, but it requires an exact match. We observed that 26% of all searches yielded no results. The goal was to design a fuzzy-match algorithm to help users find physicians more easily and quickly.
Instead of an exact match search, we used a fuzzy algorithm to find similar matches for searched terms. In the algorithm, we solved three types of search engine failures: "Typographic", "Phonetic spelling variation", and "Nickname". To solve these mismatches, we used a customized Levenshtein distance calculation that incorporated Soundex coding and a lookup table of nicknames derived from US census data.
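The combination described above can be illustrated with a minimal sketch. This is not the authors' implementation: the nickname table below is a tiny hypothetical stand-in for the census-derived table, and the 0.5 discount applied when two names share a Soundex code is an assumed weighting chosen only to show how a phonetic signal might adjust an edit distance.

```python
# Hypothetical nickname table; the paper derives its table from US census data.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

# Standard American Soundex letter groups.
SOUNDEX_CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' for empty input)."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    code = name[0].upper()
    prev = SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # 'h' and 'w' do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_score(query, candidate):
    """Lower is better. Expands nicknames, then discounts phonetic matches."""
    q = NICKNAMES.get(query.lower(), query.lower())
    c = candidate.lower()
    dist = levenshtein(q, c)
    if dist and soundex(q) == soundex(c):
        dist -= 0.5  # assumed discount, not taken from the paper
    return dist


def search(query, names, top_k=10):
    """Return the top_k candidate names ranked by fuzzy_score."""
    return sorted(names, key=lambda n: fuzzy_score(query, n))[:top_k]
```

For example, `search("Bob", ["Rupert", "Robert", "Roberta"], top_k=1)` resolves the nickname to "robert" before ranking, and `search("Smyth", ["Smart", "Smith", "Jones"])` ranks "Smith" first because the two names differ by one letter and share a Soundex code.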
Using the "Challenge Data Set of Marshfield Physician Names," we evaluated the accuracy of the fuzzy-match engine's top ten results (90%) and compared it with exact match (0%), Soundex (24%), Levenshtein distance (59%), and the fuzzy-match engine's top result (71%).
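A top-k accuracy figure of this kind could be computed roughly as follows. This sketch assumes that a query counts as a hit when the expected name appears among the first k results; the test cases and the canned ranker are hypothetical placeholders, not the challenge data set or the evaluated engines.

```python
def top_k_accuracy(cases, search_fn, k):
    """Fraction of (query, expected) pairs whose expected name is in the top k."""
    hits = sum(1 for query, expected in cases if expected in search_fn(query)[:k])
    return hits / len(cases)


# Hypothetical ranked outputs standing in for a real search engine.
CANNED_RESULTS = {
    "smyth": ["Smith", "Smythe"],
    "jon": ["Jones", "John"],
    "xx": [],
}


def canned_search(query):
    """Toy ranker: returns a fixed ranked list for each known query."""
    return CANNED_RESULTS.get(query, [])


# Hypothetical (misspelled query, intended name) pairs.
CASES = [("smyth", "Smith"), ("jon", "John"), ("xx", "Xu")]

top_one = top_k_accuracy(CASES, canned_search, k=1)
top_two = top_k_accuracy(CASES, canned_search, k=2)
```

With these toy inputs, one of three queries is answered correctly at rank one and two of three within the top two, which mirrors how a top-ten score can exceed a top-one score for the same engine.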
We designed and evaluated a fuzzy-match search engine for physician directories and created a reference implementation. The open-source code is available at the codeplex website, and a reference implementation is available for demonstration at the datamarsh website.