Heppin Karin Friberg
NLP-Unit, Department of Swedish, University of Gothenburg, S-405 30 Gothenburg, Sweden.
J Biomed Semantics. 2011;2 Suppl 3(Suppl 3):S4. doi: 10.1186/2041-1480-2-S3-S4. Epub 2011 Jul 14.
Test collections for information retrieval are scarce, and domain-specific test collections even more so. Before the MedEval test collection was built, no medical test collection existed in the Swedish language. Most research in information retrieval has been performed on English, so most test collections contain English documents. However, English is morphologically poor compared with many other European languages, and a number of interesting and important aspects have therefore not been investigated. Building a medical test collection in Swedish opens new research opportunities.
This article describes the making and potential uses of MedEval, a Swedish medical test collection with assessments not only for topical relevance but also for target reader group: Doctors or Patients. A user of the test collection may choose to search in the Doctors or the Patients scenario, where the topical relevance assessments have been adjusted to take the user group into account, or in a scenario that regards only topical relevance. In addition to having three user groups, MedEval in its present form has two indexes: one where the terms are lemmatized, and one where the terms are lemmatized, the compounds are split, and the constituents are indexed together with the whole compound.
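The difference between the two indexes can be illustrated with a minimal sketch. It assumes the input is already lemmatized and uses a tiny hand-written split table; the table, the function names and the example words are hypothetical and do not represent MedEval's actual lemmatizer or compound splitter.

```python
from typing import Dict, List

# Toy decompounding table (hypothetical, for illustration only).
TOY_SPLITS: Dict[str, List[str]] = {
    "blodtryck": ["blod", "tryck"],  # "blood pressure" -> "blood", "pressure"
}


def lemma_index_terms(lemmas: List[str]) -> List[str]:
    """Terms for a lemma-only index: each lemma is indexed as it is."""
    return list(lemmas)


def decompounded_index_terms(
    lemmas: List[str], splits: Dict[str, List[str]] = TOY_SPLITS
) -> List[str]:
    """Terms for a decompounded index: every lemma is kept, and for each
    known compound its constituents are indexed alongside the whole compound."""
    terms: List[str] = []
    for lemma in lemmas:
        terms.append(lemma)                  # the whole word is always indexed
        terms.extend(splits.get(lemma, []))  # constituents, if it is a known compound
    return terms
```

With this sketch, a query for "blod" would match a document containing "blodtryck" only in the decompounded index, which is the kind of effect the two indexes make it possible to study.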
Differences discovered between documents written for medical professionals and documents written for laypersons are presented. These differences may be utilized in further studies of the retrieval of documents aimed at certain groups of readers. For example, professional documents have a higher ratio of compounds, a greater average word length, and more multi-word expressions. An experiment is described in which the user scenarios were utilized, searching with expert terms and lay terms, separately and in combination, in the different scenarios. The tendency discovered is that the medical expert gets the best results using expert terms, while the lay person gets the best results using lay terms but also quite good results using expert terms alone or lay and expert terms in combination.
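Two of the document-level measures mentioned above, average word length and ratio of compounds, can be computed as in the following sketch. It assumes tokenized text and a known set of compound words; the function names and the idea of passing compounds as a set are illustrative assumptions, not the procedure used for MedEval.

```python
from typing import Iterable, Set


def average_word_length(tokens: Iterable[str]) -> float:
    """Mean number of characters per token (0.0 for an empty document)."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    return sum(len(token) for token in token_list) / len(token_list)


def compound_ratio(tokens: Iterable[str], compounds: Set[str]) -> float:
    """Share of tokens that appear in a given set of known compounds."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    return sum(1 for token in token_list if token in compounds) / len(token_list)
```

Comparing such values between the professional and the lay documents is one simple way to quantify differences of the kind reported here.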
The many features of MedEval give a variety of research possibilities, such as comparing the effectiveness of search terms in retrieving documents aimed at the different user groups, or studying the effect of compound decomposition on document retrieval. As Swedish, the language of MedEval, is morphologically more complex than English, it is possible to study additional aspects of the effect of natural language processing in information retrieval, for example the use of different inflectional word forms in the retrieval of expert vs. lay documents. MedEval is the first Swedish test collection in the medical domain.
The Department of Swedish at the University of Gothenburg is in the process of making the MedEval test collection available to academic researchers.