Zhu Dongqing, Wu Stephen, Carterette Ben, Liu Hongfang
Department of Computer and Information Sciences, University of Delaware, 440 Smith Hall, Newark, DE 19716, USA.
Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA.
J Biomed Inform. 2014 Jun;49:275-81. doi: 10.1016/j.jbi.2014.03.010. Epub 2014 Mar 26.
In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common-sense approach of "use all available data" is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.
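The two technical ingredients named in the abstract, a mixture of relevance models and mean average precision, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their estimation details or mixture weights, so the RM1-style estimate, the function names, and the toy inputs are all assumptions made for illustration.

```python
from collections import defaultdict


def relevance_model(feedback_docs):
    """RM1-style estimate of P(term | query) from one collection.

    feedback_docs: list of (term_counts, query_likelihood) pairs, where
    term_counts maps term -> count in a top-ranked document and
    query_likelihood is that document's retrieval score (assumed > 0).
    """
    total_score = sum(score for _, score in feedback_docs)
    model = defaultdict(float)
    for term_counts, score in feedback_docs:
        doc_len = sum(term_counts.values())
        if doc_len == 0:
            continue  # an empty document contributes nothing
        for term, count in term_counts.items():
            # P(term | doc) weighted by the document's normalized score
            model[term] += (count / doc_len) * (score / total_score)
    return dict(model)


def mix_relevance_models(models, weights):
    """Linear mixture: P(term | query) = sum_c lambda_c * P_c(term | query).

    weights are hypothetical per-collection lambdas; they are normalized
    here so the mixture remains a probability distribution.
    """
    total = sum(weights)
    mixed = defaultdict(float)
    for model, weight in zip(models, weights):
        for term, prob in model.items():
            mixed[term] += (weight / total) * prob
    return dict(mixed)


def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

In this sketch each auxiliary collection would yield its own `relevance_model`, and `mix_relevance_models` combines them into one set of expansion-term weights. Setting a collection's weight to zero removes it from the mixture, which is one simple way to compare setups with and without a given resource.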