Soni Sarvesh, Roberts Kirk
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston TX, USA.
AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:626-635. eCollection 2020.
This paper describes a paraphrasing approach to improve the performance of question answering (QA) over electronic health records (EHRs). QA systems for structured EHR data usually rely on semantic parsing, which generates machine-understandable logical forms from free-text questions. Training semantic parsers requires large datasets of question-logical form (QL) pairs, which are labor-intensive to create. Given the scarcity of large QL datasets in the clinical domain, we propose a framework for expanding an existing dataset using paraphrasing. We experiment with different heuristics across multiple sample sizes and iterations to assess the effect of adding paraphrases on semantic parsing. We found that adding paraphrases to an existing dataset based on TERTHRESHOLD scores improves performance in the majority (74%) of the experimental runs. Hence, the proposed paraphrasing-based framework has the potential to improve the performance of QA systems using a limited set of existing QL annotations.
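The dataset-expansion idea in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the scoring heuristic, so the sketch assumes an edit-rate style score (word-level edit distance divided by the length of the original question, a simplification that omits the shift operation of true translation edit rate). The function names `word_edit_rate` and `expand_dataset`, the `paraphrase` callable, and the `min_rate`/`max_rate` cutoffs are all hypothetical. Each accepted paraphrase simply inherits the logical form of the question it was generated from.

```python
from typing import Callable, Iterable, List, Tuple

QLPair = Tuple[str, str]  # (question, logical form)


def word_edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: a simplified stand-in for an edit-rate score; it does not
    model block shifts.
    """
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] = distance between the hypothesis prefix seen so far and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete hypothesis word
                            curr[j - 1] + 1,      # insert reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


def expand_dataset(
    pairs: List[QLPair],
    paraphrase: Callable[[str], Iterable[str]],  # hypothetical generator
    min_rate: float = 0.1,  # illustrative cutoff: drop near-copies
    max_rate: float = 0.6,  # illustrative cutoff: drop likely meaning drift
) -> List[QLPair]:
    """Add paraphrased questions whose score falls inside the chosen band."""
    expanded = list(pairs)
    seen = {question.lower() for question, _ in pairs}
    for question, logical_form in pairs:
        for candidate in paraphrase(question):
            rate = word_edit_rate(candidate, question)
            if min_rate <= rate <= max_rate and candidate.lower() not in seen:
                seen.add(candidate.lower())
                # The paraphrase keeps the logical form of its source question.
                expanded.append((candidate, logical_form))
    return expanded
```

In practice `paraphrase` would wrap whatever paraphrase generator is available, and the cutoffs would need tuning on held-out data; the values shown are placeholders.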