Medical University of South Carolina, Charleston, South Carolina, USA.
Clinacuity, Inc., Charleston, South Carolina, USA.
AMIA Annu Symp Proc. 2021 Jan 25;2020:648-657. eCollection 2020.
De-identification of electronic health record narratives is a fundamental natural language processing task for protecting patient privacy. We explore different types of ensemble learning methods to improve clinical text de-identification, and present two ensemble-based approaches for combining multiple predictive models. The first method selects an optimal subset of de-identification models by greedy exclusion. This ensemble pruning saves computational time and physical resources while achieving performance similar to or better than the ensemble of all members. The second method is a stacked ensemble framed as sequence labelling: it trains a sequential model over a sequence of words, for which we employ search-based structured prediction and bidirectional long short-term memory algorithms. We create ensembles consisting of de-identification models trained on two clinical text corpora. Experimental results show that our ensemble systems can effectively integrate predictions from individual models and offer better generalization across two different corpora.
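As an illustration of the greedy exclusion idea, the following is a minimal sketch, not the authors' implementation. It assumes the ensemble combines token-level labels by majority vote and that a subset is scored by token accuracy on held-out data. The abstract specifies neither, and the names `majority_vote`, `token_accuracy`, and `greedy_exclusion` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Labels = List[str]


def majority_vote(predictions: Sequence[Labels]) -> Labels:
    """Combine aligned label sequences by per-token majority vote (assumed combiner)."""
    return [Counter(token_labels).most_common(1)[0][0]
            for token_labels in zip(*predictions)]


def token_accuracy(predicted: Labels, gold: Labels) -> float:
    """Fraction of tokens whose label matches the gold label (assumed metric)."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def greedy_exclusion(
    model_preds: Dict[str, Labels],
    gold: Labels,
    score: Callable[[Labels, Labels], float] = token_accuracy,
) -> List[str]:
    """Start from all models and repeatedly drop the one whose removal
    gives the best score, stopping when any removal would hurt."""
    selected = list(model_preds)
    best = score(majority_vote([model_preds[m] for m in selected]), gold)
    while len(selected) > 1:
        candidates = []
        for m in selected:
            rest = [x for x in selected if x != m]
            s = score(majority_vote([model_preds[x] for x in rest]), gold)
            candidates.append((s, rest))
        top_score, top_subset = max(candidates, key=lambda c: c[0])
        if top_score < best:  # every removal degrades the ensemble
            break
        best, selected = top_score, top_subset
    return selected


if __name__ == "__main__":
    # Toy held-out data: three hypothetical models labelling five tokens.
    gold = ["O", "NAME", "NAME", "O", "DATE"]
    preds = {
        "a": ["O", "NAME", "NAME", "O", "DATE"],
        "b": ["O", "NAME", "O", "O", "DATE"],
        "c": ["O", "O", "O", "O", "O"],
    }
    print(greedy_exclusion(preds, gold))
```

Because a removal is accepted whenever the score does not drop, the sketch favours smaller ensembles at equal performance, which matches the stated goal of saving resources while keeping similar or better results.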