School of Information, University of Arizona, Tucson, USA.
Department of Biostatistics and Health Informatics, King's College London, London, United Kingdom.
Yearb Med Inform. 2021 Aug;30(1):239-244. doi: 10.1055/s-0041-1726522. Epub 2021 Sep 3.
We survey recent work in biomedical NLP on building more adaptable or generalizable models, with a focus on work dealing with electronic health record (EHR) texts, to better understand recent trends in this area and identify opportunities for future research.
We searched PubMed, the Institute of Electrical and Electronics Engineers (IEEE), the Association for Computational Linguistics (ACL) anthology, the Association for the Advancement of Artificial Intelligence (AAAI) proceedings, and Google Scholar for the years 2018-2020. We reviewed abstracts to identify the most relevant and impactful work, and manually extracted data points from each of these papers to characterize the types of methods and tasks that were studied, in which clinical domains, and current state-of-the-art results.
The ubiquity of pre-trained transformers in clinical NLP research has contributed to an increase in domain adaptation and generalization-focused work that uses these models as the key component. Most recently, work has started to train biomedical transformers and to extend the fine-tuning process with additional domain adaptation techniques. We also highlight recent research in cross-lingual adaptation, as a special case of adaptation.
While pre-trained transformer models have led to some large performance improvements, general-domain pre-training does not always transfer adequately to the clinical domain because of its highly specialized language. There is also much work to be done in showing that the gains obtained from pre-trained transformers carry over to real-world use cases. The amount of work in domain adaptation and transfer learning is limited by dataset availability, and creating datasets for new domains is challenging. The growing body of research in languages other than English is encouraging, and more collaboration between researchers across the language divide would likely accelerate progress in non-English clinical NLP.