
VetTag: improving automated veterinary diagnosis coding via large-scale language modeling.

Author Information

Zhang Yuhui, Nie Allen, Zehnder Ashley, Page Rodney L, Zou James

Affiliations

1. Department of Computer Science and Technology, Tsinghua University, Beijing, China.

2. Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.

Publication Information

NPJ Digit Med. 2019 May 8;2:35. doi: 10.1038/s41746-019-0113-1. eCollection 2019.

Abstract

Unlike human medical records, most veterinary records are free text without standard diagnosis coding. The lack of systematic coding is a major barrier to the growing interest in leveraging veterinary records for public health and translational research. Recent machine learning efforts have been limited to predicting 42 top-level diagnosis categories from veterinary notes. Here we develop a large-scale algorithm to automatically predict all 4577 standard veterinary diagnosis codes from free text. We train our algorithm on a curated dataset of over 100 K expert-labeled veterinary notes and over one million unlabeled notes. Our algorithm is based on an adapted Transformer architecture, and we demonstrate that large-scale language modeling on the unlabeled notes, via pretraining and as an auxiliary objective during supervised learning, greatly improves performance. We systematically evaluate the performance of the model and several baselines in challenging settings where algorithms trained on one hospital are evaluated in a different hospital with substantial domain shift. In addition, we show that hierarchical training can address severe data imbalance for fine-grained diagnoses with few training cases, and we provide interpretation of what is learned by the deep network. Our algorithm addresses an important challenge in veterinary medicine, and our model and experiments add insights into the power of unsupervised learning for clinical natural language processing.
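The abstract describes a Transformer trained with language modeling on unlabeled notes and then fine-tuned with that language-modeling loss kept as an auxiliary objective alongside multi-label classification over the 4577 diagnosis codes. The PyTorch sketch below illustrates one way such a joint objective can be wired up; it is not the authors' released code, and the model size, vocabulary, mean pooling, and the lm_weight coefficient are illustrative assumptions (a faithful reimplementation would also apply causal masking for the language-modeling head).

# Minimal sketch of the joint objective described in the abstract:
# a Transformer over note tokens with (i) a language-modeling head used for
# pretraining and as an auxiliary loss, and (ii) a multi-label sigmoid head
# over the 4577 diagnosis codes. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn

VOCAB_SIZE = 30_000      # assumed subword vocabulary size
NUM_CODES = 4577         # standard veterinary diagnosis codes (from the paper)
D_MODEL = 512
MAX_LEN = 512

class VetTagSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)    # auxiliary language model
        self.cls_head = nn.Linear(D_MODEL, NUM_CODES)    # multi-label diagnosis codes

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok_emb(tokens) + self.pos_emb(pos))
        lm_logits = self.lm_head(h)            # per-position vocabulary logits
        cls_logits = self.cls_head(h.mean(1))  # pooled note representation for coding
        return lm_logits, cls_logits

def joint_loss(lm_logits, cls_logits, tokens, code_targets, lm_weight=0.5):
    # Language-modeling loss: predict token t+1 from position t (causal masking
    # omitted for brevity in this sketch).
    lm_loss = nn.functional.cross_entropy(
        lm_logits[:, :-1].reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
    # Multi-label classification loss over all diagnosis codes.
    cls_loss = nn.functional.binary_cross_entropy_with_logits(cls_logits, code_targets)
    return cls_loss + lm_weight * lm_loss

# Usage on dummy data:
model = VetTagSketch()
tokens = torch.randint(0, VOCAB_SIZE, (2, 128))
codes = torch.zeros(2, NUM_CODES)
codes[0, 10] = 1.0        # example multi-hot label vector
loss = joint_loss(*model(tokens), tokens, codes)
loss.backward()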


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9cab/6550141/9a8d8c8f1597/41746_2019_113_Fig1_HTML.jpg
