Google, Mountain View, CA, USA.
Nat Commun. 2022 Dec 2;13(1):7456. doi: 10.1038/s41467-022-35007-9.
Physicians write clinical notes with abbreviations and shorthand that are difficult to decipher. Abbreviations can be clinical jargon (writing "HIT" for "heparin induced thrombocytopenia"), ambiguous terms that require expertise to disambiguate (using "MS" for "multiple sclerosis" or "mental status"), or domain-specific vernacular ("cb" for "complicated by"). Here we train machine learning models on public web data to decode such text by replacing abbreviations with their meanings. We report a single translation model that simultaneously detects and expands thousands of abbreviations in real clinical notes, with accuracies ranging from 92.1% to 97.1% on multiple external test datasets. The model equals or exceeds the performance of board-certified physicians (97.6% vs 88.7% total accuracy). Our results demonstrate a general method to contextually decipher abbreviations and shorthand that is built without any privacy-compromising data.