Humbert-Droz Marie, Mukherjee Pritam, Gevaert Olivier
Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA, United States.
Department of Biomedical Data Science, Stanford University, Stanford, CA, United States.
JMIR Med Inform. 2022 Mar 14;10(3):e32903. doi: 10.2196/32903.
Automated extraction of symptoms from clinical notes is a challenging task owing to the multidimensional nature of symptom description. Labeled training data are extremely scarce because clinical notes contain protected health information. Natural language processing and machine learning have great potential for processing clinical text for this task. However, supervised machine learning requires a large amount of labeled data to train a model, and this requirement is the main bottleneck in model development.
This study aims to address the lack of labeled data by proposing 2 alternatives to manual labeling for generating training labels for supervised machine learning with English clinical text. We aim to demonstrate that training on lower-quality labels can still lead to good classification results.
We addressed the lack of labels with 2 strategies. The first approach took advantage of the structured part of electronic health records and used diagnosis codes (International Classification of Diseases, 10th Revision [ICD-10]) to derive training labels. The second approach used weak supervision and data programming principles to derive training labels. We applied the developed framework to the extraction of symptom information from outpatient visit progress notes of patients with cardiovascular diseases.
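The two labeling strategies can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the symptom (chest pain), the code mapping `SYMPTOM_CODES`, the three labeling functions, and the majority-vote combiner are all assumptions made for illustration. Data programming usually combines labeling-function votes with a learned label model; a plain majority vote stands in for it here.

```python
import re

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical mapping from a symptom to diagnosis codes (illustrative only).
# R07.9 is the ICD-10 code for "Chest pain, unspecified".
SYMPTOM_CODES = {"chest_pain": {"R07.9"}}

def label_from_icd10(encounter_codes, symptom):
    """Strategy 1: label a note from the codes attached to its encounter."""
    mapped = SYMPTOM_CODES[symptom]
    return POSITIVE if any(code in mapped for code in encounter_codes) else NEGATIVE

# Strategy 2: labeling functions, each a noisy heuristic that may abstain.
MENTION = re.compile(r"\bchest pain\b", re.IGNORECASE)
NEGATED = re.compile(r"\b(?:no|denies|without)\s+chest pain\b", re.IGNORECASE)
SYNONYM = re.compile(r"\bangina\b", re.IGNORECASE)

def lf_mention(note):
    # Vote positive only when a mention exists and no negated mention is found.
    if MENTION.search(note) and not NEGATED.search(note):
        return POSITIVE
    return ABSTAIN

def lf_negation(note):
    return NEGATIVE if NEGATED.search(note) else ABSTAIN

def lf_synonym(note):
    return POSITIVE if SYNONYM.search(note) else ABSTAIN

LABELING_FUNCTIONS = [lf_mention, lf_negation, lf_synonym]

def weak_label(note):
    """Combine votes by simple majority; ties and all-abstain yield ABSTAIN."""
    votes = [lf(note) for lf in LABELING_FUNCTIONS]
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE
```

In this sketch, notes for which `weak_label` returns `ABSTAIN` would simply be left out of the training set, while the remaining notes and their weak labels would be passed to a standard supervised classifier.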
We trained our classification model on >500,000 notes with International Classification of Diseases, 10th Revision codes as labels and on >800,000 notes with labels derived from weak supervision. We show that the dependence of recall on prevalence flattens provided that a sufficiently large training set is used (>500,000 documents). We further demonstrate that training on weak labels rather than on the electronic health record codes derived from the patient encounter improves the overall recall score (10% improvement, on average). Finally, the external validation of our models shows excellent predictive performance and transferability, with an overall increase of 20% in the recall score.
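For reference, the two quantities compared above can be computed per symptom as in the following sketch. The function names and the toy labels are assumptions for illustration only and are not data from the study.

```python
def recall(y_true, y_pred):
    """Fraction of truly positive notes that the model also labels positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

def prevalence(y_true):
    """Fraction of notes in which the symptom is truly present."""
    return sum(y_true) / len(y_true)

# Made-up labels for one symptom (1 = symptom present, 0 = absent).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
symptom_recall = recall(y_true, y_pred)
symptom_prevalence = prevalence(y_true)
```

Computing both values for each symptom and plotting recall against prevalence is one way to check whether rare symptoms are recalled as well as common ones.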
This work demonstrates the power of a weak labeling pipeline for annotating and extracting symptom mentions in clinical text, with the prospect of facilitating the integration of symptom information into downstream clinical tasks such as clinical decision support.