Clinical Text Data in Machine Learning: Systematic Review.

Authors

Spasic Irena, Nenadic Goran

Affiliations

School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom.

Department of Computer Science, University of Manchester, Manchester, United Kingdom.

Publication information

JMIR Med Inform. 2020 Mar 31;8(3):e17984. doi: 10.2196/17984.

Abstract

BACKGROUND

Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data.

OBJECTIVE

The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice.

METHODS

Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics.

RESULTS

The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies using more. Relatively small datasets were used for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model: it iteratively samples a subset of data for manual annotation. Supervised learning was applied successfully where clinical codes, stored alongside free-text notes in electronic health records, served as class labels. Similarly, distant supervision used an existing knowledge base to annotate raw text automatically. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of the data. In addition to their small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance.
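To make the active learning idea concrete, the following is a minimal sketch of pool-based active learning with least-confidence sampling. It is not taken from any of the reviewed studies: the tiny naive Bayes classifier, the toy notes in `POOL`, the `HIDDEN_LABELS` list, and the `oracle` function (standing in for a human annotator) are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes over bag-of-words with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each class to its posterior probability."""
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            logp = math.log(self.priors[c])
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are ignored
                    logp += math.log((self.counts[c][w] + 1) / total)
            scores[c] = logp
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def train(pool, labeled):
    indices = list(labeled)
    return NaiveBayes().fit([pool[i] for i in indices], [labeled[i] for i in indices])


def active_learning(pool, oracle, seed, rounds, batch_size=1):
    """Each round, ask the oracle to label the documents the model is least sure about."""
    labeled = {i: oracle(i) for i in seed}
    for _ in range(rounds):
        model = train(pool, labeled)
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Least confidence: the smallest top-class probability comes first.
        unlabeled.sort(key=lambda i: max(model.predict_proba(pool[i]).values()))
        for i in unlabeled[:batch_size]:
            labeled[i] = oracle(i)
    return train(pool, labeled), labeled


# Hypothetical toy pool of notes; real clinical text would be far messier.
POOL = [
    "elevated glucose insulin started",
    "wheezing inhaler prescribed",
    "glucose poorly controlled insulin dose increased",
    "wheezing at night inhaler use increased",
    "follow up insulin glucose stable",
    "persistent cough wheezing",
]
HIDDEN_LABELS = ["diabetes", "asthma", "diabetes", "asthma", "diabetes", "asthma"]


def oracle(index):
    """Stands in for a human annotator who labels one document on request."""
    return HIDDEN_LABELS[index]


model, labeled = active_learning(POOL, oracle, seed=[0, 1], rounds=2)
```

In this sketch only four of the six documents are ever sent to the annotator. The same loop would work with any classifier that exposes class probabilities; least confidence is just one of several possible sampling criteria.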

CONCLUSIONS

We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as ways of reducing the annotation effort. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or from unsupervised learning, which does not require data annotation.
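As an illustration of distant supervision, the sketch below uses a small dictionary as a stand-in for a knowledge base and annotates raw text by matching its terms. The `KNOWLEDGE_BASE` entries, the label names, and the `distant_annotate` function are assumptions made for this example and do not come from the reviewed studies; labels produced this way are noisy and would normally be treated as weak training signal rather than gold-standard annotation.

```python
import re

# Hypothetical miniature knowledge base: surface form -> concept type.
KNOWLEDGE_BASE = {
    "metformin": "DRUG",
    "insulin": "DRUG",
    "type 2 diabetes": "DISEASE",
    "diabetes": "DISEASE",
    "asthma": "DISEASE",
}


def distant_annotate(text, kb):
    """Return sorted (start, end, surface, label) tuples for KB terms found in text."""
    annotations = []
    # Try longer terms first so that "type 2 diabetes" wins over "diabetes".
    for term in sorted(kb, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(
                match.start() < end and start < match.end()
                for start, end, _, _ in annotations
            )
            if not overlaps:
                annotations.append(
                    (match.start(), match.end(), match.group(0), kb[term])
                )
    return sorted(annotations)


note = "Patient with type 2 diabetes started on Metformin."
annotations = distant_annotate(note, KNOWLEDGE_BASE)
```

Each tuple can then be converted into token-level labels for training a sequence tagger, with no manual annotation involved.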

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5aca/7157505/185e05f429ff/medinform_v8i3e17984_fig1.jpg
