How to apply zero-shot learning to text data in substance use research: An overview and tutorial with media data.

Author information

Centre for Alcohol Policy Research, La Trobe University, Melbourne, Australia.

Computer Science and Information Technology, La Trobe University, Melbourne, Australia.

Publication information

Addiction. 2024 May;119(5):951-959. doi: 10.1111/add.16427. Epub 2024 Jan 11.

DOI:10.1111/add.16427

PMID:38212974

Abstract

A vast amount of media-related text data is generated daily in the form of social media posts, news stories or academic articles. These text data provide opportunities for researchers to analyse and understand how substance-related issues are being discussed. The main methods to analyse large text data (content analyses or specifically trained deep-learning models) require substantial manual annotation and resources. A machine-learning approach called 'zero-shot learning' may be quicker, more flexible and require fewer resources. Zero-shot learning uses models trained on large, unlabelled (or weakly labelled) data sets to classify previously unseen data into categories on which the model has not been specifically trained. This means that a pre-existing zero-shot learning model can be used to analyse media-related text data without the need for task-specific annotation or model training. This approach may be particularly important for analysing data that is time critical. This article describes the relatively new concept of zero-shot learning and how it can be applied to text data in substance use research, including a brief practical tutorial.
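As an illustration of the idea described in the abstract, the following minimal sketch shows how a pre-existing model might be used to sort media text into categories it was never specifically trained on. It is not the article's own tutorial code. It assumes the Hugging Face `transformers` package and its `zero-shot-classification` pipeline; the model name `facebook/bart-large-mnli`, the hypothesis template, the example labels and the helper names `make_nli_scorer`, `classify` and `demo` are all illustrative choices, not details taken from the article.

```python
from typing import Callable, Dict, List

# A scorer maps (text, candidate labels) to one score per label.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def make_nli_scorer(model_name: str = "facebook/bart-large-mnli") -> Scorer:
    """Wrap a pre-trained natural language inference model as a zero-shot scorer.

    Assumption: `transformers` is installed and the model can be downloaded.
    The model name is an illustrative choice, not the article's.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    clf = pipeline("zero-shot-classification", model=model_name)

    def score(text: str, labels: List[str]) -> Dict[str, float]:
        # Each label is inserted into the template and tested as a hypothesis
        # against the text; no task-specific training or annotation is used.
        out = clf(
            text,
            candidate_labels=labels,
            hypothesis_template="This text is about {}.",
        )
        return dict(zip(out["labels"], out["scores"]))

    return score


def classify(text: str, labels: List[str], scorer: Scorer) -> str:
    """Return the candidate label with the highest score."""
    scores = scorer(text, labels)
    return max(labels, key=lambda label: scores[label])


def demo() -> None:
    """Classify one hypothetical headline (downloads the model on first use)."""
    labels = ["alcohol", "tobacco", "cannabis"]  # hypothetical categories
    scorer = make_nli_scorer()
    headline = "Government announces new minimum unit price for beer and wine."
    print(classify(headline, labels, scorer))
```

Calling `demo()` runs the sketch end to end. Because the categories are passed in at prediction time, they can be changed without retraining, which is the flexibility the abstract refers to. Any such output would still need checking against a manually coded sample before being relied on.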
