

An interpretable method for automated classification of spoken transcripts and written text.

Authors

Wahde Mattias, Della Vedova Marco L, Virgolin Marco, Suvanto Minerva

Affiliations

Chalmers University of Technology, 412 96 Gothenburg, Sweden.

Evolutionary Intelligence Group, Centrum Wiskunde & Informatica, Science Park 123, 1098 XG Amsterdam, The Netherlands.

Publication

Evol Intell. 2023 May 4:1-13. doi: 10.1007/s12065-023-00851-1.

Abstract

We investigate the differences between spoken language (in the form of radio show transcripts) and written language (Wikipedia articles) in the context of text classification. We present a novel, interpretable method for text classification, involving a linear classifier using a large set of gram features, and apply it to a newly generated data set with sentences originating either from spoken transcripts or written text. Our classifier reaches an accuracy less than 0.02 below that of a commonly used classifier (DistilBERT) based on deep neural networks (DNNs). Moreover, our classifier has an integrated measure of confidence, for assessing the reliability of a given classification. An online tool is provided for demonstrating our classifier, particularly its interpretable nature, which is a crucial feature in classification tasks involving high-stakes decision-making. We also study the capability of DistilBERT to carry out fill-in-the-blank tasks in either spoken or written text, and find it to perform similarly in both cases. Our main conclusion is that, with careful improvements, the performance gap between classical methods and DNN-based methods may be reduced significantly, such that the choice of classification method comes down to the need (if any) for interpretability.
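The abstract describes a linear classifier over a large set of gram features, with an integrated confidence measure for assessing the reliability of each classification. The sketch below is a rough, hypothetical illustration of that general idea, not the authors' implementation: it builds word n-gram counts, trains a perceptron-style linear model on a tiny invented spoken/written corpus, and reports a normalized margin as a confidence score. All function names, the training data, and the confidence definition are assumptions for illustration.

```python
# Hypothetical sketch of a linear classifier over n-gram features with a
# margin-based confidence score. NOT the authors' method; labels, data,
# and the confidence definition are invented for illustration.
from collections import Counter

def ngrams(sentence, n=2):
    """Extract word unigrams and bigrams (up to n) from a sentence."""
    tokens = sentence.lower().split()
    grams = list(tokens)
    for k in range(2, n + 1):
        grams += [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return grams

def train(samples, epochs=20, lr=0.1):
    """Perceptron-style training of linear weights w over n-gram counts.
    Each sample is (text, label) with label +1 = spoken, -1 = written."""
    w = Counter()
    for _ in range(epochs):
        for text, label in samples:
            feats = Counter(ngrams(text))
            score = sum(w[g] * c for g, c in feats.items())
            if score * label <= 0:  # misclassified: nudge weights
                for g, c in feats.items():
                    w[g] += lr * label * c
    return w

def classify(w, text):
    """Return (label, confidence); confidence is the absolute score
    normalized by the number of active features (a simple margin proxy)."""
    feats = Counter(ngrams(text))
    score = sum(w[g] * c for g, c in feats.items())
    norm = sum(feats.values()) or 1
    label = "spoken" if score >= 0 else "written"
    return label, abs(score) / norm

# Tiny invented corpus: filler-heavy spoken-style vs. formal written-style.
train_set = [
    ("yeah i mean you know it was great", +1),
    ("well um i think so yeah", +1),
    ("the city is located in western sweden", -1),
    ("the article describes a method for classification", -1),
]
w = train(train_set)
print(classify(w, "you know i think yeah"))
print(classify(w, "the method is described in the article"))
```

Because every n-gram keeps an explicit weight, one can inspect exactly which features pushed a sentence toward "spoken" or "written", which is the kind of interpretability the abstract emphasizes; the normalized margin gives a crude analogue of the integrated confidence measure.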


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4925/10157555/81e3429988b1/12065_2023_851_Fig1_HTML.jpg
