Garrard Peter, Rentoumi Vassiliki, Gesierich Benno, Miller Bruce, Gorno-Tempini Maria Luisa
Stroke and Dementia Research Centre, St George's, University of London, Cranmer Terrace, London SW17 0RE, UK.
Cortex. 2014 Jun;55:122-9. doi: 10.1016/j.cortex.2013.05.008. Epub 2013 Jun 14.
Advances in automatic text classification have been necessitated by the rapid increase in the availability of digital documents. Machine learning (ML) algorithms can 'learn' from data: for instance, an ML system can be trained on a set of features derived from written texts belonging to known categories, and learn to distinguish between them. Such a trained system can then be used to classify unseen texts. In this paper, we explore the potential of the technique to classify transcribed speech samples along clinical dimensions, using vocabulary data alone. We report the accuracy with which two related ML algorithms [naive Bayes Gaussian (NBG) and naive Bayes multinomial (NBM)] categorized picture descriptions produced by: 32 semantic dementia (SD) patients versus 10 healthy, age-matched controls; and SD patients with left- (n = 21) versus right-predominant (n = 11) patterns of temporal lobe atrophy. We used information gain (IG) to identify the vocabulary features that were most informative for each of these two distinctions. In the SD versus control classification task, both algorithms achieved accuracies of greater than 90%. In the right- versus left-predominant temporal lobe classification, NBM achieved a high level of accuracy (88%), and both NBM and NBG achieved this level when the features used in the training set were restricted to those with high values of IG. The most informative features for the patient versus control task were low-frequency content words, generic terms and components of metanarrative statements. For the right versus left task, the number of informative lexical features was too small to support any specific inferences. An enriched feature set, including values derived from Quantitative Production Analysis (QPA), may shed further light on this little-understood distinction.
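The pipeline outlined in the abstract (rank vocabulary features by information gain, then train a naive Bayes classifier on the retained words) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the four toy transcripts and their labels are invented, IG is computed on a binary "word occurs in the transcript" feature, the cut-off `min_ig` and the smoothing constant `alpha` are hypothetical, and the function names are ours. Only the multinomial variant (NBM) is shown; the Gaussian variant (NBG) is omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG of the binary feature 'word occurs in the transcript'."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def train_nbm(docs, labels, vocab, alpha=1.0):
    """Multinomial naive Bayes with add-alpha smoothing over `vocab`."""
    log_prior, log_like = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d if w in vocab)
        denom = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_like

def predict_nbm(model, doc):
    """Return the class with the highest posterior log score.

    Words outside the training vocabulary are ignored.
    """
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in doc
                                    if w in log_like[c])
              for c in log_prior}
    return max(scores, key=scores.get)

# Invented toy transcripts (NOT data from the study), already tokenised.
docs = [
    ["the", "boy", "is", "taking", "a", "thing"],
    ["something", "is", "there", "i", "think", "thing"],
    ["the", "boy", "reaches", "for", "the", "cookie", "jar"],
    ["water", "overflows", "from", "the", "sink"],
]
labels = ["SD", "SD", "control", "control"]

# Keep only words whose IG exceeds a hypothetical cut-off.
min_ig = 0.1
vocab = {w for d in docs for w in d}
selected = {w for w in vocab if information_gain(docs, labels, w) > min_ig}

model = train_nbm(docs, labels, selected)
print(predict_nbm(model, ["a", "thing", "is", "there"]))
print(predict_nbm(model, ["the", "cookie", "jar"]))
```

With samples as small as those described above (42 and 32 participants), accuracy would normally be estimated by some form of cross-validation rather than on the training transcripts; the abstract does not state which evaluation scheme was used, so none is sketched here.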