Discovering the Context of People With Disabilities: Semantic Categorization Test and Environmental Factors Mapping of Word Embeddings from Reddit.

Authors

Garcia-Rudolph Alejandro, Saurí Joan, Cegarra Blanca, Bernabeu Guitart Montserrat

Affiliations

Institut Guttmann Hospital de Neurorehabilitacio, Badalona, Spain.

Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain.

Publication information

JMIR Med Inform. 2020 Nov 20;8(11):e17903. doi: 10.2196/17903.

Abstract

BACKGROUND

The World Health Organization's International Classification of Functioning, Disability and Health (ICF) conceptualizes disability not solely as a problem that resides in the individual, but as a health experience that occurs in a context. Word embeddings build on the idea that words that occur in similar contexts tend to have similar meanings. Although both share "context" as a key component, word embeddings have rarely been applied in disability research. In this work, we propose using social media (specifically, Reddit) to link the two.
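The idea that words occurring in similar contexts have similar meanings can be illustrated with plain co-occurrence counts. The following is a minimal sketch, not the study's method (which uses word2vec): the toy sentences, the window size, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, invented for this sketch (not Reddit data).
toy_corpus = [
    "my wheelchair fits through the door",
    "my scooter fits through the door",
    "the ramp was icy this morning",
]

vectors = context_vectors(toy_corpus)
# "wheelchair" and "scooter" share the same neighbours, so they score high;
# "wheelchair" and "icy" share none, so they score zero.
sim_related = cosine_similarity(vectors["wheelchair"], vectors["scooter"])
sim_unrelated = cosine_similarity(vectors["wheelchair"], vectors["icy"])
```

Word2vec learns dense vectors rather than raw counts, but the similarity comparison works on the same principle.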

OBJECTIVE

The objective of our study is to train a word-association model on a small dataset (a subreddit on disability) that is nonetheless able to retrieve meaningful content. This content will be formally validated and then used to discover related terms in the disability subreddit corpus that represent the physical, social, and attitudinal environment of people with disabilities, as defined by a formal framework such as the ICF.

METHODS

Reddit data were collected from pushshift.io using the pushshiftr R package as a wrapper. A word2vec model was trained on the disability subreddit comments with the wordVectors R package, and a preliminary validation was performed using a subset of Mikolov analogies. We used Van Overschelde's updated and expanded version of the Battig and Montague norms to perform a semantic categories test. Silhouette coefficients were calculated using the cosine distance from the wordVectors R package. For each of the 5 ICF environmental factors (EF), we selected representative subcategories addressing different aspects of daily living (ADLs). For each subcategory, we identified specific terms from its formal ICF definition and ran the word2vec model to generate their nearest semantic terms, which we then validated against publicly available evidence. Finally, we applied the model to a specific subcategory of an EF involved in a relevant use case in the field of rehabilitation.
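The silhouette step can be sketched as follows. This is a minimal illustration assuming the standard silhouette definition, s = (b - a) / max(a, b), with cosine distance. It is written in Python rather than R, and the two-dimensional vectors are invented placeholders, not embeddings from the trained model.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """1 minus the cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_by_category(categories):
    """Mean silhouette coefficient per category.

    `categories` maps a category label to a list of word vectors and must
    contain at least two categories.
    """
    scores = {}
    for label, members in categories.items():
        values = []
        for i, vec in enumerate(members):
            same = [m for j, m in enumerate(members) if j != i]
            if not same:
                values.append(0.0)  # usual convention for a singleton
                continue
            # a: mean distance to the other members of the same category
            a = mean(cosine_distance(vec, m) for m in same)
            # b: smallest mean distance to the members of any other category
            b = min(
                mean(cosine_distance(vec, m) for m in other_members)
                for other_label, other_members in categories.items()
                if other_label != label
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        scores[label] = mean(values)
    return scores


# Hypothetical 2-D vectors for two category labels (illustration only).
toy_categories = {
    "A relative": [(1.0, 0.1), (0.9, 0.2)],
    "A sport": [(0.1, 1.0), (0.2, 0.9)],
}
scores = silhouette_by_category(toy_categories)
```

Values near 1 indicate that a category's words sit close to each other and far from other categories; values near 0 or below indicate overlap.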

RESULTS

We analyzed 96,314 comments posted by 10,411 Redditors between February 2009 and December 2019. We trained word2vec and identified more than 30 analogies (eg, breakfast - 8 am + 8 pm = dinner). The semantic categorization test showed promising results over 60 categories (eg, silhouette coefficients of s(A relative)=0.562 and s(A sport)=0.475) and also yielded notable explanations for low s values. We mapped the representative subcategories of all EF chapters and obtained the closest terms for each, which we confirmed with publications. This allowed immediate access (≤2 seconds) to terms related to ADLs, ranging from apps "to know accessibility before you go" to adapted sports (boccia). For example, for the support and relationships EF subcategory, the closest term discovered by our model was "resilience," which has recently come to be regarded as a key feature of rehabilitation but does not yet have a unified definition. Our model retrieved the 10 terms closest to "resilience," which we validated with publications and which contribute to its definition.
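The analogy arithmetic and the closest-term lookup reported above can be sketched as follows. The vectors are hand-crafted toy values chosen so that the analogy works out exactly; they are not output from the trained model, and the function names are hypothetical rather than the wordVectors API.

```python
import math

# Hand-crafted toy vectors (dimensions: meal, morning, evening).
# These are assumptions for illustration, NOT embeddings from the study.
toy_embeddings = {
    "breakfast": (1.0, 1.0, 0.0),
    "dinner": (1.0, 0.0, 1.0),
    "lunch": (1.0, 0.5, 0.5),
    "8am": (0.0, 1.0, 0.0),
    "8pm": (0.0, 0.0, 1.0),
}


def cosine_similarity(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_terms(target, embeddings, exclude=(), k=3):
    """Return the k words whose vectors are most similar to `target`."""
    candidates = [
        (word, cosine_similarity(target, vec))
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]


def analogy(start, minus, plus, embeddings, k=3):
    """Rank words by similarity to vec(start) - vec(minus) + vec(plus)."""
    target = tuple(
        s - m + p
        for s, m, p in zip(embeddings[start], embeddings[minus], embeddings[plus])
    )
    # The query words themselves are excluded from the candidate list.
    return closest_terms(target, embeddings, exclude={start, minus, plus}, k=k)


result = analogy("breakfast", "8am", "8pm", toy_embeddings)
print(result[0][0])  # dinner
```

The same `closest_terms` lookup, applied to the vector of a single query word, corresponds to retrieving the nearest semantic terms for an ICF subcategory term.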

CONCLUSIONS

This study opens up opportunities for exploration and discovery using a word2vec model trained on a small disability dataset. The model gives immediate access to accurate terms related to ADLs within the ICF framework, many of which were previously unknown to the authors.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa60/7718084/4c06af1caa54/medinform_v8i11e17903_fig1.jpg
