Extracting Multiple Worries From Breast Cancer Patient Blogs Using Multilabel Classification With the Natural Language Processing Model Bidirectional Encoder Representations From Transformers: Infodemiology Study of Blogs.

Author information

Watanabe Tomomi, Yada Shuntaro, Aramaki Eiji, Yajima Hiroshi, Kizaki Hayato, Hori Satoko

Affiliations

Division of Drug Informatics, Keio University Faculty of Pharmacy, Tokyo, Japan.

Nara Institute of Science and Technology, Nara, Japan.

Publication information

JMIR Cancer. 2022 Jun 3;8(2):e37840. doi: 10.2196/37840.

Abstract

BACKGROUND

Patients with breast cancer have a variety of worries and need multifaceted information support. Their accumulated posts on social media contain rich descriptions of their daily worries concerning issues such as treatment, family, and finances. It is important to identify these issues to help patients with breast cancer to resolve their worries and obtain reliable information.

OBJECTIVE

This study aimed to extract and classify multiple worries from text generated by patients with breast cancer using Bidirectional Encoder Representations From Transformers (BERT), a context-aware natural language processing model.

METHODS

A total of 2272 blog posts by patients with breast cancer in Japan were collected. Five worry labels, "treatment," "physical," "psychological," "work/financial," and "family/friends," were defined and assigned to each post. Multiple labels were allowed. To assess the label criteria, 50 blog posts were randomly selected and annotated by two researchers with medical knowledge. After the interannotator agreement had been assessed by means of Cohen kappa, one researcher annotated all the blog posts. A multilabel classifier that simultaneously predicts the five worries in a text was developed using BERT. The classifier was built by adding a classification layer to the pretrained BERT and fine-tuning it with the posts as input. Performance was evaluated in terms of precision, averaged over the results of 5-fold cross-validation.
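The annotation scheme described in the Methods can be illustrated with a minimal Python sketch. It shows how a post carrying several worry labels might be encoded as a multi-hot vector and how Cohen kappa could be computed for one label from two annotators' binary decisions. The function names and the toy annotations are hypothetical and are not taken from the study; only the five label names come from the abstract.

```python
# Illustrative sketch only: helper names and toy data are assumptions,
# not the authors' code. The five label names come from the abstract.
LABELS = ["treatment", "physical", "psychological",
          "work/financial", "family/friends"]


def to_multi_hot(post_labels):
    """Encode a set of worry labels as a 0/1 vector in LABELS order."""
    return [1 if name in post_labels else 0 for name in LABELS]


def cohen_kappa(a, b):
    """Cohen kappa for two annotators' binary (0/1) decisions on the same posts."""
    n = len(a)
    # Observed agreement: fraction of posts where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's rate of assigning the label.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both annotators gave the same single answer to every post
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four posts for a single label.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Because multiple labels are allowed, kappa is computed separately for each label, treating every label as its own yes/no decision per post.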

RESULTS

Among the blog posts, 477 included "treatment," 1138 included "physical," 673 included "psychological," 312 included "work/financial," and 283 included "family/friends." The interannotator agreement values were 0.67 for "treatment," 0.76 for "physical," 0.56 for "psychological," 0.73 for "work/financial," and 0.73 for "family/friends," indicating a high degree of agreement. Among all blog posts, 544 contained no label, 892 contained one label, and 836 contained multiple labels. Worries varied from user to user, and the worries posted by the same user changed over time. The model performed well, although prediction performance differed by label. Precision was 0.59 for "treatment," 0.82 for "physical," 0.64 for "psychological," 0.67 for "work/financial," and 0.58 for "family/friends." The higher the interannotator agreement and the greater the number of posts for a label, the higher the precision tended to be.
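The per-label precision values reported here can be related to the classifier's outputs with a short Python sketch. It assumes, as is common for multilabel classifiers but not stated in the abstract, that each label's score is passed through a sigmoid and thresholded at 0.5. The function names, the threshold, and the toy fold data are all hypothetical.

```python
import math

# Illustrative sketch only: the sigmoid output, the 0.5 threshold, and the
# toy data are assumptions; the abstract does not specify these details.


def sigmoid(z):
    """Map a raw score (logit) to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(logits, threshold=0.5):
    """Turn one logit per label into independent 0/1 decisions."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def precision(y_true, y_pred):
    """Precision for one label: true positives / all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_precision_over_folds(folds):
    """Average one label's precision over cross-validation folds.

    `folds` is a list of (y_true, y_pred) pairs, one pair per fold.
    """
    scores = [precision(y_true, y_pred) for y_true, y_pred in folds]
    return sum(scores) / len(scores)


# Hypothetical results for a single label over two folds.
toy_folds = [
    ([1, 0, 1, 0], [1, 1, 1, 0]),  # 2 true positives, 1 false positive
    ([1, 1], [1, 1]),              # 2 true positives, 0 false positives
]
average = mean_precision_over_folds(toy_folds)
```

Each label is scored independently, which is why a single post can receive zero, one, or several labels, and why precision is reported separately for each of the five worries.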

CONCLUSIONS

This study showed that the BERT model can extract multiple worries from text generated by patients with breast cancer. This is the first application of a multilabel classifier using the BERT model to extract multiple worries from patient-generated text. The results will help to identify the worries of patients with breast cancer and to provide them with timely social support.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f5e/9206207/fe4c1911a918/cancer_v8i2e37840_fig1.jpg