Madanay Farrah, Tu Karissa, Campagna Ada, Davis J Kelly, Doerstling Steven S, Chen Felicia, Ubel Peter A
Sanford School of Public Policy, Duke University, Durham, NC, United States.
Center for Bioethics and Social Sciences in Medicine, University of Michigan Medical School, Ann Arbor, MI, United States.
J Med Internet Res. 2024 Aug 1;26:e50236. doi: 10.2196/50236.
Patients increasingly rely on web-based physician reviews to choose a physician and share their experiences. However, the unstructured text of these written reviews presents a challenge for researchers seeking to make inferences about patients' judgments. Methods previously used to identify patient judgments within reviews, such as hand-coding and dictionary-based approaches, have limited both sample size and classification accuracy. Advanced natural language processing methods can help overcome these limitations and promote further analysis of physician reviews on these popular platforms.
This study aims to train, test, and validate an advanced natural language processing algorithm for classifying the presence and valence of 2 dimensions of patient judgments in web-based physician reviews: interpersonal manner and technical competence.
We sampled 345,053 reviews for 167,150 physicians across the United States from Healthgrades.com, a commercial web-based physician rating and review website. We hand-coded 2000 written reviews and used those reviews to train and test a transformer classification algorithm called the Robustly Optimized BERT (Bidirectional Encoder Representations from Transformers) Pretraining Approach (RoBERTa). The 2 fine-tuned models coded the reviews for the presence and positive or negative valence of patients' interpersonal manner or technical competence judgments of their physicians. We evaluated the performance of the 2 models against 200 hand-coded reviews and validated the models using the full sample of 345,053 RoBERTa-coded reviews.
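Fine-tuning a pretrained RoBERTa checkpoint for this kind of review classification can be sketched with the Hugging Face Transformers library. The abstract does not report the authors' label scheme, checkpoint, or hyperparameters, so everything below (the three-way absent/positive/negative labels, `roberta-base`, epoch count, batch size) is an illustrative assumption, not the paper's actual setup; one such model would be trained per judgment dimension.

```python
# Illustrative sketch only: the label scheme, checkpoint, and hyperparameters
# below are assumptions, since the abstract does not specify them.
# One classifier per judgment dimension (interpersonal manner, technical
# competence), each coding presence and valence jointly as one 3-way label.
LABELS = {"absent": 0, "positive": 1, "negative": 2}

def fine_tune(train_texts, train_labels, model_dir="manner-model"):
    """Fine-tune a RoBERTa sequence classifier on hand-coded reviews.

    Heavy dependencies are imported lazily so the label scheme above can
    be inspected without installing transformers/torch.
    """
    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoTokenizer,
                              AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=len(LABELS))

    class ReviewDataset(Dataset):
        """Wraps tokenized review texts and integer labels for the Trainer."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    args = TrainingArguments(output_dir=model_dir,
                             num_train_epochs=3,           # assumed
                             per_device_train_batch_size=16)  # assumed
    Trainer(model=model, args=args,
            train_dataset=ReviewDataset(train_texts, train_labels)).train()
    model.save_pretrained(model_dir)
    tokenizer.save_pretrained(model_dir)
```

The same function would be called twice, once with the interpersonal-manner labels and once with the technical-competence labels, yielding the study's two fine-tuned models.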
The interpersonal manner model was 90% accurate, with precision of 0.89, recall of 0.90, and weighted F-score of 0.89. The technical competence model was 90% accurate, with precision of 0.91, recall of 0.90, and weighted F-score of 0.90. Positive-valence judgments were associated with higher review star ratings, whereas negative-valence judgments were associated with lower star ratings. Analyses of the data by review rating and physician gender were consistent with findings in prior literature.
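The weighted metrics reported above average per-class precision, recall, and F-score by class support, the same scheme as scikit-learn's `average="weighted"`. A minimal plain-Python sketch, using a toy three-label example rather than the paper's data, illustrates the computation:

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall, and F1."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    p_sum = r_sum = f_sum = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n          # weight each class by its support
        p_sum += w * prec
        r_sum += w * rec
        f_sum += w * f1
    return p_sum, r_sum, f_sum

# Toy labels echoing a presence/valence scheme (absent, positive, negative);
# these are invented values, not the study's coded reviews.
y_true = ["pos", "pos", "neg", "absent", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "absent", "neg", "pos"]
p, r, f = weighted_prf(y_true, y_pred)
accuracy = sum(t == y for t, y in zip(y_true, y_pred)) / len(y_true)
```

On this toy example the helper yields weighted precision 8/9, with recall, weighted F1, and accuracy all 5/6; note that with class imbalance, weighted averaging can make precision, recall, and F-score diverge even at the same accuracy, as in the interpersonal manner model's 0.89 precision versus 0.90 recall.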
Our 2 classification models coded interpersonal manner and technical competence judgments with high precision, recall, and accuracy. These models were validated using review star ratings and results from previous research. RoBERTa can accurately classify unstructured, web-based review text at scale. Future work could explore the use of this algorithm with other textual data, such as social media posts and electronic health records.