Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong.
Clin Orthop Relat Res. 2024 Apr 1;482(4):578-588. doi: 10.1097/CORR.0000000000002995. Epub 2024 Mar 1.
BACKGROUND: The lay public is increasingly using ChatGPT (a large language model) as a source of medical information. Traditional search engines such as Google return several distinct results for each search query and indicate the source of each, but ChatGPT responds in a single prose paragraph without citing the sources it used, which makes it difficult or impossible to ascertain whether those sources are reliable. One practical method to infer the sources ChatGPT draws on is text network analysis. By understanding how ChatGPT's use of source information compares with that of traditional search engines, physicians and physician organizations can better counsel patients on the use of this new tool.
QUESTIONS/PURPOSES: (1) In terms of key content words, how similar are ChatGPT and Google Search responses to queries on topics in orthopaedic surgery? (2) Does the source distribution (academic, governmental, commercial, or material in the form of a scientific manuscript) of Google Search results differ with the topic's level of medical consensus, and how is this reflected in the text similarity between ChatGPT and Google Search responses? (3) Do these results differ between versions of ChatGPT?
METHODS: We evaluated three search queries relating to orthopaedic conditions: "What is the cause of carpal tunnel syndrome?," "What is the cause of tennis elbow?," and "Platelet-rich plasma for thumb arthritis?" These were selected because of their relatively high, medium, and low levels of consensus in the medical evidence, respectively. Each question was posed 20 times to ChatGPT version 3.5 and 20 times to version 4.0, for a total of 120 responses. Text network analysis using term frequency-inverse document frequency (TF-IDF) was used to compare the text similarity of responses from ChatGPT and Google Search. In information retrieval, TF-IDF is a weighted statistical measure of how important a keyword is to a document within a collection of documents, and it is most often used to compare and rank documents by text similarity. Summing the TF-IDF scores of all keywords shared between a ChatGPT response and a Google Search result yields a similarity score for that pair; a higher summed score indicates greater text similarity, from which relative content similarity can be inferred. To answer our first question, we computed these similarity scores between each ChatGPT response and each of the top 20 Google Search results for the same question. To provide a reference point for interpreting TF-IDF values, we generated randomized text samples with the same term distribution as the Google Search results.
By comparing ChatGPT TF-IDF scores with those of the random text samples, we could assess whether the observed values were statistically distinguishable from values obtained by chance, which also let us test whether text similarity is an appropriate quantitative proxy for relative content similarity. To answer our second question, we classified the Google Search results to better understand sourcing. Google Search provides 20 or more distinct sources of information, whereas ChatGPT gives only a single prose response to each query. We therefore used TF-IDF to ascertain whether each ChatGPT response was principally driven by one of four source categories: academic, government, commercial, or material that took the form of a scientific manuscript but was not peer-reviewed or indexed on a government site (such as PubMed). We then compared the TF-IDF similarity between ChatGPT responses and each source category. To answer our third question, we repeated both analyses with ChatGPT 3.5 and ChatGPT 4.0 and compared the results.
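The TF-IDF scoring described above can be sketched in a few lines of standard-library Python. This is a minimal illustration with toy tokens, not the authors' actual pipeline (which the abstract does not publish); the smoothed-IDF formula and the unnormalized shared-keyword dot product are assumptions for the sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector (term -> weight) for each tokenized document."""
    n = len(docs)
    df = Counter()  # document frequency: in how many docs each term appears
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF so terms present in every document keep a small weight
    # (an assumed variant; the paper does not specify its IDF formula).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors

def similarity(a, b):
    """Sum the products of weights for terms shared by two TF-IDF vectors."""
    return sum(w * b[t] for t, w in a.items() if t in b)

# Example: score two hypothetical "search results" against a "response".
response = ["carpal", "tunnel", "median", "nerve", "compression"]
results = [["carpal", "tunnel", "wrist", "nerve"], ["tennis", "elbow", "tendon"]]
vecs = tfidf_vectors([response] + results)
scores = [similarity(vecs[0], v) for v in vecs[1:]]
```

Under this sketch, a result sharing more distinctive keywords with the response receives a higher summed score, which is the sense in which the study ranks Google Search results by similarity to each ChatGPT response.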
RESULTS: The ChatGPT response was dominated by the top Google Search result. For example, for carpal tunnel syndrome, the top result was an academic website with a mean TF-IDF of 7.2, and a similar pattern was observed for the other search topics. As a reference point for interpreting TF-IDF values, a randomly generated text sample compared with Google Search results had a mean TF-IDF of 2.7 ± 1.9, controlling for text length and keyword distribution. The observed TF-IDF distribution was higher for ChatGPT responses than for random text samples, supporting the claim that keyword text similarity is a measure of relative content similarity. When comparing source distributions, the ChatGPT response was most similar to the most common source category among the Google Search results. For the topic with strong consensus (carpal tunnel syndrome), the ChatGPT response was most similar to high-quality academic sources rather than lower-quality commercial sources (TF-IDF 8.6 versus 2.2). For the topic with low consensus, the ChatGPT response paralleled lower-quality commercial websites rather than higher-quality academic websites (TF-IDF 14.6 versus 0.2). ChatGPT 4.0 had higher text similarity to Google Search results than ChatGPT 3.5 (mean increase in TF-IDF similarity of 0.80 to 0.91; p < 0.001). The ChatGPT 4.0 response was still dominated by the top Google Search result and reflected the most common source category for all search topics.
CONCLUSION: ChatGPT responses are similar to individual Google Search results for queries related to orthopaedic surgery, but the distribution of source information can vary substantially with the level of consensus on a topic. For carpal tunnel syndrome, where there is widely accepted medical consensus, ChatGPT responses were most similar to academic sources, suggesting that it drew on those sources more heavily. When fewer academic or government sources were available, especially for our query about platelet-rich plasma, ChatGPT appears to have relied more heavily on a small number of nonacademic sources. These findings persisted as ChatGPT was updated from version 3.5 to version 4.0.
CLINICAL RELEVANCE: Physicians should be aware that ChatGPT and Google likely draw on the same sources for a given question. The main difference is that ChatGPT aggregates multiple sources into a single response, whereas Google presents multiple results separately. For topics with low consensus, and therefore few high-quality sources, ChatGPT is much more likely to rely on less-reliable sources; in those cases, physicians should take the time to educate patients on the topic or direct them to more reliable resources. Physician organizations should state clearly when the evidence is limited so that ChatGPT can reflect that lack of quality information or evidence.