
Readability and topics of the German Health Web: Exploratory study and text analysis.

Author information

Department of Medical Informatics, Heilbronn University, Heilbronn, Germany.

Center for Machine Learning, Heilbronn University, Heilbronn, Germany.

Publication information

PLoS One. 2023 Feb 10;18(2):e0281582. doi: 10.1371/journal.pone.0281582. eCollection 2023.

Abstract

BACKGROUND

The internet has become an increasingly important resource for health information, especially for lay people. However, the information found does not necessarily match the user's health literacy level. Therefore, it is vital to (1) identify prominent information providers, (2) quantify the readability of written health information, and (3) analyze how well different types of information sources suit people with differing health literacy levels.

OBJECTIVE

In previous work, we showed the use of a focused crawler to "capture" and describe a large sample of the "German Health Web", which we call the "Sampled German Health Web" (sGHW). It includes health-related web content from the three predominantly German-speaking countries Germany, Austria, and Switzerland, i.e. the country-code top-level domains (ccTLDs) ".de", ".at" and ".ch". Based on the crawled data, we now provide (1) a fully automated readability and vocabulary analysis of a subsample of the sGHW and (2) an analysis of the sGHW's graph structure covering its size, its content providers, and the ratio of public to private stakeholders. In addition, we apply Latent Dirichlet Allocation (LDA) to identify topics and themes within the sGHW.

METHODS

Important web sites were identified by applying PageRank to the sGHW's graph representation. LDA was used to discover topics within the top-ranked web sites. Next, a computer-based readability and vocabulary analysis was performed on each health-related web page. Flesch Reading Ease (FRE) and the fourth Vienna formula (Wiener Sachtextformel, WSTF) were used to assess readability. Vocabulary was assessed by a specifically trained Support Vector Machine classifier.
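To make the two readability measures concrete, the following is a minimal sketch of how they could be computed for a German text. It rests on several assumptions that are not stated in the abstract: that FRE is computed with Amstad's German adaptation (180 − ASL − 58.5 × ASW), that WSTF uses the commonly cited coefficients of the fourth Wiener Sachtextformel, and that syllables are approximated by counting vowel groups. The function names are hypothetical, and the tokenization is deliberately naive (abbreviations such as "z.B." would be split into separate sentences).

```python
import re

# Assumption: one syllable per run of vowels. This is a rough heuristic,
# not a linguistically exact German syllable counter.
VOWEL_GROUP = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)


def count_syllables(word: str) -> int:
    """Approximate the number of syllables in a word (at least 1)."""
    return max(1, len(VOWEL_GROUP.findall(word)))


def split_words(text: str) -> list[str]:
    """Return alphabetic tokens, including German umlauts and eszett."""
    return re.findall(r"[A-Za-zÄÖÜäöüß]+", text)


def split_sentences(text: str) -> list[str]:
    """Naively split on sentence-ending punctuation."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def _counts(text: str) -> tuple[list[str], int]:
    words = split_words(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return words, len(sentences)


def fre_german(text: str) -> float:
    """Flesch Reading Ease, Amstad's German variant (assumed): higher = easier."""
    words, n_sentences = _counts(text)
    asl = len(words) / n_sentences                               # avg. sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)    # avg. syllables/word
    return 180.0 - asl - 58.5 * asw


def wstf4(text: str) -> float:
    """Fourth Wiener Sachtextformel (assumed coefficients): higher = harder."""
    words, n_sentences = _counts(text)
    sl = len(words) / n_sentences                                # avg. sentence length
    long_words = sum(1 for w in words if count_syllables(w) >= 3)
    ms = 100.0 * long_words / len(words)                         # % words with >= 3 syllables
    return 0.2656 * sl + 0.2744 * ms - 1.693


if __name__ == "__main__":
    sample = "Der Arzt hilft. Die Patientin liest Gesundheitsinformationen."
    print(f"FRE  = {fre_german(sample):.2f}")
    print(f"WSTF = {wstf4(sample):.2f}")
```

The two scales run in opposite directions: a lower FRE and a higher WSTF both indicate text that is harder to read, which is why the study reports difficulty in terms of low FRE and high WSTF values.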

RESULTS

In total, n = 14,193,743 health-related web pages were collected during the study period of 370 days. The resulting host-aggregated web graph comprises 231,733 nodes connected via 429,530 edges (network diameter = 25; average path length = 6.804; average degree = 1.854; modularity = 0.723). Among the 3000 top-ranked pages (1000 per ccTLD according to PageRank), 18.50% (555/3000) belong to web sites from governmental or public institutions, 18.03% (541/3000) from nonprofit organizations, 54.03% (1621/3000) from private organizations, 4.07% (122/3000) from news agencies, 3.87% (116/3000) from pharmaceutical companies, 0.90% (27/3000) from private bloggers, and 0.60% (18/3000) from others. LDA identified 50 topics, which we grouped into 11 themes: "Research & Science", "Illness & Injury", "The State", "Healthcare structures", "Diet & Food", "Medical Specialities", "Economy", "Food production", "Health communication", "Family" and "Other". The most prevalent themes were "Research & Science" and "Illness & Injury", accounting for 21.04% and 17.92% of all topics across all ccTLDs and provider types, respectively. Our readability analysis reveals that the majority of the collected web sites are structurally difficult or very difficult to read: 84.63% (2539/3000) scored a WSTF ≥ 12, and 89.70% (2691/3000) scored a FRE ≤ 49. Moreover, our vocabulary analysis shows that 44.00% (1320/3000) of the web sites use vocabulary that is well suited for a lay audience.
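The selection of top-ranked hosts per ccTLD can be illustrated with a small sketch. It uses a plain power-iteration PageRank on a toy host graph and then picks the k best-ranked hosts for each ccTLD. The host names, the damping factor of 0.85, the iteration count, and the function names are all illustrative assumptions; the abstract does not specify how PageRank was configured or which software computed it. The last lines also check that the reported average degree is consistent with dividing the edge count by the node count.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over directed (source, target) host pairs."""
    nodes = sorted({host for edge in edges for host in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {host: 1.0 / n for host in nodes}
    for _ in range(iterations):
        # Rank held by hosts without outgoing links is spread evenly.
        dangling = sum(rank[h] for h in nodes if not out_links[h])
        new_rank = {h: (1.0 - damping) / n + damping * dangling / n for h in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


def top_hosts_per_cctld(rank, k):
    """Return the k highest-ranked hosts for each ccTLD (last host label)."""
    by_tld = defaultdict(list)
    for host, score in rank.items():
        by_tld[host.rsplit(".", 1)[-1]].append((score, host))
    return {
        tld: [host for _, host in sorted(items, key=lambda x: (-x[0], x[1]))[:k]]
        for tld, items in by_tld.items()
    }


# Hypothetical toy graph; these hosts are placeholders, not from the study.
toy_edges = [
    ("a.de", "hub.de"), ("b.de", "hub.de"), ("hub.de", "x.at"),
    ("y.at", "x.at"), ("x.at", "hub.de"), ("z.ch", "hub.de"),
]
toy_rank = pagerank(toy_edges)
toy_top = top_hosts_per_cctld(toy_rank, k=1)

# Consistency check on the figures reported in the Results.
average_degree = 429_530 / 231_733

if __name__ == "__main__":
    print(toy_top)
    print(f"average degree = {average_degree:.3f}")
```

In the study the same idea is applied with k = 1000 for each of ".de", ".at" and ".ch", which yields the 3000 top-ranked pages analyzed above.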

CONCLUSIONS

We were able to identify major information hubs as well as topics and themes within the sGHW. Results indicate that the readability within the sGHW is low. As a consequence, patients may face barriers, even though the vocabulary used seems appropriate from a medical perspective. In future work, we intend to extend our analyses to identify trustworthy health information web sites.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/506e/9916670/f55b05684244/pone.0281582.g001.jpg
