BACKGROUND

The internet has become an increasingly important resource for health information, especially for lay people. However, the information found does not necessarily match the user's health literacy level. Therefore, it is vital to (1) identify prominent information providers, (2) quantify the readability of written health information, and (3) analyze how well different types of information sources suit people with differing health literacy levels.

OBJECTIVE

In previous work, we demonstrated how a focused crawler can "capture" and describe a large sample of the "German Health Web", which we call the "Sampled German Health Web" (sGHW). It includes health-related web content from the three predominantly German-speaking countries Germany, Austria, and Switzerland, i.e., the country-code top-level domains (ccTLDs) ".de", ".at", and ".ch". Based on the crawled data, we now provide a fully automated readability and vocabulary analysis of a subsample of the sGHW, together with an analysis of the sGHW's graph structure covering its size, its content providers, and the ratio of public to private stakeholders. In addition, we apply Latent Dirichlet Allocation (LDA) to identify topics and themes within the sGHW.

METHODS

Important web sites were identified by applying PageRank to the sGHW's graph representation. LDA was used to discover topics within the top-ranked web sites. Next, a computer-based readability and vocabulary analysis was performed on each health-related web page. Readability was assessed with the Flesch Reading Ease (FRE) and the 4th Vienna formula (WSTF). Vocabulary was assessed with a specifically trained Support Vector Machine classifier.
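The ranking step can be illustrated with a minimal sketch of PageRank by power iteration over a host-level edge list, followed by a per-ccTLD selection of the best-ranked hosts. This is not the study's implementation: the damping factor of 0.85, the handling of hosts without outgoing links, the helper names, and the `example` host names are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over directed (source_host, target_host) edges."""
    nodes = sorted({host for edge in edges for host in edge})
    out_links = {host: [] for host in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {host: 1.0 / n for host in nodes}
    for _ in range(iterations):
        # Rank held by hosts without outgoing links is spread evenly (assumption).
        dangling = sum(rank[h] for h in nodes if not out_links[h])
        new_rank = {h: (1.0 - damping) / n + damping * dangling / n for h in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[h] - rank[h]) for h in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def top_hosts_per_cctld(rank, k):
    """Group hosts by their last domain label and keep the k best-ranked per group."""
    by_tld = {}
    for host, score in rank.items():
        by_tld.setdefault(host.rsplit(".", 1)[-1], []).append((score, host))
    return {
        tld: [host for _, host in sorted(items, reverse=True)[:k]]
        for tld, items in by_tld.items()
    }


# Hypothetical toy graph: three hosts link to one hub, which links back to one of them.
EDGES = [
    ("a.example.de", "hub.example.de"),
    ("b.example.at", "hub.example.de"),
    ("c.example.ch", "hub.example.de"),
    ("hub.example.de", "a.example.de"),
]
ranks = pagerank(EDGES)
top = top_hosts_per_cctld(ranks, k=1)
```

In this toy graph the hub collects the most rank because every other host links to it. Selecting the top entries per ccTLD mirrors the idea of taking the 1000 best-ranked entries for each of ".de", ".at", and ".ch".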

RESULTS

In total, n = 14,193,743 health-related web pages were collected during the study period of 370 days. The resulting host-aggregated web graph comprises 231,733 nodes connected via 429,530 edges (network diameter = 25; average path length = 6.804; average degree = 1.854; modularity = 0.723). Among the 3000 top-ranked pages (1000 per ccTLD according to PageRank), 18.50% (555/3000) belong to web sites of governmental or public institutions, 18.03% (541/3000) to nonprofit organizations, 54.03% (1621/3000) to private organizations, 4.07% (122/3000) to news agencies, 3.87% (116/3000) to pharmaceutical companies, 0.90% (27/3000) to private bloggers, and 0.60% (18/3000) to others. LDA identified 50 topics, which we grouped into 11 themes: "Research & Science", "Illness & Injury", "The State", "Healthcare structures", "Diet & Food", "Medical Specialities", "Economy", "Food production", "Health communication", "Family", and "Other". The most prevalent themes were "Research & Science" and "Illness & Injury", accounting for 21.04% and 17.92% of all topics across all ccTLDs and provider types, respectively. Our readability analysis reveals that most of the collected web sites are structurally difficult or very difficult to read: 84.63% (2539/3000) scored a WSTF ≥ 12, and 89.70% (2691/3000) scored an FRE ≤ 49. Moreover, our vocabulary analysis shows that 44.00% (1320/3000) of the web sites use vocabulary that is well suited for a lay audience.
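To make the two readability thresholds concrete, the sketch below computes both scores from simple text statistics. It rests on several assumptions that the abstract does not confirm: that FRE is used in Amstad's German adaptation (180 − ASL − 58.5 × ASW), that the 4th Vienna formula is applied in its commonly cited form (0.2656 × SL + 0.2744 × MS − 1.693), and that syllables can be approximated by counting vowel runs, which is a crude heuristic rather than real hyphenation. All function names are hypothetical.

```python
import re

VOWEL_RUN = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)
WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def count_syllables(word):
    """Crude estimate: one syllable per run of vowels (assumption, not a hyphenator)."""
    return max(1, len(VOWEL_RUN.findall(word)))


def text_stats(text):
    """Average sentence length, syllables per word, and share of words with 3+ syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD.findall(text)
    syllables = [count_syllables(w) for w in words]
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_syllables_per_word": sum(syllables) / len(words),
        "pct_words_3plus_syllables": 100.0 * sum(s >= 3 for s in syllables) / len(words),
    }


def fre_german(stats):
    """Flesch Reading Ease in Amstad's German adaptation (assumed variant)."""
    return 180.0 - stats["avg_sentence_length"] - 58.5 * stats["avg_syllables_per_word"]


def wstf4(stats):
    """4th Vienna formula in its commonly cited form (assumed coefficients)."""
    return (0.2656 * stats["avg_sentence_length"]
            + 0.2744 * stats["pct_words_3plus_syllables"]
            - 1.693)


def difficulty_flags(stats):
    """Apply the thresholds reported above: WSTF >= 12 and FRE <= 49."""
    return {"wstf_difficult": wstf4(stats) >= 12, "fre_difficult": fre_german(stats) <= 49}


# Illustrative statistics, not taken from the study data.
sample_stats = {
    "avg_sentence_length": 20.0,
    "avg_syllables_per_word": 2.0,
    "pct_words_3plus_syllables": 30.0,
}
flags = difficulty_flags(sample_stats)
parsed = text_stats("Der Arzt hilft. Die Untersuchung ist wichtig.")
```

The two scores run in opposite directions: a higher WSTF indicates a harder text, whereas a higher FRE indicates an easier one, which is why one threshold is a lower bound and the other an upper bound.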

CONCLUSIONS

We were able to identify major information hubs as well as topics and themes within the sGHW. The results indicate that readability within the sGHW is low. As a consequence, patients may face barriers, even though the vocabulary used seems appropriate from a medical perspective. In future work, we intend to extend our analyses to identify trustworthy health information web sites.
