Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports.

Author affiliations

Australian Institute of Health Innovation, Macquarie University, Sydney, Australia; Centre for Big Data Research in Health, UNSW, Sydney, Australia.

Australian Institute of Health Innovation, Macquarie University, Sydney, Australia; Westmead Applied Research Centre, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia.

Publication information

Int J Med Inform. 2021 Jan;145:104324. doi: 10.1016/j.ijmedinf.2020.104324. Epub 2020 Nov 2.

Abstract

BACKGROUND

Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.

OBJECTIVE

This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language in which word frequency follows a discrete power-law distribution.
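
For reference, the standard textbook forms of the two statements in this objective can be written as follows. The symbols (rank r, frequency f, exponents s and α, lower cutoff x_min) are notational assumptions made here for illustration; the abstract itself does not define them.

```latex
% Zipf's law in rank-frequency form: the r-th most frequent word has
% frequency f(r), with s close to 1 for natural language.
f(r) \propto r^{-s}, \qquad s \approx 1

% Discrete power-law distribution over word frequencies x >= x_min,
% normalised by the Hurwitz zeta function.
p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})},
\qquad
\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}
```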

METHOD

We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.
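
The first three steps of this method (tokenising, counting, and fitting a power law) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the authors' pipeline: the regular-expression tokenizer, the function names, and the toy report strings are all hypothetical, and the exponent is estimated with the well-known approximate maximum-likelihood formula for discrete power laws, which is only accurate when `xmin` is roughly 6 or larger. The comparison against alternative distributions is not implemented here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_frequencies(reports):
    """Count how often each token occurs across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return counts


def rank_frequency(counts):
    """Return (rank, token, frequency) triples, most frequent token first."""
    return [(rank, token, freq)
            for rank, (token, freq) in enumerate(counts.most_common(), start=1)]


def estimate_alpha(frequencies, xmin=1):
    """Approximate MLE of the discrete power-law exponent:
    alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over all x_i >= xmin.
    """
    tail = [x for x in frequencies if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Toy strings for illustration only; these are not MIMIC-III data.
    toy_reports = [
        "the patient was stable",
        "the patient was discharged home",
        "the plan",
    ]
    counts = token_frequencies(toy_reports)
    for rank, token, freq in rank_frequency(counts)[:3]:
        print(rank, token, freq)
    print("alpha estimate:", estimate_alpha(counts.values(), xmin=1))
```

On a corpus this small the estimate is meaningless; the sketch only shows the shape of the computation. A real analysis would also need to select `xmin` from the data and compare candidate distributions with likelihood-ratio tests.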

RESULT

Discharge reports are best fit by the truncated power-law and lognormal distributions. They appear to be near-Zipfian, in that the truncated power-law provides a superior fit over a pure power-law.
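
To make the "near-Zipfian" reading concrete, the usual unnormalised forms of the two best-fitting distributions are given below. The parametrisation (α, λ, μ, σ) is an assumption based on the standard definitions; the abstract does not state which parametrisation was fitted.

```latex
% Truncated power law: a power law with an exponential cutoff at rate lambda.
p(x) \propto x^{-\alpha} e^{-\lambda x}

% Lognormal: ln(x) is normally distributed with mean mu and variance sigma^2.
p(x) \propto \frac{1}{x} \exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)
```

As λ approaches 0 the truncated form reduces to the pure power law, so a good truncated fit with a small λ describes data that are Zipfian over most of their range and fall away only in the extreme tail.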

CONCLUSION

Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
