School of Computing, The Australian National University, Canberra, ACT 2600, Australia.
Scottish Centre for Administrative Data Research (SCADR), University of Edinburgh, UK.
Int J Popul Data Sci. 2023 Jan 31;8(1):2115. doi: 10.23889/ijpds.v8i1.2115. eCollection 2023.
Databases covering all individuals of a population are increasingly used for research and decision-making. The massive size of such databases is often mistaken for a guarantee of valid inferences. However, population data have characteristics that make them challenging to use. Various assumptions about population coverage and data quality are commonly made, including about how such data were captured and what types of processing have been applied to them. Furthermore, the full potential of population data can often only be unlocked when such data are linked to other databases. Record linkage often entails subtle technical problems, which are easily missed. We discuss a diverse range of myths and misconceptions relevant to anybody capturing, processing, linking, or analysing population data. Remarkably, many of these myths and misconceptions are due to the social nature of data collections and are therefore missed by purely technical accounts of data processing. Many are also not well documented in scientific publications. We conclude with a set of recommendations for using population data.