Data61 CSIRO, GPO Box 1538, Hobart, TAS, 7001, Australia.
Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, Helsinki, FIN-00014, Finland.
Ecol Appl. 2021 Sep;31(6):e02360. doi: 10.1002/eap.2360. Epub 2021 Jun 29.
Data are currently being used, and reused, in ecological research at an unprecedented rate. To ensure appropriate reuse, however, we need to ask the question: "Are aggregated databases currently providing the right information to enable effective and unbiased reuse?" We investigate this question, with a focus on designs that purposefully favor the selection of some sampling locations (upweighting their probability of selection). These designs are common; examples include designs with uneven inclusion probabilities and stratified designs. We perform a simulation experiment by creating data sets with progressively more uneven inclusion probabilities and examine the resulting estimates of the average number of individuals per unit area (density). The effect of ignoring the survey design can be profound, with biases of up to 250% in density estimates when naive analytical methods are used. This density estimation bias is not reduced by adding more data. Fortunately, the estimation bias can be mitigated by using an appropriate estimator or an appropriate model that incorporates the design information. These remedies are only available, however, when essential information about the survey design is available: the sample location selection process (e.g., inclusion probabilities) and/or the covariates used in their specification. The results suggest that such information must be stored and served with the data to support meaningful inference and data reuse.
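The mechanism described in the abstract can be illustrated with a small sketch. It is not the authors' simulation: the single covariate `x`, the log-linear density, the Poisson (independent Bernoulli) sampling scheme, the parameter values, and the function name `simulate` are all assumptions chosen for illustration. The Hájek (weighted-mean) estimator stands in here as one example of an estimator that uses the inclusion probabilities; the abstract does not say which estimator the paper uses.

```python
import numpy as np


def simulate(gamma, n_sites=10_000, n_expected=500, n_reps=200, seed=0):
    """Return (naive, design-weighted) relative bias in percent.

    gamma controls how uneven the inclusion probabilities are
    (gamma = 0 gives equal-probability sampling). All settings
    here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical population: density increases with a covariate x.
    x = rng.uniform(0.0, 1.0, n_sites)
    counts = rng.poisson(np.exp(1.0 + 1.5 * x))
    true_mean = counts.mean()

    # Inclusion probabilities favor high-x sites, scaled so that the
    # expected sample size is n_expected.
    w = np.exp(gamma * x)
    pi = n_expected * w / w.sum()
    assert pi.max() <= 1.0, "inclusion probabilities must not exceed 1"

    naive, weighted = [], []
    for _ in range(n_reps):
        # Poisson sampling: each site is included independently with prob pi.
        sampled = rng.random(n_sites) < pi
        y, p = counts[sampled], pi[sampled]
        naive.append(y.mean())                               # ignores the design
        weighted.append(np.sum(y / p) / np.sum(1.0 / p))     # Hajek estimator

    def rel_bias(estimates):
        return 100.0 * (np.mean(estimates) - true_mean) / true_mean

    return rel_bias(naive), rel_bias(weighted)


if __name__ == "__main__":
    for gamma in (0.0, 1.0, 2.0, 3.0):
        naive_bias, weighted_bias = simulate(gamma)
        print(f"gamma={gamma:.1f}  naive bias={naive_bias:6.1f}%  "
              f"weighted bias={weighted_bias:6.1f}%")
```

In this sketch the naive mean drifts further from the truth as `gamma` grows, while the weighted estimator stays close to it. The weighted estimator can only be computed if `pi` (or the covariate used to build it) is stored alongside the counts, which is the point the abstract makes about serving design information with the data.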