Making inference with messy (citizen science) data: when are data accurate enough and how can they be improved?

Affiliations

Department of Forest and Wildlife Ecology, University of Wisconsin-Madison, 1630 Linden Drive, Madison, Wisconsin, 53706, USA.

Office of Applied Sciences, Wisconsin Department of Natural Resources, Madison, Wisconsin, 53716, USA.

Publication information

Ecol Appl. 2019 Mar;29(2):e01849. doi: 10.1002/eap.1849. Epub 2019 Feb 19.

Abstract

Measurement or observation error is common in ecological data: as citizen scientists and automated algorithms play larger roles in processing growing volumes of data to address problems at large scales, concerns about data quality and strategies for improving it have received greater focus. However, practical guidance on the fundamental data quality questions facing data users or managers (how accurate do data need to be, and what is the best or most efficient way to improve them?) remains limited. We present a generalizable framework for evaluating data quality and identifying remediation practices, and demonstrate the framework using trail camera images classified using crowdsourcing to determine acceptable rates of misclassification and identify optimal remediation strategies for analysis using occupancy models. We used expert validation to estimate baseline classification accuracy and simulation to determine the sensitivity of two occupancy estimators (standard and false-positive extensions) to different empirical misclassification rates. We used regression techniques to identify important predictors of misclassification and prioritize remediation strategies. More than 93% of images were accurately classified, but simulation results suggested that most species were not identified accurately enough to permit distribution estimation at our predefined threshold for accuracy (<5% absolute bias). A model developed to screen incorrect classifications predicted misclassified images with >97% accuracy: enough to meet our accuracy threshold. Occupancy models that accounted for false-positive error provided even more accurate inference, including at high rates of misclassification (30%). As simulation suggested occupancy models were less sensitive to additional false-negative error, screening models or fitting occupancy models accounting for false-positive error emerged as efficient data remediation solutions.
Combining simulation-based sensitivity analysis with empirical estimation of baseline error and its variability allows users and managers of potentially error-prone data to identify and fix problematic data more efficiently. It may be particularly helpful for "big data" efforts dependent upon citizen scientists or automated classification algorithms with many downstream users, but given the ubiquity of observation or measurement error, even conventional studies may benefit from focusing more attention upon data quality.
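
The simulation-based sensitivity analysis described above can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a simple data-generating process in which false-positive detections occur only at unoccupied sites, and it fits the standard (no false-positive) occupancy model by a coarse grid search rather than a dedicated occupancy package. The function names (`simulate_counts`, `fit_standard_occupancy`, `mean_bias`) and all parameter values (400 sites, 4 visits, occupancy 0.4, detection 0.5) are hypothetical choices made for illustration only.

```python
import math
import random
from collections import Counter


def simulate_counts(n_sites, n_visits, psi, p, fp, rng):
    """Simulate detections per site; return Counter {detections: number of sites}."""
    counts = Counter()
    for _ in range(n_sites):
        occupied = rng.random() < psi
        # Assumption: true detections at occupied sites occur with probability p,
        # and false positives at unoccupied sites occur with probability fp.
        rate = p if occupied else fp
        y = sum(rng.random() < rate for _ in range(n_visits))
        counts[y] += 1
    return counts


def fit_standard_occupancy(counts, n_visits, step=0.01):
    """Grid-search MLE of (psi, p) for the standard model, which ignores false positives."""
    grid = [i * step for i in range(1, round(1 / step))]
    best_ll, best_psi, best_p = -math.inf, None, None
    for psi in grid:
        for p in grid:
            ll = 0.0
            for y, n in counts.items():
                # Occupied-site term; the binomial coefficient is a constant and is dropped.
                lik = psi * p**y * (1 - p) ** (n_visits - y)
                if y == 0:
                    lik += 1 - psi  # an unoccupied site can only yield zero detections
                ll += n * math.log(lik)
            if ll > best_ll:
                best_ll, best_psi, best_p = ll, psi, p
    return best_psi, best_p


def mean_bias(fp, n_reps=20, n_sites=400, n_visits=4, psi=0.4, p=0.5, seed=1):
    """Average (estimated psi - true psi) over replicate simulated data sets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        counts = simulate_counts(n_sites, n_visits, psi, p, fp, rng)
        psi_hat, _ = fit_standard_occupancy(counts, n_visits)
        estimates.append(psi_hat)
    return sum(estimates) / len(estimates) - psi


if __name__ == "__main__":
    for fp in (0.0, 0.02, 0.05, 0.10):
        print(f"false-positive rate {fp:.2f}: mean bias in psi = {mean_bias(fp):+.3f}")
```

Comparing the printed bias against a tolerance such as the 5% absolute bias threshold mentioned in the abstract indicates, under these assumed settings, which false-positive rates the standard estimator can tolerate. A fuller analysis would also fit a false-positive occupancy model and vary false-negative error, as the abstract describes.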
