Department of Entomology and Nematology, University of California Davis, One Shields Avenue, Davis, CA, USA.
J Econ Entomol. 2021 Aug 5;114(4):1842-1846. doi: 10.1093/jee/toab127.
Each year, consultants and field scouts working in commercial agriculture undertake a massive, decentralized data collection effort as they monitor insect populations to make real-time pest management decisions. These data, if integrated into a database, offer rich opportunities for applying big data or ecoinformatics methods in agricultural entomology research. However, questions have been raised about whether or not the underlying quality of these data is sufficiently high to be a foundation for robust research. Here I suggest that repeatability analysis can be used to quantify the quality of data collected from commercial field scouting, without requiring any additional data gathering by researchers. In this context, repeatability quantifies the proportion of total variance across all insect density estimates that is explained by differences across populations and is thus a measure of the underlying reliability of observations. Repeatability was moderately high for cotton fields scouted commercially for total Lygus hesperus Knight densities (R = 0.631) and further improved by accounting for observer effects (R = 0.697). Repeatabilities appeared to be somewhat lower than those computed for a comparable, but much smaller, researcher-generated data set. In general, the much larger sizes of ecoinformatics data sets are likely to more than compensate for modest reductions in measurement precision. Tools for evaluating data quality are important for building confidence in the growing applications of ecoinformatics methods.
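As an illustration of the quantity described above, repeatability can be written as R = s²_A / (s²_A + s²_W), where s²_A is the variance among populations and s²_W is the variance among repeated estimates within a population. The following is a minimal sketch using the classical one-way ANOVA estimator, not necessarily the method used in the study; the abstract does not state the estimation procedure, and the observer-adjusted value would call for a mixed model instead. The function name `repeatability` and the field data are hypothetical.

```python
from statistics import mean


def repeatability(groups):
    """ANOVA-based repeatability, R = s2_among / (s2_among + s2_within).

    `groups` maps a population (e.g. a field on a given date) to the list of
    density estimates recorded for it. Group sizes may be unequal.
    """
    samples = [g for g in groups.values() if len(g) > 0]
    k = len(samples)                      # number of populations
    n_total = sum(len(g) for g in samples)
    if k < 2 or n_total <= k:
        raise ValueError("need >= 2 populations and some repeated estimates")

    grand_mean = sum(sum(g) for g in samples) / n_total
    ss_among = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    ss_within = sum((x - mean(g)) ** 2 for g in samples for x in g)

    ms_among = ss_among / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective group size, which corrects for unequal numbers of estimates.
    n0 = (n_total - sum(len(g) ** 2 for g in samples) / n_total) / (k - 1)

    # Negative variance estimates are truncated at zero.
    var_among = max((ms_among - ms_within) / n0, 0.0)
    total = var_among + ms_within
    return var_among / total if total > 0 else float("nan")


# Hypothetical data: two density estimates for each of three fields.
example = {"field_1": [1.0, 3.0], "field_2": [5.0, 7.0], "field_3": [2.0, 2.5]}
r = repeatability(example)
```

Values near 1 indicate that most of the variation in density estimates reflects real differences among populations; values near 0 indicate that measurement noise dominates.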