Harwell Linda C, Vivian Deborah N, McLaughlin Michelle D, Hafner Stephen F
Gulf Ecology Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Gulf Breeze, Florida, USA.
Student Services Contractor, Oak Ridge Associated Universities, Oak Ridge, Tennessee, USA.
Front Environ Sci. 2019 Jun 4;7(Article 72):1-13. doi: 10.3389/fenvs.2019.00072.
The growing availability of public data is, in many ways, changing how research is conducted. Cloud-based information resources not only provide supplementary data to bolster traditional scientific activities (e.g., field studies, laboratory experiments), they also serve as the foundation for secondary data research projects such as indicator development. Indicators and indices are a convenient way to synthesize disparate information to address complex scientific concepts that are difficult to measure directly (e.g., resilience, sustainability, well-being). The current literature has no shortage of indicators and indices derived from secondary data, and a growing number are scientifically focused. However, little information is provided about the management approaches and best practices used to govern the data underpinning these efforts. From acquisition to storage and maintenance, secondary data research products rely on the availability of relevant, high-quality data, repeatable data handling methods, and a multi-faceted data flow process to promote and sustain research transparency and integrity. The U.S. Environmental Protection Agency recently published a report describing the development of a climate resilience screening index, which used over one million data points to calculate the final index. The pool of data was derived exclusively from secondary sources such as the U.S. Census Bureau, Bureau of Labor Statistics, Postal Service, Housing and Urban Development, Forestry Services, and others. Available data were presented in various forms, including portable document format (PDF), delimited ASCII, and proprietary formats (e.g., Microsoft Excel, ESRI ArcGIS). The strategy employed for managing these data in an indicator research and development effort represented a blend of business practices, information science, and the scientific method. This paper describes the approach, highlighting key points unique to managing the data assets of a smaller-scale research project in an era of "big data."