Soranno Patricia A, Bissell Edward G, Cheruvelil Kendra S, Christel Samuel T, Collins Sarah M, Fergus C Emi, Filstrup Christopher T, Lapierre Jean-Francois, Lottig Noah R, Oliver Samantha K, Scott Caren E, Smith Nicole J, Stopyak Scott, Yuan Shuai, Bremigan Mary Tate, Downing John A, Gries Corinna, Henry Emily N, Skaff Nick K, Stanley Emily H, Stow Craig A, Tan Pang-Ning, Wagner Tyler, Webster Katherine E
Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI 48824 USA.
Center for Limnology, University of Wisconsin-Madison, Madison, WI 53706 USA.
Gigascience. 2015 Jul 1;4:28. doi: 10.1186/s13742-015-0067-4. eCollection 2015.
Although there are considerable site-based data for individual ecosystems or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.
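As an illustration only, the following minimal Python sketch shows how heterogeneous source datasets might be mapped onto a common schema while recording provenance and a simple quality-control flag. It is not the LAGOS implementation: the source names (`agency_a`, `volunteer_b`), column names, the total phosphorus variable, the unit conversion factors, and the `max_tp` range limit are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical per-source metadata (not taken from LAGOS): which column holds
# which variable, and the factor converting the source unit to micrograms/litre.
SOURCE_METADATA = {
    "agency_a": {"lake_col": "LakeID", "date_col": "SampleDate",
                 "tp_col": "TP_mgL", "to_ug_per_l": 1000.0},
    "volunteer_b": {"lake_col": "lake", "date_col": "date",
                    "tp_col": "tp_ugl", "to_ug_per_l": 1.0},
}


@dataclass(frozen=True)
class Observation:
    """One total phosphorus value in the common (integrated) schema."""
    lake_id: str
    sample_date: str
    tp_ug_per_l: float
    source: str    # provenance: the dataset the value came from
    qc_flag: str   # "ok" or "out_of_range"


def harmonize(source, rows, max_tp=10000.0):
    """Map one source dataset onto the common schema.

    max_tp is an arbitrary illustrative upper bound, not a published limit.
    """
    meta = SOURCE_METADATA[source]
    harmonized = []
    for row in rows:
        value = float(row[meta["tp_col"]]) * meta["to_ug_per_l"]
        flag = "ok" if 0.0 <= value <= max_tp else "out_of_range"
        harmonized.append(Observation(
            lake_id=str(row[meta["lake_col"]]),
            sample_date=row[meta["date_col"]],
            tp_ug_per_l=value,
            source=source,
            qc_flag=flag,
        ))
    return harmonized


def integrate(datasets):
    """Concatenate all harmonized sources into one list of observations."""
    merged = []
    for source, rows in datasets.items():
        merged.extend(harmonize(source, rows))
    return merged


if __name__ == "__main__":
    example = {
        "agency_a": [{"LakeID": 17, "SampleDate": "2005-07-01", "TP_mgL": "0.025"}],
        "volunteer_b": [{"lake": "17", "date": "2005-07-02", "tp_ugl": -5}],
    }
    for obs in integrate(example):
        print(obs)
```

Keeping the per-source column mappings and unit factors in a metadata table, rather than in the conversion code, means a new dataset can be added by writing one more metadata entry. This loosely mirrors the role that authored metadata and documented provenance play in the integration procedures listed in the abstract.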