Vargas-Rojas Luis, Ting To-Chia, Rainey Katherine M, Reynolds Matthew, Wang Diane R
Department of Agronomy, Purdue University, West Lafayette, IN, United States.
Wheat Physiology Group, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico.
Front Plant Sci. 2024 Feb 21;15:1265073. doi: 10.3389/fpls.2024.1265073. eCollection 2024.
Advancements in phenotyping technology have enabled plant science researchers to gather large volumes of information from their experiments, especially those that evaluate multiple genotypes. To fully leverage these complex and often heterogeneous data sets (i.e., those that differ in format and structure), scientists must invest considerable time in data processing, and data management has emerged as a substantial barrier to downstream application. Here, we propose a pipeline, comprising two newly developed open-source programs, to enhance data collection, processing, and management in plant science studies. The first, called AgTC, is a series of programming functions that generates comma-separated values (CSV) file templates for collecting data in a standard format using either a lab-based computer or a mobile device. The second series of functions, AgETL, executes the steps of an extract, transform, and load (ETL) data integration process, in which data are extracted from heterogeneously formatted files, transformed to meet standard criteria, and loaded into a database. There, data are stored and can be accessed for data analysis-related processes, including dynamic data visualization through web-based tools. Both AgTC and AgETL can be applied flexibly across plant science experiments without requiring programming knowledge on the part of the domain scientist, and their functions are executed in Jupyter Notebook, a browser-based interactive development environment. Additionally, all parameters are easily customized from central configuration files written in the human-readable YAML format. Using three experiments from research laboratories in university and non-governmental organization (NGO) settings as test cases, we demonstrate the utility of AgTC and AgETL in streamlining critical steps from data collection to analysis in the plant sciences.
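The template-then-ETL workflow described above can be illustrated with a minimal Python sketch. This is not the AgTC or AgETL code: every function name, the column names, the table name, and the `CONFIG` dictionary (standing in for a YAML configuration file) are hypothetical, and SQLite is assumed only as a convenient stand-in database.

```python
import csv
import io
import sqlite3

# Hypothetical settings; in the tools described above these would be read
# from a central YAML configuration file rather than hard-coded.
CONFIG = {
    "table": "plot_measurements",
    "columns": ["plot_id", "genotype", "plant_height_cm"],
    # Maps the differing headers of source files onto the standard names.
    "rename": {
        "Plot": "plot_id", "plot": "plot_id",
        "Geno": "genotype", "genotype": "genotype",
        "Height (cm)": "plant_height_cm", "height_cm": "plant_height_cm",
    },
}


def make_template(columns, plot_ids):
    """Return CSV text with a standard header and one blank row per plot."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    for plot_id in plot_ids:
        writer.writerow([plot_id] + [""] * (len(columns) - 1))
    return buf.getvalue()


def extract(csv_text):
    """Extract: read CSV text into a list of dicts keyed by source header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, config):
    """Transform: rename headers to the standard and parse height as float."""
    records = []
    for row in rows:
        std = {config["rename"][k]: v.strip()
               for k, v in row.items() if k in config["rename"]}
        height = std.get("plant_height_cm")
        std["plant_height_cm"] = float(height) if height else None
        records.append(tuple(std.get(c) for c in config["columns"]))
    return records


def load(conn, records, config):
    """Load: create the table if needed and insert the standardized rows."""
    cols = ", ".join(config["columns"])
    marks = ", ".join("?" for _ in config["columns"])
    conn.execute(f"CREATE TABLE IF NOT EXISTS {config['table']} ({cols})")
    conn.executemany(
        f"INSERT INTO {config['table']} VALUES ({marks})", records)
    conn.commit()
    return len(records)


def run_etl(conn, sources, config):
    """Run extract, transform, and load over several CSV texts."""
    return sum(load(conn, transform(extract(text), config), config)
               for text in sources)
```

In this sketch, two files with different headers (for example `Plot,Geno,Height (cm)` and `plot,genotype,height_cm`) end up in the same table with identical column names, which is the kind of standardization the ETL step is meant to provide.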