Kar Soumyashree, Garin Vincent, Kholová Jana, Vadez Vincent, Durbha Surya S, Tanaka Ryokei, Iwata Hiroyoshi, Urban Milan O, Adinarayana J
Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay, Mumbai, India.
Crop Physiology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India.
Front Plant Sci. 2020 Nov 20;11:552509. doi: 10.3389/fpls.2020.552509. eCollection 2020.
The rapid development of phenotyping technologies in recent years has made it possible to study plant development over time. Processing the massive amount of data collected by high-throughput phenotyping (HTP) platforms remains, however, a major challenge for the plant science community. A key issue is to accurately estimate, over time, the genotypic component of the plant phenotype. In outdoor and field-based HTP platforms, phenotype measurements can be substantially affected by data-generation inaccuracies or failures, leading to erroneous or missing data. To address this problem, we developed an analytical pipeline composed of three modules: detection of outliers, imputation of missing values, and computation of mixed-model genotype adjusted means with spatial adjustment. The pipeline was tested on three traits (3D leaf area, projected leaf area, and plant height) in two crops (chickpea and sorghum) measured during two seasons. Using real-data analyses and simulations, we showed that applying the three pipeline steps sequentially was particularly useful for estimating smooth genotype growth curves from raw data containing a large amount of noise, a situation that is potentially frequent in data generated on outdoor HTP platforms. The proposed procedure can handle up to 50% missing values. It is also robust to contamination rates of 20 to 30% of the data. The pipeline was further extended to model the genotype time series data. A change-point analysis allowed the determination of growth phases and of the optimal timing at which genotypic differences were largest. The estimated genotypic values were used to cluster the genotypes during the optimal growth phase. A two-way analysis of variance (ANOVA) showed that the clusters were consistently defined throughout the growth duration. We could therefore show, over a wide range of scenarios, that the pipeline facilitates efficient extraction of useful information from outdoor HTP platform data. High-quality plant growth time series data are also provided to support breeding decisions. The R code of the pipeline is available at https://github.com/ICRISAT-GEMS/SpaTemHTP.
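As an illustration of the first two modules only (outlier detection followed by imputation of missing values), a minimal Python sketch is given below. It is not the SpaTemHTP implementation. The outlier rule (a modified z-score based on the median absolute deviation, computed per day across replicate plants) and the imputation rule (linear interpolation along each plant's time series) are stand-in techniques chosen for brevity. The function names, the threshold of 3.5, and the toy leaf-area values are all hypothetical. The third module, mixed-model adjusted means with spatial adjustment, is not covered.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Set values far from the median to None (modified z-score rule).

    Assumption: MAD-based rule used as a stand-in; `threshold` is illustrative.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 3:
        return list(values)  # too few points to judge
    med = median(observed)
    mad = median([abs(v - med) for v in observed])
    if mad == 0:
        return list(values)  # no spread, nothing can be flagged
    return [
        None if v is not None and 0.6745 * abs(v - med) / mad > threshold else v
        for v in values
    ]


def interpolate_gaps(series):
    """Fill None entries by linear interpolation between observed neighbours.

    Gaps at either end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def clean_trial(trial):
    """Apply outlier flagging per day (across plants), then impute per plant."""
    n_days = len(trial[0])
    per_day = [flag_outliers([plant[d] for plant in trial]) for d in range(n_days)]
    flagged = [[per_day[d][p] for d in range(n_days)] for p in range(len(trial))]
    return [interpolate_gaps(plant) for plant in flagged]


# Hypothetical leaf-area readings: rows are replicate plants, columns are days.
trial = [
    [10.0, 12.0, 14.0, 16.0],
    [11.0, 13.0, None, 17.0],   # one missing reading
    [10.5, 12.5, 14.5, 16.5],
    [9.5, 80.0, 13.5, 15.5],    # 80.0 mimics a sensor glitch
    [10.2, 12.2, 14.2, 16.2],
]
cleaned = clean_trial(trial)
```

In this toy trial the glitch value is flagged and replaced by interpolation, and the missing reading is filled from its two neighbours, while plants without problems are returned unchanged. Applying the steps in this order matters: imputing before removing outliers would let the glitch contaminate the values interpolated next to it.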