Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China.
Department of Statistics, Faculty of Science, Bangabandhu Sheikh Mujibur Rahman Science & Technology University, Gopalganj, 8100, Bangladesh.
Sci Rep. 2019 Dec 20;9(1):19526. doi: 10.1038/s41598-019-55609-6.
Statistical data mining (DM) and machine learning (ML) are promising tools for analyzing complex datasets. In recent decades, plant phenomics has become crucial to precision agriculture because it enables high-throughput phenotyping of local crop cultivars. An integrated or new analytical approach is therefore needed to handle phenomics data. We propose a statistical framework for the analysis of phenomics data that integrates DM and ML methods. To validate the proposed approach, four widely used supervised ML methods, Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machines with linear (SVM-l) and radial basis (SVM-r) kernels, were used to classify and predict plant status (stress/non-stress). Several simulated and real plant phenotype datasets were analyzed. The results show that the features selected by the proposed approach contributed significantly throughout the analysis. In this study, we show that the proposed approach reduced the complexity of phenotype data analysis, reduced the computational time of the ML algorithms, and increased prediction accuracy.
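The general idea of selecting features before classifying plant status can be illustrated with a minimal sketch. The abstract does not state which selection procedure or software the authors used, so everything here is an assumption for illustration: the simulated data, the Welch t-statistic screening step, and a nearest-centroid classifier standing in for LDA, RF, and the SVMs. All function names (`simulate`, `select_features`, `accuracy`) are hypothetical.

```python
import random
import statistics


def simulate(n_per_class, n_features, n_informative, shift, rng):
    """Return (rows, labels); label 1 = stressed, 0 = non-stressed (simulated)."""
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                # Assumption: only the first n_informative traits respond to stress.
                for j in range(n_informative):
                    row[j] += shift
            rows.append(row)
            labels.append(label)
    return rows, labels


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def select_features(rows, labels, k):
    """Rank traits by |t| between the two classes and keep the top k indices."""
    scores = []
    for j in range(len(rows[0])):
        stressed = [r[j] for r, y in zip(rows, labels) if y == 1]
        control = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((abs(welch_t(stressed, control)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def fit_centroids(rows, labels, features):
    """Per-class mean vector restricted to the chosen traits."""
    return {
        y: [statistics.mean(r[j] for r, lab in zip(rows, labels) if lab == y)
            for j in features]
        for y in (0, 1)
    }


def predict(centroids, row, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(features, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def accuracy(train, test, features):
    """Fit on the training split, report the hit rate on the test split."""
    centroids = fit_centroids(*train, features)
    rows, labels = test
    hits = sum(predict(centroids, r, features) == y for r, y in zip(rows, labels))
    return hits / len(rows)


rng = random.Random(0)
train = simulate(40, 30, 3, 2.0, rng)   # 40 plants per class, 30 traits, 3 informative
test = simulate(40, 30, 3, 2.0, rng)

# Screening uses the training split only, so the test split stays untouched.
selected = select_features(*train, k=3)
acc_all = accuracy(train, test, list(range(30)))
acc_selected = accuracy(train, test, selected)

print("selected traits:", selected)
print("accuracy, all traits:", acc_all)
print("accuracy, selected traits:", acc_selected)
```

Screening on the training split alone keeps the held-out accuracy honest. Whether the reduced trait set matches or beats the full set depends on the data, so the sketch prints both figures rather than assuming an outcome.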