School of Plant Sciences, University of Arizona, Tucson, Arizona 85721-0036.
Plant Cell. 2014 Feb;26(2):520-37. doi: 10.1105/tpc.113.121913. Epub 2014 Feb 11.
Machine learning (ML) is an intelligent data mining technique that builds a prediction model by learning from prior knowledge in order to recognize patterns in large-scale data sets. We present an ML-based methodology for transcriptome analysis via comparison of gene coexpression networks, implemented as an R package called machine learning-based differential network analysis (mlDNA), and apply this method to reanalyze a set of abiotic stress expression data in Arabidopsis thaliana. mlDNA first used an ML-based filtering process to remove nonexpressed, constitutively expressed, or non-stress-responsive "noninformative" genes prior to network construction, by learning the patterns of 32 expression characteristics of known stress-related genes. The retained "informative" genes were subsequently analyzed by ML-based network comparison, based on 33 network topological characteristics, to predict candidate stress-related genes showing expression and network differences between the control and stress networks. Comparative evaluation of the network-centric and gene-centric analytic methods showed that mlDNA substantially outperformed traditional statistical testing-based differential expression analysis at identifying stress-related genes, with markedly improved prediction accuracy. To experimentally validate the mlDNA predictions, we selected for phenotypic screening 89 candidates with available SALK T-DNA mutagenesis lines out of the 1784 predicted salt stress-related genes, and identified two previously unreported genes, mutants of which showed salt-sensitive phenotypes.
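The network comparison step described above can be illustrated with a minimal Python sketch. It is not the mlDNA implementation and does not use the mlDNA R package's API. It assumes a coexpression network built by thresholding absolute Pearson correlation, and it compares a single topological characteristic (node degree) between control and stress networks, where the abstract refers to 33 characteristics fed to an ML model. The function names, the 0.8 threshold, and the toy expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A constant profile carries no coexpression signal.
        return 0.0
    return cov / sqrt(vx * vy)


def coexpression_degree(expr, threshold=0.8):
    """Degree of each gene in a network with an edge where |r| >= threshold.

    `expr` maps gene name -> list of expression values across samples.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    genes = sorted(expr)
    degree = {g: 0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expr[g], expr[h])) >= threshold:
                degree[g] += 1
                degree[h] += 1
    return degree


def degree_difference(control, stress, threshold=0.8):
    """Per-gene change in degree (stress minus control) for shared genes."""
    dc = coexpression_degree(control, threshold)
    ds = coexpression_degree(stress, threshold)
    return {g: ds[g] - dc[g] for g in dc if g in ds}


# Toy data (hypothetical): geneA and geneB are coexpressed under control,
# while geneA and geneC are coexpressed under stress.
control = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
stress = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [5.0, 1.0, 4.0, 2.0],
    "geneC": [2.0, 4.0, 6.0, 8.1],
}

diff = degree_difference(control, stress)
print(diff)
```

In this toy case geneB loses its only connection under stress and geneC gains one, while geneA keeps the same degree even though its partner changes. A single feature such as degree misses that kind of rewiring, which is one reason to combine many topological characteristics, as the abstract describes.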