School of Computing, University of Southern Mississippi, Hattiesburg, Mississippi, United States of America.
PLoS One. 2010 Oct 28;5(10):e13715. doi: 10.1371/journal.pone.0013715.
Monitoring, assessment and prediction of the environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with the explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset and construct classifier models that separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-means clustering method were applied independently to refine the classifier, producing smaller subsets of 39 and 30 classifier genes, respectively, with the 11 genes common to both being potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracies of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high dimensional datasets and generate classification models of acceptable precision for multiple classes.
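The filtering and subset-combination steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a one-way ANOVA F statistic as the univariate class-comparison score, which the abstract does not specify, and every gene name, expression value, and identifier below is a hypothetical placeholder. The second part only reproduces the arithmetic of merging a 39-gene and a 30-gene subset that share 11 genes.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of per-class value lists."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing F statistic across classes."""
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [
            [v for v, lab in zip(values, labels) if lab == c] for c in classes
        ]
        scores[gene] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumed): two samples per class, three made-up genes.
labels = ["control", "control", "TNT", "TNT", "RDX", "RDX"]
expression = {
    "gene_a": [1.0, 1.1, 3.0, 3.1, 1.0, 0.9],  # shifts under TNT
    "gene_b": [2.0, 2.1, 2.0, 1.9, 5.0, 5.2],  # shifts under RDX
    "gene_c": [1.0, 2.0, 1.5, 1.0, 2.0, 1.5],  # no clear class signal
}
ranking = rank_genes(expression, labels)

# Merging two refined subsets: placeholder IDs chosen so that the
# subsets have 39 and 30 members and overlap in exactly 11.
svm_genes = {f"g{i}" for i in range(39)}
kmeans_genes = {f"g{i}" for i in range(28, 58)}
combined = svm_genes | kmeans_genes
common = svm_genes & kmeans_genes
print(ranking, len(combined), len(common))
```

In this toy run the uninformative gene ranks last, and the merged subset has 39 + 30 - 11 = 58 members, matching the counts reported in the abstract.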