Cavill Rachel, Keun Hector C, Holmes Elaine, Lindon John C, Nicholson Jeremy K, Ebbels Timothy M D
Department of Biomolecular Medicine, Division of Surgery, Oncology, Reproductive Biology and Anaesthetics, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, South Kensington, London, SW7 2AZ, UK.
Bioinformatics. 2009 Jan 1;25(1):112-8. doi: 10.1093/bioinformatics/btn586. Epub 2008 Nov 14.
Metabolic profiles derived from high-resolution ¹H-NMR data are complex; statistical and machine learning approaches are therefore vital for extracting useful information and biological insights. Focused modelling on targeted subsets of metabolites and samples can improve the predictive ability of models, and techniques such as genetic algorithms (GAs) have proven utility in feature selection problems. The Consortium for Metabonomic Toxicology (COMET) obtained temporal NMR spectra of urine from rats treated with model toxins and stressors. Here, we develop a GA approach which simultaneously selects sets of samples and spectral regions from the COMET database to build robust, predictive classifiers of liver and kidney toxicity.
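The idea of selecting samples and spectral regions at the same time can be sketched as a GA over a single binary chromosome. The following is a minimal illustration under stated assumptions, not the authors' implementation: the chromosome layout (sample bits first, then variable bits), the nearest-centroid classifier, the validation-accuracy fitness, and all names and parameters (`fitness`, `run_ga`, `pop_size`, `mutation_rate`) are hypothetical choices made for this example.

```python
import numpy as np

def fitness(mask, X_train, y_train, X_val, y_val):
    """Validation accuracy of a nearest-centroid classifier built only from
    the training samples and variables switched on in `mask`.
    Assumed layout: first n_samples bits pick rows, the rest pick columns."""
    n_samples = X_train.shape[0]
    rows = np.asarray(mask[:n_samples]).astype(bool)
    cols = np.asarray(mask[n_samples:]).astype(bool)
    if not cols.any():
        return 0.0  # no variables selected
    classes = np.unique(y_train)
    # Every class needs at least one selected training sample.
    if any(not np.any(rows & (y_train == c)) for c in classes):
        return 0.0
    centroids = np.stack(
        [X_train[rows & (y_train == c)][:, cols].mean(axis=0) for c in classes]
    )
    diffs = X_val[:, cols][:, None, :] - centroids[None, :, :]
    pred = classes[(diffs ** 2).sum(axis=2).argmin(axis=1)]
    return float((pred == y_val).mean())

def run_ga(X_train, y_train, X_val, y_val,
           pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Evolve sample+variable masks; return the best mask seen and its fitness."""
    rng = np.random.default_rng(seed)
    n_genes = X_train.shape[0] + X_train.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        fits = np.array([fitness(ind, X_train, y_train, X_val, y_val)
                         for ind in pop])
        if fits.max() > best_fit:
            best_fit = float(fits.max())
            best_mask = pop[fits.argmax()].copy()
        # Tournament selection: draw two individuals, keep the fitter one.
        a = rng.integers(0, pop_size, size=pop_size)
        b = rng.integers(0, pop_size, size=pop_size)
        parents = pop[np.where(fits[a] >= fits[b], a, b)]
        # Uniform crossover with a shifted copy of the parent pool.
        partners = np.roll(parents, 1, axis=0)
        swap = rng.random(parents.shape) < 0.5
        children = np.where(swap, parents, partners)
        # Bit-flip mutation.
        flip = rng.random(children.shape) < mutation_rate
        children = np.where(flip, 1 - children, children)
        children[0] = best_mask  # elitism: carry the best mask forward
        pop = children
    return best_mask, best_fit
```

Encoding both selections in one chromosome lets a single search trade off which samples to train on against which variables to use, rather than running two separate optimizations. In practice the fitness would normally be estimated with cross-validation rather than a single validation split.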
The results indicate that simultaneous sample and variable selection improved performance by over 9% compared with either method alone. Simultaneous selection also halved computation time. Successful classifiers repeatedly selected particular variables, indicating that this approach can aid in defining biomarkers of toxicity. Novel visualizations of the results from multiple computations were developed to help interpret which samples and variables were frequently selected. This method provides an efficient way to determine the most discriminatory variables and samples for any post-genomic dataset.
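One simple way to see which samples and variables are frequently selected across multiple computations is to tally how often each bit is switched on in the best masks from repeated runs. The sketch below assumes each run yields a binary mask with sample bits first and variable bits after; that layout and the helper names (`selection_frequency`, `top_k`) are hypothetical and are not taken from the published code.

```python
import numpy as np

def selection_frequency(masks, n_samples):
    """Fraction of runs in which each sample and each variable was selected.
    `masks` is a (n_runs, n_samples + n_variables) array of 0/1 values."""
    freq = np.asarray(masks, dtype=float).mean(axis=0)
    return freq[:n_samples], freq[n_samples:]

def top_k(freq, k):
    """Indices of the k most frequently selected items (ties keep index order)."""
    return np.argsort(-freq, kind="stable")[:k].tolist()
```

Variables with a selection frequency close to 1 across runs are natural candidates to inspect as potential biomarkers, while the sample frequencies show which observations the classifiers relied on most.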
GA code available from http://www1.imperial.ac.uk/medicine/people/r.cavill/