Reliable gene signatures for microarray classification: assessment of stability and performance.

Author information

Davis Chad A, Gerick Fabian, Hintermair Volker, Friedel Caroline C, Fundel Katrin, Küffner Robert, Zimmer Ralf

Affiliation

Institute of Informatics, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 Munich, Germany.

Publication information

Bioinformatics. 2006 Oct 1;22(19):2356-63. doi: 10.1093/bioinformatics/btl400. Epub 2006 Jul 31.

Abstract

MOTIVATION

Two important questions for the analysis of gene expression measurements from different sample classes are (1) how to classify samples and (2) how to identify meaningful gene signatures (ranked gene lists) exhibiting the differences between classes and sample subsets. Solutions to both questions have immediate biological and biomedical applications. To achieve optimal classification performance, a suitable combination of classifier and gene selection method needs to be specifically selected for a given dataset. The selected gene signatures can be unstable and the resulting classification accuracy unreliable, particularly when considering different subsets of samples. Both unstable gene signatures and overestimated classification accuracy can impair biological conclusions.

METHODS

We address these two issues by repeatedly evaluating the classification performance of all models, i.e. pairwise combinations of various gene selection and classification methods, for random subsets of arrays (sampling). A model score is used to select the most appropriate model for the given dataset. Consensus gene signatures are constructed by extracting those genes frequently selected over many samplings. Sampling additionally permits measurement of the stability of the classification performance for each model, which serves as a measure of model reliability.
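The sampling procedure described above can be illustrated with a minimal sketch. It is not the authors' R package, and every specific in it is an assumption made for illustration: the gene selector (spread of per-class means), the classifier (nearest centroid), the stratified 70/30 split, the 50 samplings, the 0.8 selection-frequency threshold, and all function names are hypothetical. The abstract does not define the model score, so the sketch only reports the mean and standard deviation of accuracy for one model; a full analysis would presumably repeat this for every selector/classifier pair and compare the results.

```python
import random
import statistics
from collections import Counter

def select_by_mean_difference(X, y, rows, k):
    """Hypothetical selector: rank genes by the spread of per-class means on the training rows."""
    classes = sorted({y[i] for i in rows})
    def score(g):
        means = [statistics.mean(X[i][g] for i in rows if y[i] == c) for c in classes]
        return max(means) - min(means)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def nearest_centroid(X, y, train, test, genes):
    """Hypothetical classifier: assign each test row to the closest class centroid."""
    centroids = {
        c: [statistics.mean(X[i][g] for i in train if y[i] == c) for g in genes]
        for c in sorted({y[i] for i in train})
    }
    def dist(row, c):
        return sum((X[row][g] - m) ** 2 for g, m in zip(genes, centroids[c]))
    return [min(centroids, key=lambda c: dist(i, c)) for i in test]

def evaluate_model(X, y, selector, classifier, k=2, n_samplings=50, train_frac=0.7, seed=0):
    """Evaluate one model (selector + classifier) on repeated random subsets of arrays."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    accuracies, times_selected = [], Counter()
    for _ in range(n_samplings):
        train, test = [], []
        for members in by_class.values():  # stratified split: every class is in both parts
            shuffled = rng.sample(members, len(members))
            cut = max(1, int(train_frac * len(members)))
            train += shuffled[:cut]
            test += shuffled[cut:]
        genes = selector(X, y, train, k)   # selection uses training arrays only
        times_selected.update(genes)
        predicted = classifier(X, y, train, test, genes)
        accuracies.append(sum(p == y[i] for p, i in zip(predicted, test)) / len(test))
    # Mean accuracy estimates performance; its spread across samplings reflects stability.
    return statistics.mean(accuracies), statistics.stdev(accuracies), times_selected

def consensus_signature(times_selected, n_samplings, min_freq=0.8):
    """Keep the genes selected in at least min_freq of all samplings."""
    return sorted(g for g, n in times_selected.items() if n / n_samplings >= min_freq)

# Toy data (fabricated for the demo): 20 arrays, 6 genes, genes 0 and 1 differ between classes.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[(6.0 * label if g < 2 else 0.0) + data_rng.uniform(-1, 1) for g in range(6)]
     for label in y]

mean_acc, sd_acc, counts = evaluate_model(X, y, select_by_mean_difference, nearest_centroid)
signature = consensus_signature(counts, n_samplings=50)
print(mean_acc, sd_acc, signature)
```

Because gene selection runs inside each sampling on the training arrays alone, the held-out arrays never influence which genes are chosen, which is one way to avoid the overestimated accuracy the abstract warns about.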

RESULTS

We analyzed a large gene expression dataset with 78 measurements of four different cartilage sample classes. Classifiers trained on subsets of measurements frequently produce models with highly variable performance. Our approach provides reliable classification performance estimates via sampling. In addition to reliable classification performance, we determined stable consensus signatures (i.e. gene lists) for sample classes. Manual literature screening showed that these genes are highly relevant to our gene expression experiment with osteoarthritic cartilage. We compared our approach to others based on a publicly available dataset on breast cancer.

AVAILABILITY

R package at http://www.bio.ifi.lmu.de/~davis/edaprakt
