Bø Trond, Jonassen Inge
Department of Informatics, University of Bergen, N-5020 Bergen, Norway.
Genome Biol. 2002;3(4):RESEARCH0017. doi: 10.1186/gb-2002-3-4-research0017. Epub 2002 Mar 14.
Methods for extracting useful information from the datasets produced by microarray experiments are currently of great interest. Here we present new methods for finding gene sets that are well suited to distinguishing experiment classes, such as healthy versus diseased tissues. Our methods evaluate genes in pairs, scoring how well each pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared their performance with that of two standard methods. To assess the ability to generalize class differences, we studied how well the selected gene sets are suited for learning a classifier.
We show that the gene sets selected by our methods outperform those selected by the standard methods, in some cases by a large margin, in terms of the cross-validation prediction accuracy of the learned classifier. On two public datasets, accurate diagnoses can be made using only 15-30 genes. Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes.
When looking for differential expression between experiment classes, it may not be sufficient to examine each gene in isolation. Evaluating combinations of genes reveals interesting information that would not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information.
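The pair-based evaluation described above can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so everything here is an assumption made for illustration: each gene pair is projected onto a diagonal-linear-discriminant axis and the projected values are scored with an unequal-variance t-statistic. The function names (`split`, `t_score`, `dld_weight`, `pair_score`, `rank_pairs`) and the toy expression values are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import inf, sqrt
from statistics import mean, variance


def split(values, labels):
    """Split one vector of values into class 0 and class 1 samples."""
    return ([v for v, y in zip(values, labels) if y == 0],
            [v for v, y in zip(values, labels) if y == 1])


def t_score(a, b):
    """Absolute unequal-variance t-statistic between two samples.

    Each sample needs at least two values (statistics.variance requires it).
    """
    diff = abs(mean(a) - mean(b))
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return inf if diff > 0 else 0.0
    return diff / se


def dld_weight(values, labels):
    """Assumed per-gene weight: class mean difference over pooled variance."""
    a, b = split(values, labels)
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) \
        / (len(a) + len(b) - 2)
    return (mean(b) - mean(a)) / pooled if pooled > 0 else 0.0


def pair_score(x, y, labels):
    """Score a gene pair by t-scoring its projection onto the weighted axis."""
    wx, wy = dld_weight(x, labels), dld_weight(y, labels)
    projected = [wx * xi + wy * yi for xi, yi in zip(x, y)]
    return t_score(*split(projected, labels))


def rank_pairs(expr, labels):
    """Return (score, gene1, gene2) tuples for all pairs, best first."""
    scored = [(pair_score(expr[g], expr[h], labels), g, h)
              for g, h in combinations(sorted(expr), 2)]
    return sorted(scored, reverse=True)


# Hypothetical toy data: six samples, three per class.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "a": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    "b": [3.0, 2.0, 1.5, 4.0, 3.0, 2.5],
    "c": [2.0, 2.5, 1.5, 2.2, 1.8, 2.4],
}
ranked = rank_pairs(expr, labels)
single = {g: t_score(*split(v, labels)) for g, v in expr.items()}
print(ranked[0], single)
```

In this toy data, genes `a` and `b` each separate the classes only weakly on their own, but their within-class variation largely cancels when they are combined, so the pair ranks first with a much higher score than either gene alone. This is the kind of information that single-gene ranking would miss.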