Phan John H, Yin-Goen Qiqin, Young Andrew N, Wang May D
Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, 313 Ferst Drive, Atlanta, GA 30332, USA.
Pac Symp Biocomput. 2009:427-38.
Identifying and validating biomarkers from high-throughput gene expression data is important for understanding and treating cancer. Typically, we identify candidate biomarkers as features that are differentially expressed between two or more classes of samples. Many feature selection metrics rely on ranking by some measure of differential expression. However, interpreting these results is difficult because many algorithms and metrics exist, and each may produce a different ranking. Moreover, a feature ranking metric may work well on some datasets but perform considerably worse on others. We propose a method to choose an optimal feature ranking metric for each individual dataset. A metric is optimal if, for a particular dataset, it favorably ranks features that are known to be relevant biomarkers. Extensive knowledge of biomarker candidates is available in public databases and literature. Using this knowledge, we can choose a ranking metric that produces the most biologically meaningful results. In this paper, we first describe a framework for assessing the ability of a ranking metric to detect known relevant biomarkers. We then apply this method to clinical renal cancer microarray data to choose an optimal metric and identify several candidate biomarkers.
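The selection idea in the abstract (prefer the ranking metric that best recovers known biomarkers on a given dataset) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two candidate metrics (an absolute Welch t-statistic and an absolute difference of class means), the scoring rule (the fraction of known markers that land in the top `top_k` ranked features), the function names, and the toy data. The abstract does not specify how the authors score a metric, so this should be read as one plausible instance of the approach rather than their method.

```python
import numpy as np

def t_statistic(X, y):
    """Absolute Welch t-statistic per feature (columns of X), two classes 0/1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def fold_change(X, y):
    """Absolute difference of class means (assumes log-scale expression)."""
    return np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))

def known_marker_score(scores, known_idx, top_k):
    """Hypothetical scoring rule: fraction of known markers in the top_k features."""
    top = np.argsort(-scores)[:top_k]  # indices of the highest-scoring features
    return len(set(top.tolist()) & set(known_idx)) / len(known_idx)

def choose_metric(X, y, known_idx, metrics, top_k):
    """Score every candidate metric on this dataset and return the best one."""
    results = {name: known_marker_score(fn(X, y), known_idx, top_k)
               for name, fn in metrics.items()}
    best = max(results, key=results.get)
    return best, results

# Toy data (invented): rows are samples, columns are features.
# Features 0 and 1 play the role of known biomarkers: small but consistent shift.
# Features 2 and 3 are noisy decoys: larger mean shift but very high variance.
X = np.array([
    [1.0,  0.0,  0.0,  0.0],
    [1.1,  0.1,  4.0,  5.0],
    [0.9, -0.1, -4.0, -5.0],
    [2.0,  1.0,  2.0,  3.0],
    [2.1,  1.1,  6.0,  8.0],
    [1.9,  0.9, -2.0, -2.0],
])
y = np.array([0, 0, 0, 1, 1, 1])
known_idx = [0, 1]

metrics = {"t_statistic": t_statistic, "fold_change": fold_change}
best, results = choose_metric(X, y, known_idx, metrics, top_k=2)
print(best, results)
```

On this toy dataset the t-statistic places both known markers in the top two positions, while the mean-difference metric ranks the noisy decoys first, so the t-statistic would be selected. On a different dataset the preference could reverse, which is the motivation for choosing the metric per dataset.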