Gormley Michael, Dampier William, Ertel Adam, Karacali Bilge, Tozeren Aydin
School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA.
BMC Bioinformatics. 2007 Oct 26;8:415. doi: 10.1186/1471-2105-8-415.
Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we used an iterative machine learning algorithm to create populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against that of profiles correlated with relapse-free status. The prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from a) all genes on the microarray platform and b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms.
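As an illustration of how a ROC curve can be reduced to a single prediction-error figure, the sketch below computes the area under the ROC curve (AUC) with the Mann-Whitney rank statistic. Taking prediction error as 1 - AUC is an assumption made for this example; the abstract does not state the exact error measure, and the function names `roc_auc` and `prediction_error` are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    scores: decision-rule outputs, higher means "more likely positive".
    labels: 1 for the positive class, 0 for the negative class.
    Tied scores count as half a correct ordering.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def prediction_error(scores, labels):
    """Assumed error measure for this sketch: 1 - AUC."""
    return 1.0 - roc_auc(scores, labels)


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

An AUC of 0.5 corresponds to a decision rule that orders samples no better than chance, which is the natural baseline for the poorly discriminative rules discussed in the abstract.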
Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform.
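The comparison between supervised and random selection can be sketched on simulated data. Everything in this example is an assumption chosen for illustration and not the authors' pipeline: Gaussian expression values, 200 genes of which 20 are informative, a Welch-style t statistic as the univariate filter, and a nearest-centroid decision rule. All function names are hypothetical.

```python
import random
import statistics


def simulate(n_per_class, n_genes, n_informative, shift, rng):
    """Gaussian expression matrix; the first n_informative genes are
    shifted upward by `shift` in class 1."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift if label == 1 and g < n_informative else 0.0, 1.0)
                      for g in range(n_genes)])
            y.append(label)
    return X, y


def t_statistics(X, y):
    """Welch-style t statistic per gene (class 1 minus class 0)."""
    out = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 1]
        b = [row[g] for row, lab in zip(X, y) if lab == 0]
        se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
        out.append((statistics.mean(a) - statistics.mean(b)) / se)
    return out


def auc(scores, labels):
    """Fraction of positive/negative pairs ordered correctly (ties count half)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_auc(genes, X_tr, y_tr, X_te, y_te):
    """Train a nearest-centroid rule on `genes`, return its test-set AUC."""
    weights = []
    for g in genes:
        m1 = statistics.mean(row[g] for row, lab in zip(X_tr, y_tr) if lab == 1)
        m0 = statistics.mean(row[g] for row, lab in zip(X_tr, y_tr) if lab == 0)
        weights.append((g, m1 - m0, (m1 + m0) / 2.0))
    # Positive score means the sample lies closer to the class 1 centroid.
    scores = [sum(d * (row[g] - mid) for g, d, mid in weights) for row in X_te]
    return auc(scores, y_te)


def compare(ks, n_random_draws=20, seed=0):
    """Return {k: (supervised AUC, mean AUC over random gene sets)}."""
    rng = random.Random(seed)
    n_genes = 200
    X_tr, y_tr = simulate(20, n_genes, 20, 1.5, rng)
    X_te, y_te = simulate(20, n_genes, 20, 1.5, rng)
    t = t_statistics(X_tr, y_tr)
    ranked = sorted(range(n_genes), key=lambda g: abs(t[g]), reverse=True)
    results = {}
    for k in ks:
        supervised = centroid_auc(ranked[:k], X_tr, y_tr, X_te, y_te)
        rand = statistics.mean(
            centroid_auc(rng.sample(range(n_genes), k), X_tr, y_tr, X_te, y_te)
            for _ in range(n_random_draws))
        results[k] = (supervised, rand)
    return results


if __name__ == "__main__":
    for k, (sup, rand) in compare([5, 20, 50, 200]).items():
        print(f"k={k:3d}  supervised AUC={sup:.3f}  random AUC={rand:.3f}")
```

In this toy setting the two selection strategies coincide once k equals the total number of genes, so any gap between them necessarily closes at that point. How quickly it closes on real microarray data depends on how many genes carry signal, which this simulation fixes arbitrarily.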
Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning approaches. These findings are relevant to the use of molecular profiling for the identification of candidate biomarker panels.