Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, USA.
BMC Bioinformatics. 2011 Dec 1;12:463. doi: 10.1186/1471-2105-12-463.
Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase in mean probe set expression in perturbed compared with unperturbed data (signature strength), and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte Carlo cross-validation.
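The spike-in design and its evaluation can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' pipeline: the data are simulated Gaussian log2 values rather than the real data sets, the classifier is a simple nearest-centroid rule on the top-ranked features, the splits are stratified, and the score is balanced accuracy. The function names (`simulate_expression`, `spike_signature`, `fit_predict`, `monte_carlo_cv`) are hypothetical.

```python
import math
import random
from statistics import mean


def simulate_expression(n_samples, n_probes, rng, sd=1.0):
    """Hypothetical log2 expression matrix: rows are samples, columns are probe sets."""
    return [[rng.gauss(8.0, sd) for _ in range(n_probes)] for _ in range(n_samples)]


def spike_signature(data, n_spiked_samples, signature_size, fold, rng):
    """Raise `signature_size` probe sets by `fold` in `n_spiked_samples` samples.

    On the log2 scale a fold increase is an additive shift of log2(fold).
    Returns a perturbed copy and 0/1 labels (1 = perturbed sample).
    """
    n_samples, n_probes = len(data), len(data[0])
    spiked_samples = set(rng.sample(range(n_samples), n_spiked_samples))
    spiked_probes = rng.sample(range(n_probes), signature_size)
    shift = math.log2(fold)
    out = [row[:] for row in data]
    for i in spiked_samples:
        for j in spiked_probes:
            out[i][j] += shift
    labels = [1 if i in spiked_samples else 0 for i in range(n_samples)]
    return out, labels


def fit_predict(train_x, train_y, test_x, n_features):
    """Rank features on the training split only, then classify by nearest centroid."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    n_probes = len(train_x[0])
    mu_pos = [mean(r[j] for r in pos) for j in range(n_probes)]
    mu_neg = [mean(r[j] for r in neg) for j in range(n_probes)]
    ranked = sorted(range(n_probes), key=lambda j: abs(mu_pos[j] - mu_neg[j]), reverse=True)
    top = ranked[:n_features]
    preds = []
    for x in test_x:
        d_pos = sum((x[j] - mu_pos[j]) ** 2 for j in top)
        d_neg = sum((x[j] - mu_neg[j]) ** 2 for j in top)
        preds.append(1 if d_pos < d_neg else 0)
    return preds


def monte_carlo_cv(x, y, n_splits, test_fraction, n_features, rng):
    """Repeated random stratified train/test splits; returns mean balanced accuracy."""
    pos_idx = [i for i, v in enumerate(y) if v == 1]
    neg_idx = [i for i, v in enumerate(y) if v == 0]
    scores = []
    for _ in range(n_splits):
        test = set()
        for group in (pos_idx, neg_idx):
            k = max(1, round(test_fraction * len(group)))
            test.update(rng.sample(group, k))
        train = [i for i in range(len(y)) if i not in test]
        held_out = sorted(test)
        preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                            [x[i] for i in held_out], n_features)
        truth = [y[i] for i in held_out]
        sens = mean(1.0 if p == 1 else 0.0 for p, t in zip(preds, truth) if t == 1)
        spec = mean(1.0 if p == 0 else 0.0 for p, t in zip(preds, truth) if t == 0)
        scores.append((sens + spec) / 2)
    return mean(scores)


rng = random.Random(0)
base = simulate_expression(n_samples=80, n_probes=200, rng=rng)
# 10% of samples perturbed, 10 probe sets, 4-fold increase (all illustrative values).
x, y = spike_signature(base, n_spiked_samples=8, signature_size=10, fold=4.0, rng=rng)
score = monte_carlo_cv(x, y, n_splits=20, test_fraction=0.25, n_features=10, rng=rng)
print(f"mean balanced accuracy: {score:.2f}")
```

Feature ranking is done inside each split so that the held-out samples never influence which probe sets are chosen. Varying `fold`, `signature_size`, and `n_spiked_samples` reproduces the three factors the study varied.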
Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was greater than 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for nine real clinical prediction problems in six different breast cancer data sets.
We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.
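As a rough illustration of the 2-fold criterion, the sketch below counts probe sets whose mean expression differs by at least a chosen fold between two groups. It assumes log2-scale values, so the fold is a ratio of geometric means and a 2-fold difference corresponds to a mean difference of 1. The function name `count_informative_features` and the toy values are hypothetical, and the authors' actual assessment of signature size and strength may well have involved more than this threshold.

```python
import math
from statistics import mean


def count_informative_features(group_a, group_b, min_fold=2.0):
    """Count probe sets whose mean log2 expression differs by at least log2(min_fold).

    Each group is a list of samples; each sample is a list of log2 values per probe set.
    """
    threshold = math.log2(min_fold)
    n_probes = len(group_a[0])
    count = 0
    for j in range(n_probes):
        diff = mean(row[j] for row in group_a) - mean(row[j] for row in group_b)
        if abs(diff) >= threshold:
            count += 1
    return count


# Toy values for three probe sets in two samples per group (illustrative only).
group_a = [[10.0, 8.0, 7.0], [10.5, 8.2, 7.0]]
group_b = [[8.0, 8.1, 8.5], [8.5, 7.9, 8.5]]
print(count_informative_features(group_a, group_b))
```

A small count for a given comparison would suggest that few features meet the abstract's threshold for being highly informative.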