van Vliet Martin H, Klijn Christiaan N, Wessels Lodewyk F A, Reinders Marcel J T
Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands.
PLoS One. 2007 Oct 17;2(10):e1047. doi: 10.1371/journal.pone.0001047.
Large collections of microarray datasets (compendia) and knowledge about the grouping of genes into pathways (gene sets) are typically not exploited when training predictors of disease outcome. Both can be useful: a compendium increases the number of samples, while gene sets reduce the size of the feature space. This should be favorable from a machine learning perspective and result in more robust predictors.
We extracted modules of regulated genes from gene sets and compendia. Through supervised analysis, we constructed predictors that employ modules predictive of breast cancer outcome. To validate these predictors, we applied them to independent data from the same institution (intra-dataset) and from other institutions (inter-dataset).
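As a rough illustration of the module-based idea, the sketch below collapses each gene set into one feature per sample and trains a simple classifier on those features. Everything in it is an assumption made for illustration: the abstract does not specify how modules are scored or which classifier is used, so the mean-expression score, the nearest-centroid rule, the function names, and the toy gene and sample identifiers are all hypothetical.

```python
from statistics import mean


def module_scores(expression, gene_sets):
    """Collapse gene-level expression into one value per gene set.

    expression: {sample: {gene: value}}
    gene_sets:  {set_name: [gene, ...]}
    Assumption: a module's activity is the mean expression of its
    member genes that were actually measured in the sample.
    """
    scores = {}
    for sample, profile in expression.items():
        scores[sample] = {}
        for name, genes in gene_sets.items():
            measured = [profile[g] for g in genes if g in profile]
            if measured:
                scores[sample][name] = mean(measured)
    return scores


def fit_centroids(scores, labels):
    """Average the module features of all samples sharing an outcome label."""
    grouped = {}
    for sample, features in scores.items():
        grouped.setdefault(labels[sample], []).append(features)
    return {
        label: {m: mean(row[m] for row in rows) for m in rows[0]}
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def distance(centroid):
        return sum((features[m] - centroid[m]) ** 2 for m in centroid)
    return min(centroids, key=lambda label: distance(centroids[label]))


# Toy data with hypothetical gene, set, and sample names.
expression = {
    "s1": {"g1": 2.0, "g2": 4.0, "g3": 0.0},
    "s2": {"g1": 0.0, "g2": 0.0, "g3": 3.0},
}
gene_sets = {"set_a": ["g1", "g2"], "set_b": ["g3"]}
labels = {"s1": "poor", "s2": "good"}

train_scores = module_scores(expression, gene_sets)
centroids = fit_centroids(train_scores, labels)
prediction = predict(centroids, {"set_a": 2.5, "set_b": 0.5})
```

The point of the sketch is the change in feature space: three gene-level measurements per sample become two module-level features, which is the dimensionality reduction that gene sets are expected to provide.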
We show that modules derived from single breast cancer datasets perform better on the validation data than gene-based predictors. We also show a trend relating compendium specificity to predictive performance: modules derived from a single breast cancer dataset or from a breast cancer-specific compendium perform better than those derived from a human cancer compendium. Additionally, the module-based predictor provides much richer insight into the underlying biology. Frequently selected gene sets are associated with processes such as cell cycle, E2F regulation, DNA damage response, proteasome and glycolysis. We analyzed two modules, one related to cell cycle and one to the OCT1 transcription factor. Individually, each of these modules significantly separates survival subgroups on both the training data and the independent validation data.