Ontario Institute for Cancer Research, Toronto, ON, Canada.
Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.
PLoS One. 2018 Sep 14;13(9):e0204123. doi: 10.1371/journal.pone.0204123. eCollection 2018.
Biomarkers are a key component of precision medicine. However, full clinical integration of biomarkers has met with challenges, partly attributed to analytical difficulties. Biomarker reproducibility has been shown to be susceptible to the choice of data preprocessing approach. Here, we systematically evaluated machine-learning ensembles of preprocessing methods as a general strategy to improve biomarker performance for prediction of survival from early breast cancer.
We risk-stratified breast cancer patients into either low-risk or high-risk groups based on four published hypoxia signatures (Buffa, Winter, Hu, and Sorensen), using 24 different preprocessing approaches for microarray normalization. The 24 binary risk profiles determined for each hypoxia signature were combined using a random forest to evaluate the efficacy of a preprocessing ensemble classifier. We demonstrate that the best way of merging preprocessing methods varies from signature to signature, and that there is likely no 'best' preprocessing pipeline that is universal across datasets, highlighting the need to evaluate ensembles of preprocessing algorithms. Further, we developed novel signatures for each preprocessing method, and the risk classifications from each were incorporated into a meta-random forest model. Interestingly, the classifications from these biomarkers and their ensemble show striking consistency, demonstrating that similar intrinsic biological information is being faithfully represented. As such, these classification patterns further confirm that there is a subset of patients whose prognosis is consistently challenging to predict.
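As a rough illustration of how per-method binary risk profiles can be assembled into one input for an ensemble model, the following minimal sketch dichotomizes signature scores and stacks the calls into a patients-by-methods matrix. The median cutoff, the function names, the method labels, and the toy scores are all assumptions made for this example; the abstract does not state how scores were dichotomized.

```python
from statistics import median


def binary_risk_profile(scores):
    """Label each patient 1 (high-risk) if the score exceeds the cohort
    median, else 0 (low-risk). The median cutoff is an assumption."""
    cutoff = median(scores)
    return [1 if s > cutoff else 0 for s in scores]


def build_feature_matrix(scores_by_method):
    """Return (method names, patients x methods matrix of binary calls)."""
    methods = sorted(scores_by_method)
    profiles = [binary_risk_profile(scores_by_method[m]) for m in methods]
    # zip(*profiles) transposes methods x patients into patients x methods.
    return methods, [list(row) for row in zip(*profiles)]


# Hypothetical signature scores for four patients under three
# preprocessing methods (the study used 24).
toy_scores = {
    "method_a": [0.2, 1.5, -0.3, 0.9],
    "method_b": [0.1, 1.1, 0.4, 0.3],
    "method_c": [-0.5, 0.8, 0.0, 1.2],
}
methods, matrix = build_feature_matrix(toy_scores)
```

Each row of `matrix` holds one patient's risk calls across methods. A matrix of this shape could then serve as the feature input to a random forest (for instance scikit-learn's `RandomForestClassifier`) trained against an outcome label, though the specific model settings used in the study are not given here.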
Performance of different prognostic signatures varies with preprocessing method. A simple classifier based on unanimous voting across classifications is a reliable way of improving on single preprocessing methods. Future signatures will likely require integration of intrinsic and extrinsic clinico-pathological variables to better predict disease-related outcomes.
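The unanimous-voting idea can be sketched in a few lines. This is one possible reading rather than the authors' implementation: a patient receives a risk call only when every preprocessing method agrees, and is otherwise left unclassified (`None`). The handling of discordant patients, the function names, and the toy calls are assumptions.

```python
def unanimous_vote(calls):
    """Return the shared call if all methods agree, else None.

    `calls` is one patient's binary risk calls (0 = low, 1 = high),
    one entry per preprocessing method.
    """
    if not calls:
        raise ValueError("at least one classification is required")
    first = calls[0]
    return first if all(c == first for c in calls) else None


def classify_cohort(matrix):
    """Apply unanimous voting to each row of a patients x methods matrix."""
    return [unanimous_vote(row) for row in matrix]


# Hypothetical calls for four patients under three preprocessing methods.
toy_matrix = [
    [0, 0, 0],  # all methods agree: low-risk
    [1, 1, 1],  # all methods agree: high-risk
    [0, 1, 0],  # discordant: left unclassified
    [1, 0, 1],  # discordant: left unclassified
]
ensemble_calls = classify_cohort(toy_matrix)
```

Under this reading, the discordant patients are the ones whose classification depends on the preprocessing choice, which is one way to surface the subset whose prognosis is consistently hard to predict.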