IBM TJ Watson Research, Yorktown Heights, New York, USA.
PLoS Comput Biol. 2013;9(5):e1003047. doi: 10.1371/journal.pcbi.1003047. Epub 2013 May 9.
Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease, and different breast cancer subtypes are treated differently. Understanding how prognosis differs with the molecular and phenotypic features of the disease is one avenue for improving care, because it allows the proper treatment to be matched to each molecular subtype. In this work, we employed a competition-based approach to modeling breast cancer prognosis, using large datasets containing genomic and clinical information together with an online real-time leaderboard that sped feedback to the modeling team and encouraged each modeler to work towards a higher-ranked submission. We find that machine learning methods combined with molecular features selected on the basis of expert prior knowledge can improve survival predictions compared to current best-in-class methodologies, and that ensemble models trained across multiple user submissions systematically outperform the individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goals of understanding general strategies for model optimization using clinical and molecular profiling data and of providing an objective, transparent system for assessing prognostic models.
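The ensemble finding can be illustrated with a minimal sketch. It rests on two assumptions, because the abstract names neither the scoring metric nor the ensembling scheme: submissions are scored with a concordance index (a common metric for survival predictions under censoring), and the ensemble is formed by averaging each submission's per-patient risk ranks. The function names and toy data are hypothetical and are not the authors' implementation.

```python
from itertools import combinations


def concordance_index(times, events, scores):
    """Fraction of comparable patient pairs whose predicted risk ordering
    matches the observed survival ordering (higher score = higher risk).

    Assumption: a pair is comparable only when the patient with the shorter
    follow-up time had an observed event (events[k] truthy = death observed).
    Pairs with tied times are skipped for simplicity.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def to_ranks(scores):
    """Convert raw scores to 0-based ranks (ties broken by input order)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)
    for rank, k in enumerate(order):
        ranks[k] = float(rank)
    return ranks


def rank_average_ensemble(submissions):
    """Average the per-patient ranks across several submissions.

    Ranking first puts submissions with different score scales on a
    common footing before they are combined.
    """
    ranked = [to_ranks(s) for s in submissions]
    return [sum(col) / len(col) for col in zip(*ranked)]


if __name__ == "__main__":
    # Toy data (hypothetical): follow-up times, event flags, two submissions.
    times = [2, 4, 6, 8, 10]
    events = [1, 1, 0, 1, 1]
    model_a = [0.9, 0.2, 0.5, 0.4, 0.1]
    model_b = [12.0, 30.0, 8.0, 5.0, 9.0]
    ensemble = rank_average_ensemble([model_a, model_b])
    for name, s in [("A", model_a), ("B", model_b), ("ensemble", ensemble)]:
        print(name, round(concordance_index(times, events, s), 3))
```

On toy data like this the ensemble is not guaranteed to beat each individual model; the abstract's claim concerns systematic behavior across many real submissions, which a sketch of this size cannot reproduce.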