Department of Computer Science, Wayne State University, Perinatology Research Branch, NICHD/NIH, Detroit, MI 48201, USA, The Microsoft Research - University of Trento Centre for Computational and Systems Biology, Rovereto 38068, Italy, ETH Zurich, Zurich 8092, Switzerland, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA and Philip Morris International, Research & Development, Neuchâtel CH-2000, Switzerland.
Bioinformatics. 2013 Nov 15;29(22):2892-9. doi: 10.1093/bioinformatics/btt492. Epub 2013 Aug 20.
More than a decade after microarrays were first used to predict the phenotype of biological samples, real-life applications for disease screening and for identifying the patients who would benefit most from treatment are still emerging. The scientific community's interest in identifying the best approaches to developing such prediction models was reaffirmed in a competition-style international collaboration called the IMPROVER Diagnostic Signature Challenge, whose results we describe here.
Fifty-four teams used public data to develop prediction models in four disease areas (multiple sclerosis, lung cancer, psoriasis and chronic obstructive pulmonary disease) and made predictions on new, blinded data that we generated. Teams were scored using three metrics that captured different aspects of prediction quality, and the best performers received awards. This article presents the challenge results and introduces to the community the approaches of the three best overall performers, as well as an R package that implements the approach of the best overall team. Analyses of the model performance data submitted in the challenge, together with additional simulations that we performed, revealed that (i) the quality of predictions depends more on the disease endpoint than on the particular approaches used in the challenge; (ii) the most important modeling factor (e.g. data preprocessing, feature selection and classifier type) is problem dependent; and (iii) for optimal results, datasets and methods have to be carefully matched. Biomedical factors such as disease severity and confidence in the diagnosis were found to be associated with misclassification rates across the different teams.
The lung cancer dataset is available from the Gene Expression Omnibus (accession GSE43580). The maPredictDSC R package implementing the approach of the best overall team is available at www.bioconductor.org or http://bioinformaticsprb.med.wayne.edu/.
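The three modeling factors named in the abstract (data preprocessing, feature selection and classifier type) can be made concrete with a small sketch. The Python example below is not the maPredictDSC implementation and does not reproduce any team's method or the challenge's scoring metrics. It is an illustration under stated assumptions: synthetic Gaussian data stand in for expression values, and z-score scaling, a mean-difference ranking and a nearest-centroid classifier are arbitrary stand-ins for the three factors. All function names are hypothetical.

```python
import random
import statistics


def fit_scaler(rows):
    """Preprocessing: learn per-feature mean and SD from the training set."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero SD
    return means, sds


def apply_scaler(rows, scaler):
    """Apply training-set scaling parameters to any set of samples."""
    means, sds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def select_features(rows, labels, k):
    """Feature selection: keep the k features with the largest
    |difference of class means| relative to the summed class SDs."""
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        spread = (statistics.pstdev(a) + statistics.pstdev(b)) or 1.0
        scores.append((abs(statistics.fmean(a) - statistics.fmean(b)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(rows, labels, features):
    """Classifier: one centroid per class over the selected features."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [statistics.fmean([r[j] for r in members]) for j in features]
    return centroids


def predict(row, centroids, features):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def dist(centroid):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
    return min(centroids, key=lambda cls: dist(centroids[cls]))


def make_data(n_per_class, n_features, informative, shift, rng):
    """Synthetic stand-in data (an assumption): class 1 is shifted on a few features."""
    rows, labels = [], []
    for cls in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if cls == 1:
                for j in informative:
                    row[j] += shift
            rows.append(row)
            labels.append(cls)
    return rows, labels


def run_demo(seed=0, k=3):
    """Train on one synthetic set, predict on a second one generated the same way."""
    rng = random.Random(seed)
    informative = [0, 1, 2]
    train_x, train_y = make_data(30, 20, informative, 3.0, rng)
    test_x, test_y = make_data(30, 20, informative, 3.0, rng)
    scaler = fit_scaler(train_x)
    train_z = apply_scaler(train_x, scaler)
    test_z = apply_scaler(test_x, scaler)
    features = select_features(train_z, train_y, k)
    centroids = fit_centroids(train_z, train_y, features)
    preds = [predict(r, centroids, features) for r in test_z]
    accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
    return features, accuracy


features, accuracy = run_demo(seed=0)
print("selected features:", sorted(features))
print("held-out accuracy:", accuracy)
```

Because each step is a separate function, one factor can be swapped at a time (for example, a different ranking statistic in `select_features`) while the others stay fixed. That is one simple way to probe which factor matters most for a given dataset, in the spirit of finding (ii).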