Stiglic Gregor, Wang Fei, Davey Adam, Obradovic Zoran
University of Maribor, Maribor, Slovenia.
IBM T.J. Watson Research Center, Yorktown Heights, NY.
AMIA Annu Symp Proc. 2014 Nov 14;2014:1072-81. eCollection 2014.
Regulations and privacy concerns often hinder the exchange of healthcare data between hospitals or other healthcare providers. Sharing predictive models built on the original data and averaging their results offers an alternative route to more efficient prediction of outcomes for new cases. Although many techniques exist for combining the outputs of different predictive models, few studies attempt to interpret the results obtained from ensemble-learning methods.
We propose a novel approach to classification based on models from different hospitals that allows a high level of performance along with comprehensibility of the obtained results. Our approach is based on regularized sparse regression models at two hierarchical levels and exploits the interpretability of the obtained regression coefficients to rank the contribution of hospitals in terms of outcome prediction.
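The two-level idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes L1-penalized logistic regression fitted by proximal gradient descent at both levels, synthetic data in place of discharge records, and a second-level model trained on a separate set held by the site that receives the shared models. All function names (`fit_sparse_logistic`, `stack`, `make_hospital`) and hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_sparse_logistic(X, y, l1=0.01, lr=0.5, epochs=200):
    # L1-penalized logistic regression fitted by proximal gradient descent.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        stepped = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        w = [math.copysign(max(abs(s) - lr * l1, 0.0), s) for s in stepped]
    return w, b

def predict_proba(model, X):
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def stack(local_models, X):
    # One column per hospital: that hospital's predicted probability per case.
    columns = [predict_proba(m, X) for m in local_models]
    return [list(row) for row in zip(*columns)]

def make_hospital(n, informative=True):
    # Synthetic stand-in for one hospital's records (an assumption of this sketch).
    X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [int(random.random() < (sigmoid(2 * x[0] - x[1]) if informative else 0.5))
         for x in X]
    return X, y

random.seed(0)
hospitals = [make_hospital(150), make_hospital(150),
             make_hospital(150, informative=False)]
X_meta, y_meta = make_hospital(150)

# Level 1: each hospital fits a sparse model on its own data; only models are shared.
local_models = [fit_sparse_logistic(X, y) for X, y in hospitals]
# Level 2: a sparse model over the hospitals' predictions.
meta_model = fit_sparse_logistic(stack(local_models, X_meta), y_meta)

# Rank hospitals by the magnitude of their second-level coefficient.
ranking = sorted(range(len(hospitals)),
                 key=lambda h: abs(meta_model[0][h]), reverse=True)
probs = predict_proba(meta_model, stack(local_models, X_meta))
```

In this sketch, `ranking` orders hospital indices by the absolute value of their second-level coefficient, which is one plausible reading of how coefficients could be used to rank each hospital's contribution.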
The proposed approach was used to predict 30-day all-cause readmission for pediatric patients in 54 Californian hospitals. Using repeated holdout evaluation on more than 60,000 hospital discharge records, we compared the proposed approach to alternative approaches. The performance of the two-level classification model was measured using the Area Under the ROC Curve (AUC), with an additional evaluation that uncovered the importance and contribution of each single data source (i.e., hospital) to the final result. The results for the best distributed model (AUC=0.787, 95% CI: 0.780-0.794) demonstrate no significant difference in terms of AUC performance when compared to a single elastic net model built on all available data (AUC=0.789, 95% CI: 0.781-0.796).
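As a rough illustration of repeated holdout evaluation with AUC, the following sketch splits the data at random several times, scores the held-out part, and summarizes the AUC values with a mean and a normal-approximation 95% interval. The abstract does not state how its confidence intervals were computed, so the interval formula, the toy mean-difference scorer, the synthetic data, and all names here are assumptions.

```python
import math
import random
import statistics

def auc(y_true, scores):
    # Mann-Whitney form: chance that a random positive outscores a random
    # negative, with ties counted as one half.
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_diff(X, y):
    # Toy scorer (placeholder for a real model): per-feature difference
    # between the class means.
    d = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [statistics.mean(x[j] for x in pos) - statistics.mean(x[j] for x in neg)
            for j in range(d)]

def score(w, X):
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

def repeated_holdout(X, y, repeats=20, test_frac=0.3, seed=0):
    # Each repeat draws a fresh random train/test split and records the test AUC.
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = int(len(X) * test_frac)
    results = []
    for _ in range(repeats):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        w = fit_mean_diff([X[i] for i in train], [y[i] for i in train])
        results.append(auc([y[i] for i in test], score(w, [X[i] for i in test])))
    return results

def mean_ci(values, z=1.96):
    # Normal-approximation interval around the mean of the repeats.
    m = statistics.mean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half, m + half

# Synthetic data: the outcome depends on the first feature only.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [int(rng.random() < 1 / (1 + math.exp(-2 * x[0]))) for x in X]

aucs = repeated_holdout(X, y)
mean_auc, lo, hi = mean_ci(aucs)
print(f"AUC={mean_auc:.3f}, 95% CI: {lo:.3f}-{hi:.3f}")
```

Two models evaluated on the same splits can then be compared by checking whether their intervals overlap, although a paired test on the per-split differences would be a stricter comparison.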
This paper presents a novel approach to improved classification with shared predictive models for environments where centralized collection of data is not possible. The significant improvements in classification performance and interpretability of results demonstrate the effectiveness of our approach.