Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China.
Department of Surgical Oncology, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China.
Artif Intell Med. 2020 Mar;103:101814. doi: 10.1016/j.artmed.2020.101814. Epub 2020 Feb 5.
The accuracy of a prognostic prediction model has become essential to the quality and reliability of the health-related decisions clinicians make in modern medicine. Unfortunately, individual institutions often lack enough samples to give models sufficient statistical power. One mitigation is to expand data collection from a single institution to multiple centers, collectively increasing the sample size. However, sharing sensitive biomedical data for research raises complicated issues. Machine learning models such as random forests (RF), though commonly used and well performing for prognostic prediction, usually perform worse in multicenter privacy-preserving data mining scenarios than a centrally trained version.
In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed approach yields a multicenter RF model whose performance is on par with, or even better than, that of a centrally trained RF, without the need to aggregate the raw data. Moreover, the model incorporates an importance ranking step designed for feature selection without sharing patient-level information.
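The abstract does not say how the cross-site importance ranking is computed. As a minimal sketch, assume each site shares only a per-feature importance vector and its sample count, and a coordinator combines them by a sample-weighted average. The function name `aggregate_importance`, the weighting scheme, and the feature names are illustrative assumptions, not the authors' method.

```python
from typing import Dict, List, Tuple

SiteReport = Tuple[int, Dict[str, float]]  # (n_samples, {feature: importance})


def aggregate_importance(site_reports: List[SiteReport]) -> List[Tuple[str, float]]:
    """Combine per-site feature importances into one global ranking.

    Assumed scheme: only these summaries leave a site, so no patient-level
    rows are exchanged. Sites are weighted by their sample counts.
    """
    total = sum(n for n, _ in site_reports)
    if total <= 0:
        raise ValueError("need at least one sample across all sites")

    combined: Dict[str, float] = {}
    for n, scores in site_reports:
        weight = n / total
        for feature, score in scores.items():
            combined[feature] = combined.get(feature, 0.0) + weight * score

    # Highest importance first; ties broken alphabetically for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical reports from two sites (values invented for illustration).
reports = [
    (300, {"age": 0.5, "stage": 0.3, "grade": 0.2}),
    (100, {"age": 0.2, "stage": 0.6, "grade": 0.2}),
]
ranking = aggregate_importance(reports)
top_features = [name for name, _ in ranking[:2]]
```

A simplified model could then be trained on `top_features` alone. Whether importance scores themselves leak information depends on the privacy model, which this sketch does not address.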
The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compared the performance of the distributed random forest model under different privacy parameters and different percentages of enhancement data to validate the effectiveness and plausibility of our approach. Then, we compared the discrimination and calibration ability of the proposed multicenter random forest with those of a centrally trained random forest model, other tree-based classifiers, and several commonly used machine learning methods. The results show that, in both groups, the proposed model provides better discrimination and calibration than the centrally trained RF model and the other candidate models while following the privacy-preserving rules. Additionally, the simplified model built from the feature importance ranking of the proposed approach also shows good discrimination and calibration.
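The abstract does not name the metrics used for discrimination and calibration. A common pairing, assumed here purely for illustration, is the area under the ROC curve (AUC) for discrimination and the Brier score as a calibration-sensitive summary. The sketch below computes both in plain Python on invented labels and predicted probabilities.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome;
    lower is better."""
    if len(y_true) == 0:
        raise ValueError("need at least one prediction")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented example data, not results from the study.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
model_auc = auc(labels, probs)
model_brier = brier_score(labels, probs)
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a sketch but would normally be replaced by a rank-based computation on larger cohorts.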
The proposed random forest model exhibits strong prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide a feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in a collaborative clinical research network and addresses practical issues in real-world applications of medical artificial intelligence.