Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America.
King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, Saudi Arabia.
PLoS Comput Biol. 2018 Apr 26;14(4):e1006106. doi: 10.1371/journal.pcbi.1006106. eCollection 2018 Apr.
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, because race and ethnicity are powerful confounders for many health exposures and treatment outcomes and are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). Our deep learning approach, RIDDLE, yielded significantly better classification performance on every metric considered: accuracy, cross-entropy loss (error), precision, recall, and area under the receiver operating characteristic curve (all p < 10^-9). We made specific efforts to interpret the trained neural network models in order to identify, quantify, and visualize medical features that are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health care, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation that predispose to diseases.
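To make the evaluation metrics named above concrete, the following is a minimal sketch of how accuracy, cross-entropy loss, precision, and recall can be computed for a multi-class classifier. It is not the authors' evaluation code. The function name `evaluate`, the toy labels and probabilities, and the choice of macro averaging (an unweighted mean over classes) are all assumptions made for illustration; the abstract does not state how per-class scores were averaged.

```python
import math


def evaluate(y_true, probs, n_classes):
    """Hypothetical helper: multi-class metrics from predicted probabilities.

    y_true: list of integer class labels.
    probs:  list of per-sample probability lists, one entry per class.
    """
    n = len(y_true)
    # Predicted class is the index of the largest probability in each row.
    y_pred = [max(range(n_classes), key=lambda k, row=row: row[k]) for row in probs]

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Cross-entropy: mean negative log-probability assigned to the true class.
    eps = 1e-12  # guards against log(0)
    cross_entropy = -sum(math.log(max(row[t], eps)) for t, row in zip(y_true, probs)) / n

    precisions, recalls = [], []
    for k in range(n_classes):
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)

    return {
        "accuracy": accuracy,
        "cross_entropy": cross_entropy,
        # Macro averaging is an assumption, not taken from the source.
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
    }


# Toy data (fabricated for illustration only): 4 samples, 3 classes.
y_true = [0, 1, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
metrics = evaluate(y_true, probs, n_classes=3)
print(metrics)
```

Cross-entropy differs from the other three metrics in that it uses the full predicted probability rather than only the top-ranked class, so it penalizes confident wrong predictions more heavily than hesitant ones.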