Preo Nicolo', Capobianco Enrico
Bip xScience, Milan, Italy.
Center for Computational Science, University of Miami, Miami, FL, United States.
Front Big Data. 2019 Sep 27;2:30. doi: 10.3389/fdata.2019.00030. eCollection 2019.
Electronic health records (EHR) play an important role in redefining phenotypes, given the wealth and heterogeneity of information now available from disparate data sources. A recent cross-sectional retrospective study described the potential of EHR for model-based screening of type 2 diabetes mellitus (T2D). About 10,000 US patients were analyzed with a variety of inference techniques applied to all records, which varied in their degree of completeness. The analyses in that reference study indicated that EHR phenotypes significantly improved T2D detection. Using the same US patients and T2D data, we propose an integrative inference approach that leverages the predictive power of EHR features selected by two well-known methods, Random Forests and Lasso. The goal is 2-fold: to reduce the Big Data redundancies that are potentially harmful to the predictive learning task, and to exploit the interconnectivity of EHR features. A mutual information (MI) network is the inference tool used to identify communities that help prioritize the significant T2D features underlying the similarity between patients. The communities detected after applying each method differed in granularity and were centered especially on T2D comorbidities and risk factors. As such, they appear highly relevant to the assessment of two main issues: T2D disease burden and prevention. Our analytical approach offers a solution for managing the EHR scale factor in a complex disease context. EHR are rich sources of phenotypic diversity, from which novel stratifications of patients are expected. To enable these results, both pre-screening of variables and calibration of risk prediction methods become necessary steps in EHR analyses. We have presented networks identifying major T2D communities. The specific significance assigned to comorbidities and risk factors in relation to T2D can be inferred accurately from a suitably reduced number of EHR features.
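As a rough illustration of the MI-network idea described in the abstract, the following minimal sketch computes empirical mutual information between pairs of discrete features, keeps the pairs above a threshold as network edges, and groups the linked features. It is not the authors' pipeline: the function names, the threshold value, and the toy data are all assumptions made for this example, and connected components are used as a simple stand-in for a proper community detection algorithm. The Random Forests and Lasso selection step is assumed to have already produced the feature set.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


def mi_network(features, threshold):
    """Return {(f, g): mi} for feature pairs whose MI exceeds the threshold."""
    edges = {}
    for f, g in combinations(sorted(features), 2):
        mi = mutual_information(features[f], features[g])
        if mi > threshold:
            edges[(f, g)] = mi
    return edges


def connected_components(nodes, edges):
    """Group nodes joined by edges (a crude stand-in for community detection)."""
    adjacency = {node: set() for node in nodes}
    for f, g in edges:
        adjacency[f].add(g)
        adjacency[g].add(f)
    seen, groups = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(adjacency[current] - seen)
        groups.append(group)
    return groups


# Invented binary indicators for 8 hypothetical patients; the feature names
# and values are illustrative only and carry no clinical meaning.
toy_features = {
    "hypertension": [1, 1, 1, 1, 0, 0, 0, 0],
    "obesity":      [1, 1, 1, 0, 0, 0, 0, 1],
    "smoker":       [1, 0, 1, 0, 1, 0, 1, 0],
    "asthma":       [1, 0, 1, 0, 0, 1, 1, 0],
}

edges = mi_network(toy_features, threshold=0.1)  # threshold is an arbitrary choice
communities = connected_components(toy_features.keys(), edges)
print(edges)
print(communities)
```

On this toy input, two feature pairs share information above the threshold and the remaining pairs are empirically independent, so the features fall into two groups. On real EHR data, the threshold would need calibration and a dedicated community detection method would likely replace the connected-components step.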