Dagliati Arianna, Marini Simone, Sacchi Lucia, Cogni Giulia, Teliti Marsida, Tibollo Valentina, De Cata Pasquale, Chiovato Luca, Bellazzi Riccardo
1 Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy.
2 Centre for Health Technologies, University of Pavia, Pavia, Italy.
J Diabetes Sci Technol. 2018 Mar;12(2):295-302. doi: 10.1177/1932296817706375. Epub 2017 May 12.
Machine learning, which develops algorithms able to learn patterns and decision rules from data, is one of the areas where artificial intelligence is having a major impact. Machine learning algorithms have been embedded into data mining pipelines, which combine them with classical statistical strategies to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline was used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications from the electronic health record data of nearly one thousand patients. The pipeline comprises clinical center profiling, predictive model targeting, predictive model construction, and model validation. After handling missing data with random forest (RF) and applying suitable strategies to address class imbalance, we used logistic regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy in three time scenarios: 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). The variables considered are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. The final models, tailored to each complication, achieved an accuracy of up to 0.838. Different variables were selected for each complication and time scenario, yielding specialized models that are easy to translate into clinical practice.
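The abstract names logistic regression with stepwise feature selection but does not give the selection criterion or implementation. As a minimal sketch only, the following Python example runs a forward stepwise search that adds one variable at a time while the Akaike information criterion (AIC) keeps improving. The use of AIC, the forward direction, the gradient-ascent fitting routine, the function names (`fit_logistic`, `forward_stepwise`), and the synthetic data are all assumptions made for illustration; none of them is taken from the MOSAIC pipeline or its patient data.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, cols, iters=300, lr=0.5):
    """Fit an intercept plus the chosen columns by batch gradient ascent.

    Returns (weights, log_likelihood). Illustrative helper, not a library API.
    """
    n = len(y)
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    w = [0.0] * (len(cols) + 1)  # w[0] is the intercept
    for _ in range(iters):
        grad = [0.0] * len(w)
        for r, t in zip(rows, y):
            err = t - sigmoid(sum(wi * ri for wi, ri in zip(w, r)))
            for j, rj in enumerate(r):
                grad[j] += err * rj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    ll = 0.0
    for r, t in zip(rows, y):
        z = sum(wi * ri for wi, ri in zip(w, r))
        # log p(t | z) = t*z - log(1 + exp(z)), evaluated without overflow
        ll += t * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return w, ll


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def forward_stepwise(X, y):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    selected = []
    remaining = list(range(len(X[0])))
    _, ll = fit_logistic(X, y, selected)
    best = aic(ll, 1)  # intercept-only model
    while remaining:
        scores = []
        for c in remaining:
            _, ll = fit_logistic(X, y, selected + [c])
            scores.append((aic(ll, len(selected) + 2), c))
        cand_aic, cand = min(scores)
        if cand_aic >= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_aic
    return selected, best


# Synthetic data (assumption): column 0 drives the outcome, columns 1-2 are noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if row[0] + 0.5 * rng.gauss(0, 1) > 0 else 0 for row in X]

selected, final_aic = forward_stepwise(X, y)
print("selected columns:", selected, "AIC:", round(final_aic, 1))
```

In a setting like the one described, the columns would be the seven clinical variables and one such search would be run per complication and time scenario, which is how different variable subsets can arise for each model. A production analysis would more likely rely on an established statistics package than on a hand-written fitting loop.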