Graduate Institute of Biomedical Informatics, Taipei Medical University, Taiwan.
Department of Information Management, National Taipei University of Nursing and Health Science, Taiwan; Master Program in Global Health and Development, Taipei Medical University, Taipei, Taiwan.
Comput Methods Programs Biomed. 2016 Mar;125:58-65. doi: 10.1016/j.cmpb.2015.11.009. Epub 2015 Nov 27.
Diabetes mellitus is associated with an increased risk of liver cancer, and these two diseases are among the most common and important causes of morbidity and mortality in Taiwan.
To use data mining techniques to develop a model for predicting the development of liver cancer within 6 years of diagnosis with type II diabetes.
Data were obtained from the National Health Insurance Research Database (NHIRD) of Taiwan, which covers approximately 22 million people. We selected patients who were newly diagnosed with type II diabetes during 2000-2003 and had no prior cancer diagnosis. We then used encrypted personal IDs to link these records with the cancer registry database and determine whether each patient was later diagnosed with liver cancer. In total, we identified 2060 patients and assigned them to a case group (patients diagnosed with liver cancer after diabetes) or a control group (patients with diabetes but no liver cancer). Candidate risk factors were identified from a literature review and physicians' suggestions. A chi-square test was then conducted on each independent variable (potential risk factor) to compare patients with and without liver cancer, and the variables found to be significant were retained as predictors. We subsequently trained and tested artificial neural network (ANN) and logistic regression (LR) prediction models. The dataset was randomly divided into a training group and a test group. The training group consisted of 1442 cases (70% of the entire dataset) and was used to develop the prediction models. The remaining 618 cases (30%) were assigned to the test group for model validation.
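The variable screening and the 70/30 split described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the random seed, and the 2x2 counts are hypothetical, and the abstract does not say which software or which chi-square variant (with or without continuity correction) was used. The sketch uses the uncorrected Pearson statistic for a 2x2 table.

```python
import math
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Rows: risk factor present / absent.
    Columns: liver cancer / no liver cancer.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def split_70_30(n_records, seed=0):
    """Randomly split record indices into 70% training and 30% test."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_train = n_records * 70 // 100
    return indices[:n_train], indices[n_train:]


# Hypothetical counts for one candidate risk factor (not from the study).
stat, p_value = chi_square_2x2(40, 60, 20, 180)
keep_variable = p_value < 0.05  # assumed significance level

train_idx, test_idx = split_70_30(2060)
print(len(train_idx), len(test_idx))  # 1442 618
```

Applied to the 2060 records in the study, this split reproduces the reported group sizes of 1442 and 618. The 0.05 threshold is an assumption; the abstract says only that significant variables were selected.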
The following 10 variables were used to develop the ANN and LR models: sex, age, alcoholic cirrhosis, nonalcoholic cirrhosis, alcoholic hepatitis, viral hepatitis, other types of chronic hepatitis, alcoholic fatty liver disease, other types of fatty liver disease, and hyperlipidemia. The ANN outperformed LR in terms of sensitivity (0.757), specificity (0.755), and area under the receiver operating characteristic curve (0.873). After identifying the optimal prediction model, we used it to build a web-based application for liver cancer prediction, which can support physicians during consultations with diabetes patients.
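For readers unfamiliar with the area under the receiver operating characteristic curve, the sketch below computes it from predicted risk scores using its pairwise-ranking interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name and the toy labels and scores are hypothetical and are not study data; the abstract does not describe how the authors computed the metric.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for cases (liver cancer), 0 for controls.
    scores: predicted risk for each patient, higher means more likely a case.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for cs in case_scores:
        for ns in control_scores:
            if cs > ns:
                wins += 1.0      # case ranked above control
            elif cs == ns:
                wins += 0.5      # ties count as half
    return wins / (len(case_scores) * len(control_scores))


# Toy example (hypothetical values).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

This quadratic-time version is meant for clarity; on a test set of 618 patients it is still fast enough, though a rank-based implementation would scale better.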
In the original dataset (n=2060), 33% of diabetes patients were diagnosed with liver cancer (n=515). After using 70% of the original data to train the model and the other 30% for testing, the sensitivity and specificity of our model were 0.757 and 0.755, respectively. That is, the model correctly identified 75.7% of the diabetes patients who went on to receive a liver cancer diagnosis and 75.5% of those who did not. These results suggest that the model can serve as an effective predictor of liver cancer in diabetes patients. The physicians we consulted agreed that the model can help them advise patients at risk of liver cancer and may help reduce the future cost of cancer treatment.
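The two metrics quoted above are simple ratios over the test-set confusion matrix: sensitivity is the share of actual cases flagged by the model, and specificity is the share of actual controls left unflagged. The following sketch shows the calculation; the function name and the toy labels and predictions are hypothetical, since the abstract does not report the underlying counts.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions.

    labels: 1 if the patient developed liver cancer, else 0.
    predictions: 1 if the model flagged the patient, else 0.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # cases flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # cases missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # controls cleared
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # controls flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (hypothetical values): 4 cases followed by 4 controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
print(sensitivity_specificity(labels, predictions))  # (0.75, 0.75)
```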