Kop Reinier, Hoogendoorn Mark, Teije Annette Ten, Büchner Frederike L, Slottje Pauline, Moons Leon M G, Numans Mattijs E
VU University Amsterdam, Department of Computer Science, Amsterdam, The Netherlands.
Comput Biol Med. 2016 Sep 1;76:30-8. doi: 10.1016/j.compbiomed.2016.06.019. Epub 2016 Jun 22.
Over the past years, research using routine care data extracted from Electronic Medical Records (EMRs) has increased tremendously. Yet there are no straightforward, standardized strategies for pre-processing these data. We propose a dedicated medical pre-processing pipeline that addresses many of the problems and opportunities of EMR data, such as their temporal, inaccurate and incomplete nature. The pipeline is demonstrated on a dataset of routinely recorded general practice EMR data from over 260,000 patients, in which the occurrence of colorectal cancer (CRC) is predicted using several machine learning techniques (i.e., classification and regression trees (CART), logistic regression (LR), and random forests (RF)) and subsets of the data. CRC is a common type of cancer for which early detection has proven important yet challenging. The results are threefold. First, the predictive models generated using our pipeline reconfirmed known predictors and identified new, medically plausible predictors from the cardiovascular and metabolic disease domain, which supports the pipeline's effectiveness. Second, the difference between the best model generated from the data-driven subset (AUC 0.891) and the best model generated from the current state-of-the-art hypothesis-driven subset (AUC 0.864) is statistically significant at the 95% confidence level. Third, the pipeline itself is highly generic and independent of the specific disease targeted and the EMR used. In conclusion, applying established machine learning techniques in combination with the proposed pipeline to EMRs has great potential to enhance disease prediction, and hence early detection and intervention in medical practice.
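The abstract compares the AUC of two models and reports the difference as significant, but it does not state how significance was assessed. As an illustration only, the sketch below shows one common way such a comparison could be made: a paired bootstrap confidence interval for the AUC difference between two models scored on the same patients. The bootstrap is an assumption here, not the authors' documented method, and the names `auc` and `bootstrap_auc_difference` as well as the synthetic scores are hypothetical.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(labels, scores_a, scores_b,
                             n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC(a) - AUC(b).

    Patients are resampled with replacement, and both models are
    evaluated on the same resample so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lacks one class, so AUC is undefined
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lower = diffs[max(0, int((alpha / 2) * n_boot))]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Synthetic stand-in data (not from the study): model A is less noisy.
    rng = random.Random(42)
    labels = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
    scores_a = [y + rng.gauss(0, 0.6) for y in labels]
    scores_b = [y + rng.gauss(0, 1.0) for y in labels]
    lower, upper = bootstrap_auc_difference(labels, scores_a, scores_b,
                                            n_boot=200)
    print(f"AUC A: {auc(labels, scores_a):.3f}")
    print(f"AUC B: {auc(labels, scores_b):.3f}")
    print(f"95% CI for the difference: [{lower:.3f}, {upper:.3f}]")
```

Under this approach, an interval that excludes zero would be read as a significant difference at the chosen level. The pairwise AUC computation is quadratic in the number of patients, so a rank-based implementation would be preferable for a cohort of the size described in the abstract.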