Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, Estonia.
Institute of Computer Science, University of Tartu, Tartu, Estonia.
Eur J Med Res. 2023 Mar 25;28(1):133. doi: 10.1186/s40001-023-01087-6.
Ischemic stroke (IS) is a major health risk for which no effective, generally applicable measures of primary prevention exist. Early warning signals that are easy to detect and widely available could save lives. Estonia has a single nationwide Electronic Health Record (EHR) database that stores patients' medical information from hospitals and primary care providers.
We extracted structured and unstructured data from the EHRs of Estonian Biobank (EstBB) participants and evaluated different input data formats to understand how this continuously growing dataset should be prepared to give the best predictions. We demonstrated the utility of the EHR database for finding blood- and urine-based biomarkers for IS by applying different analytical and machine learning (ML) methods.
Several early trends in changes of common clinical laboratory parameters (a set of red blood cell indices, the lymphocyte/neutrophil ratio, etc.) were identified as predictors of IS. The developed ML models predicted the future occurrence of IS with very high accuracy, and Random Forests proved to be the method best suited to EHR data.
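To illustrate what an "early trend" in a laboratory parameter could look like as a model input, the sketch below derives the lymphocyte/neutrophil ratio from dated blood counts and summarizes its change over time as a least-squares slope. This is only one plausible way to build such a feature, not the procedure used in the study; the record layout, the function names `lymph_neut_ratio` and `trend_per_year`, and the numeric values are all hypothetical.

```python
from datetime import date


def lymph_neut_ratio(lymphocytes, neutrophils):
    """Return the lymphocyte/neutrophil ratio, or None if it is undefined."""
    if not neutrophils:
        return None
    return lymphocytes / neutrophils


def trend_per_year(measurements):
    """Least-squares slope (change per year) of a list of (date, value) pairs.

    Returns None when fewer than two distinct time points are available.
    """
    if len(measurements) < 2:
        return None
    t0 = min(d for d, _ in measurements)
    xs = [(d - t0).days / 365.25 for d, _ in measurements]
    ys = [v for _, v in measurements]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all measurements taken on the same day
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical blood counts for one person (illustrative values only):
# (sampling date, lymphocytes, neutrophils), both in 10^9 cells/L.
records = [
    (date(2015, 3, 1), 2.0, 4.0),
    (date(2017, 3, 1), 1.8, 4.5),
    (date(2019, 3, 1), 1.5, 5.0),
]

# Build the ratio time series, skipping undefined ratios.
series = []
for sampled_on, lymph, neut in records:
    ratio = lymph_neut_ratio(lymph, neut)
    if ratio is not None:
        series.append((sampled_on, ratio))

# One number per person that a classifier could take as a feature.
slope = trend_per_year(series)
```

A per-person slope like this could then be supplied, alongside other features, to a classifier such as a Random Forest; whether the study encoded trends in this form is not stated in the abstract.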
We conclude that the EHR database and the risk factors uncovered are valuable resources for screening the population for risk of IS, for constructing disease risk scores, and for refining ML-based prediction models for IS.