Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Center of Health Data Science, Berlin, Germany.
Int J Med Inform. 2024 Dec;192:105646. doi: 10.1016/j.ijmedinf.2024.105646. Epub 2024 Oct 5.
Large-scale health data has significant potential for research and innovation, especially longitudinal data, which offers insights into prevention, disease progression, and treatment effects. Yet analyzing this type of data is complex, because data points are documented repeatedly along the timeline. As a consequence, extracting cross-sectional tabular data suitable for statistical analysis and machine learning can be challenging for medical researchers and data scientists alike, and existing tools do not balance ease of use against comprehensiveness.
This paper introduces HERALD, a novel domain-specific query language designed to support the transformation of longitudinal health data into cross-sectional tables. We describe the basic concepts, the query syntax, a graphical user interface for constructing and executing HERALD queries, as well as an integration into Informatics for Integrating Biology and the Bedside (i2b2).
The syntax of HERALD mimics natural language and supports different query types for selection, aggregation, analysis of relationships, and searching for data points based on filter expressions and temporal constraints. Using a hierarchical concept model, queries are executed individually for the data of each patient, while constructing tabular output. HERALD is closed, meaning that queries process data points and generate data points. Queries can refer to data points that have been produced by previous queries, providing a simple, but powerful nesting mechanism.
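The closure and nesting properties described above can be illustrated with a minimal Python sketch. It does not use HERALD syntax and is not the HERALD execution engine; the `DataPoint` type, the function names, the concept paths, and the example values are all hypothetical and chosen only for illustration. Each "query" takes data points and returns data points, a later query refers to the output of an earlier one by name, and the queries run once per patient to produce one table row each.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class DataPoint:
    concept: str                   # path in a (hypothetical) concept hierarchy
    start: date                    # when the data point was documented
    value: Optional[float] = None  # numeric value, if any


def first(points, concept, name):
    """Selection: earliest point under `concept`, re-emitted as a point called `name`."""
    hits = sorted((p for p in points if p.concept.startswith(concept)),
                  key=lambda p: p.start)
    return [DataPoint(name, hits[0].start, hits[0].value)] if hits else []


def average_after(points, concept, anchor, name):
    """Aggregation with a temporal constraint: mean value of `concept`
    on or after the point named `anchor` (typically a previous query's result)."""
    anchors = [p for p in points if p.concept == anchor]
    if not anchors:
        return []
    t0 = anchors[0].start
    values = [p.value for p in points
              if p.concept.startswith(concept)
              and p.start >= t0 and p.value is not None]
    return [DataPoint(name, t0, mean(values))] if values else []


def to_row(points):
    """Run both queries for one patient and flatten the results into a row."""
    derived = list(points)
    # Query 1 produces a data point that becomes visible to later queries.
    derived += first(derived, "Diagnosis/Diabetes", "first_diagnosis")
    # Query 2 is nested: it refers to the result of query 1 by name.
    derived += average_after(derived, "Lab/HbA1c", "first_diagnosis",
                             "mean_hba1c_after_dx")
    dx = [p for p in derived if p.concept == "first_diagnosis"]
    avg = [p for p in derived if p.concept == "mean_hba1c_after_dx"]
    return {
        "first_diagnosis": dx[0].start if dx else None,
        "mean_hba1c_after_dx": avg[0].value if avg else None,
    }


# Made-up longitudinal data for two patients.
patients = {
    "p1": [
        DataPoint("Lab/HbA1c", date(2020, 1, 10), 6.0),
        DataPoint("Diagnosis/Diabetes/Type2", date(2020, 3, 1)),
        DataPoint("Lab/HbA1c", date(2020, 6, 1), 8.0),
        DataPoint("Lab/HbA1c", date(2021, 1, 1), 7.0),
    ],
    "p2": [DataPoint("Lab/HbA1c", date(2019, 5, 5), 5.5)],
}

# Queries are executed individually per patient; each patient yields one row.
table = {pid: to_row(points) for pid, points in patients.items()}
```

Because every query consumes and produces the same kind of object, new queries can be stacked on earlier results without a separate intermediate representation, which is the property the abstract refers to as closure.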
The open-source implementation consists of a HERALD query parser, an execution engine, and a web-based user interface for query construction and statistical analysis. The implementation can be deployed as a standalone component and integrated into self-service data analytics environments like i2b2 as a plugin. HERALD can be a valuable tool for data scientists and machine learning experts, as it simplifies the process of transforming longitudinal health data into tables and data matrices.
The construction of cross-sectional tables from longitudinal data can be supported through dedicated query languages that strike a reasonable balance between language complexity and transformation capabilities.