Griffier Romain, Mougin Fleur, Jouhet Vianney
Service d'Information Médicale, Informatique et Archivistique Médicale (IAM), Pôle de Santé Publique, Bordeaux University Hospital, Bordeaux, France.
Team AHeaD, Inserm Bordeaux Population Health Research Center, UMR 1219, Bordeaux University, Bordeaux, France.
JMIR Med Inform. 2025 Apr 24;13:e65753. doi: 10.2196/65753.
The volume of digital data in health care is continually growing. In addition to its use in health care, the health data collected can also serve secondary purposes, such as research. In this context, clinical data warehouses (CDWs) provide the infrastructure and organization necessary to enhance the secondary use of health data. Various data models have been proposed for structuring data in a CDW, including the Informatics for Integrating Biology & the Bedside (i2b2) model, which relies on a relational database. However, this persistence approach can lead to performance issues when executing queries on massive data sets.
This study aims to describe the transformations needed, and their implementation, to enable i2b2's search engine to perform the phenotyping task when data are persisted in a NoSQL Elasticsearch database.
This study compares data persistence in a standard relational database with persistence in a NoSQL Elasticsearch database in terms of query response and execution performance (focusing on counting queries based on structured data, numerical data, and free text, including temporal filtering) as well as hardware resource requirements. The data loading and updating processes are also described.
We propose adaptations to the i2b2 model to accommodate the specific features of Elasticsearch, particularly its inability to perform joins between different indexes. The implementation was tested and evaluated within the CDW of Bordeaux University Hospital, which contains data on 2.5 million patients and over 3 billion observations. Overall, Elasticsearch achieves shorter query execution times compared with a relational database, with particularly significant performance gains for free-text searches. Additionally, compared with an indexed relational database (including a full-text index), Elasticsearch requires less disk space for storage.
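As an illustration of why the lack of cross-index joins forces a change to the data model, the following minimal sketch denormalizes flat observation rows into one document per patient and builds a counting query with a temporal filter. It is not the authors' implementation: the abstract does not specify the document layout, so the nested `observations` field, the helper names `build_patient_documents` and `count_query_body`, and the sample rows are all assumptions made for this example. The column names `patient_num`, `concept_cd`, `start_date`, and `nval_num` follow the standard i2b2 `observation_fact` table, and the query body uses the Elasticsearch `nested`, `bool`, `term`, and `range` query clauses.

```python
from collections import defaultdict


def build_patient_documents(observation_rows):
    """Group flat observation rows into one document per patient.

    Assumption: observations are embedded in the patient document so that
    no join between indexes is needed at query time.
    """
    docs = defaultdict(lambda: {"observations": []})
    for row in observation_rows:
        doc = docs[row["patient_num"]]
        doc["patient_num"] = row["patient_num"]
        doc["observations"].append(
            {
                "concept_cd": row["concept_cd"],
                "start_date": row["start_date"],
                "nval_num": row.get("nval_num"),
            }
        )
    return list(docs.values())


def count_query_body(concept_cd, date_from, date_to):
    """Build a request body that matches patients having at least one
    observation with the given concept code inside the date range.

    Assumption: "observations" is mapped as a nested field, so both
    conditions must hold on the same observation.
    """
    return {
        "query": {
            "nested": {
                "path": "observations",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"observations.concept_cd": concept_cd}},
                            {
                                "range": {
                                    "observations.start_date": {
                                        "gte": date_from,
                                        "lte": date_to,
                                    }
                                }
                            },
                        ]
                    }
                },
            }
        }
    }


# Hypothetical sample rows, shaped like i2b2 observation_fact records.
rows = [
    {"patient_num": 1, "concept_cd": "ICD10:E11", "start_date": "2020-03-01"},
    {"patient_num": 1, "concept_cd": "LOINC:4548-4", "start_date": "2020-03-02", "nval_num": 7.9},
    {"patient_num": 2, "concept_cd": "ICD10:I10", "start_date": "2021-06-15"},
]

documents = build_patient_documents(rows)
body = count_query_body("ICD10:E11", "2020-01-01", "2020-12-31")
```

With this layout, a patient count becomes a single request against one index, at the cost of rewriting a patient's document whenever one of its observations changes. Whether the production system makes the same trade-off is not stated in the abstract.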
We demonstrate that implementing i2b2 with Elasticsearch is feasible and significantly improves query performance while reducing disk space usage. This implementation is currently in production at Bordeaux University Hospital.