Namli Tuncay, Anıl Sınacı Ali, Gönül Suat, Herguido Cristina Ruiz, Garcia-Canadilla Patricia, Muñoz Adriana Modrego, Esteve Arnau Valls, Ertürkmen Gökçe Banu Laleci
SRDC Software Research Development and Consultancy A. Ş., Ankara, Turkey.
Fundacio Sant Joan De Deu, Barcelona, Spain.
Front Med (Lausanne). 2024 Jul 30;11:1393123. doi: 10.3389/fmed.2024.1393123. eCollection 2024.
Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). A lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems, and it can lead to problems with reproducibility, debugging of AI models, bias and fairness, and regulatory compliance. We introduce a formal data preparation pipeline specification, with a focus on traceability, to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications.
We propose a declarative language for defining the extraction of AI-ready datasets from health data that adheres to a common data model, particularly data conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We use FHIR profiling to develop a common data model tailored to an AI use case, which enables the explicit declaration of the required information, such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health record data with irregularly sampled time series into a flat structure by defining a target population, feature groups, and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.
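The flattening step described above can be illustrated with a minimal sketch. It is not the authors' declarative language or implementation: the `Observation` and `Feature` classes, the feature names, the codes, and the sample values are all hypothetical, and plain Python objects stand in for FHIR resources. Each feature is declared as a code, a look-back window before a per-patient reference time, and an aggregation; the target population maps each patient to that reference time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical stand-in for a timestamped clinical observation."""
    patient_id: str
    code: str
    time: datetime
    value: float


@dataclass(frozen=True)
class Feature:
    """Declarative feature: which code, how far back, how to aggregate."""
    name: str
    code: str
    window: timedelta
    aggregate: Callable[[list[float]], float]


def extract_row(patient_id: str, reference_time: datetime,
                observations: list[Observation],
                features: list[Feature]) -> dict[str, Optional[float]]:
    """Build one flat row for one patient relative to a reference time."""
    row: dict = {"patient_id": patient_id}
    for f in features:
        start = reference_time - f.window
        matching = [o for o in observations
                    if o.patient_id == patient_id and o.code == f.code
                    and start <= o.time < reference_time]
        # Sort by time so that order-dependent aggregations (e.g. "last") work.
        values = [o.value for o in sorted(matching, key=lambda o: o.time)]
        row[f.name] = f.aggregate(values) if values else None
    return row


def build_dataset(population: dict[str, datetime],
                  observations: list[Observation],
                  features: list[Feature]) -> list[dict]:
    """One row per member of the target population."""
    return [extract_row(pid, ref, observations, features)
            for pid, ref in population.items()]


# Hypothetical example: features computed relative to the time of surgery.
surgery = datetime(2024, 1, 10, 8, 0)
observations = [
    Observation("p1", "heart-rate", surgery - timedelta(hours=30), 90.0),  # outside 24 h window
    Observation("p1", "heart-rate", surgery - timedelta(hours=5), 80.0),
    Observation("p1", "heart-rate", surgery - timedelta(hours=1), 100.0),
    Observation("p1", "creatinine", surgery - timedelta(days=2), 1.1),
    Observation("p2", "heart-rate", surgery - timedelta(hours=2), 70.0),
]
population = {"p1": surgery, "p2": surgery}
features = [
    Feature("hr_mean_24h", "heart-rate", timedelta(hours=24), mean),
    Feature("hr_max_24h", "heart-rate", timedelta(hours=24), max),
    Feature("creatinine_last_7d", "creatinine", timedelta(days=7), lambda v: v[-1]),
]
dataset = build_dataset(population, observations, features)
for row in dataset:
    print(row)
```

Because the features are plain data rather than ad hoc extraction code, the same declarations document exactly how each column of the flat dataset was derived, which is the traceability property the abstract emphasizes.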
We implement a scalable, high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets together with their metadata, including many statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing the feature values of individual entities during online prediction. We deployed and tested the proposed methodology and its implementation in three different research projects. We present the FHIR profiles developed as a common data model, together with the feature group definitions and feature definitions within a data preparation pipeline, used in training an AI model for "predicting complications after cardiac surgeries".
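The dual role of the feature repository (batch dataset production with metadata, and per-entity feature preparation at prediction time) can be sketched as follows. This is a single-process illustration under stated assumptions, not the distributed, fault-tolerant implementation the abstract describes: the `FEATURES` registry, the record fields, and the function names are hypothetical.

```python
from statistics import mean
from typing import Callable, Optional

FeatureFn = Callable[[dict], Optional[float]]

# Hypothetical feature definitions shared by the offline and online paths.
FEATURES: dict[str, FeatureFn] = {
    "age_years": lambda rec: rec.get("age_years"),
    "bmi": lambda rec: (rec["weight_kg"] / rec["height_m"] ** 2
                        if rec.get("weight_kg") and rec.get("height_m") else None),
}


def prepare_one(record: dict) -> dict:
    """Online path: compute feature values for a single entity."""
    return {name: fn(record) for name, fn in FEATURES.items()}


def prepare_batch(records: list[dict]) -> tuple[list[dict], dict]:
    """Offline path: compute the dataset plus per-feature statistics."""
    rows = [prepare_one(r) for r in records]
    stats = {}
    for name in FEATURES:
        present = [row[name] for row in rows if row[name] is not None]
        stats[name] = {
            "count": len(present),
            "missing": len(rows) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "mean": mean(present) if present else None,
        }
    return rows, stats


# Hypothetical records; the second one lacks a height, so its BMI is missing.
records = [
    {"age_years": 4, "weight_kg": 16.0, "height_m": 1.0},
    {"age_years": 6, "weight_kg": 20.0},
]
rows, stats = prepare_batch(records)
online_row = prepare_one(records[0])
print(stats)
```

Routing both paths through the same definitions means the values fed to a trained model at prediction time are computed exactly as they were for the training dataset.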
The implementation across various pilot use cases demonstrates that our framework has the breadth and flexibility needed to define a diverse array of features, each tailored to specific temporal and contextual criteria.