Abrahão Maria Tereza Fernandes, Nobre Moacyr Roberto Cuce, Gutierrez Marco Antonio
Program in Cardiology, Heart Institute (InCor) Clinical Hospital, Faculty of Medicine, University of Sao Paulo, Sao Paulo, Brazil.
Clinical Epidemiology Team, Heart Institute (InCor) Clinical Hospital, Faculty of Medicine, University of Sao Paulo, Sao Paulo, Brazil.
Int J Med Inform. 2017 Jun;102:138-149. doi: 10.1016/j.ijmedinf.2017.03.015. Epub 2017 Mar 30.
An electronic healthcare record (EHR) system, when used by healthcare providers, improves the quality of care for patients and helps to lower costs. Information collected from manual or electronic health records can also be used for purposes not directly related to patient care delivery, in which case it is termed secondary use. EHR systems facilitate the collection of this secondary use data, which can be used for research purposes such as observational studies, taking advantage of improvements in the structuring and retrieval of patient information. However, several problems are common when conducting research with this kind of data: (i) over time, systems and data storage methods become obsolete; (ii) data concerns arise because the data is being used in a context removed from its original intention; (iii) there are privacy concerns when sharing data about individual subjects; (iv) the partial availability of standard medical vocabularies and natural language processing tools for non-English languages limits information extraction from structured and unstructured data in EHR systems. A systematic approach is therefore needed to overcome these problems, in which local data processing is performed prior to data sharing.
This study describes a local processing method to extract cohorts of patients for observational studies in four steps: (1) data reorganization from an existing local logical schema into a common external schema over which information can be extracted; (2) cleaning of data, generation of the database profile and retrieval of indicators; (3) computation of derived variables from original variables; (4) application of study design parameters to transform longitudinal data into anonymized data sets ready for statistical analysis and sharing. Mapping from the local logical schema into a common external schema must be performed differently for each EHR and is not the subject of this work, but steps 2, 3 and 4 are common to all EHRs. The external schema accepts parameters that facilitate the extraction of different cohorts for different studies without having to change the extraction algorithms, and ensures that, given an immutable data set, extraction is an idempotent process. Statistical analysis is part of the process to generate the results necessary for inclusion in reports. The generation of indicators to describe the database allows description of its characteristics, highlighting study results. The set of extraction and statistical processing steps is available in a version-controlled repository and can be used at any time to reproduce results, allowing the verification of alterations and error corrections. This methodology promotes the development of reproducible studies and allows potential research problems to be traced to the extraction algorithms and statistical methods.
RESULTS: This method was applied to an admissions database, SI, from the InCor-HCFMUSP, a tertiary referral hospital for cardiovascular disease in the city of São Paulo, as a source of secondary data with 1,116,848 patient records from 1999 to 2013. The cleaning process resulted in 313,894 patient records, and cohort selection yielded 27,698 patients under the following criteria: study period: 2003-2013; gender: male, female; age: ≥18 years; at least 2 outpatient encounters; diagnosis of cardiovascular disease (ICD-10 codes: I20-I25, I64-I70 and G45). An R script provided descriptive statistics of the extracted cohort.
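The parameterized selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`patient_id`, `gender`, `birth_year`, `encounters`), the names `StudyParams`, `icd10_in_ranges` and `extract_cohort`, and the choice to compute age at the first encounter within the study period are all assumptions made for illustration. The default parameter values mirror the criteria reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyParams:
    """Hypothetical study-design parameters; defaults follow the reported criteria."""
    start_year: int = 2003
    end_year: int = 2013
    genders: frozenset = frozenset({"M", "F"})
    min_age: int = 18
    min_outpatient_encounters: int = 2
    # Inclusive ranges of three-character ICD-10 categories.
    icd10_ranges: tuple = (("I20", "I25"), ("I64", "I70"), ("G45", "G45"))


def icd10_in_ranges(code, ranges):
    """Return True if the code's three-character category lies in any range."""
    category = code.strip().upper()[:3]
    if len(category) < 3:
        return False
    return any(low <= category <= high for low, high in ranges)


def extract_cohort(patients, params):
    """Apply the study parameters to longitudinal records.

    Input is never mutated and patients are processed in sorted order, so
    the same data and parameters always give the same output.
    """
    cohort = []
    for p in sorted(patients, key=lambda rec: rec["patient_id"]):
        if p["gender"] not in params.genders:
            continue
        in_period = [e for e in p["encounters"]
                     if params.start_year <= e["year"] <= params.end_year]
        outpatient = [e for e in in_period if e["type"] == "outpatient"]
        if len(outpatient) < params.min_outpatient_encounters:
            continue
        if not any(icd10_in_ranges(e["icd10"], params.icd10_ranges)
                   for e in in_period):
            continue
        # Assumption: age is taken at the first encounter in the study period.
        age = min(e["year"] for e in in_period) - p["birth_year"]
        if age < params.min_age:
            continue
        # The original identifier is dropped; only a sequential study id is kept.
        cohort.append({"study_id": len(cohort) + 1, "gender": p["gender"],
                       "age": age, "n_outpatient": len(outpatient)})
    return cohort
```

Under these assumptions, a different study design would be expressed by constructing another `StudyParams` instance (for example, a different period or set of ICD-10 ranges) and calling `extract_cohort` again, leaving the extraction code unchanged. Replacing the identifier with a sequential number is only a placeholder for the anonymization step and would not be sufficient on its own for real data sharing.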
This method guarantees a reproducible cohort extraction for the use of secondary data in observational studies, with enough parameterization to support different study designs, and can be used on diverse data sources. Moreover, it allows observational electronic health record cohort research to be performed in a non-English language with limited internationally recognized medical vocabulary.