Pate Alexander, Parisi Rosa, Kontopantelis Evangelos, Sperrin Matthew
Division of Informatics, Imaging and Data Sciences, University of Manchester.
PLoS One. 2025 Aug 19;20(8):e0327229. doi: 10.1371/journal.pone.0327229. eCollection 2025.
The Clinical Practice Research Datalink (CPRD) is a large and widely used resource of electronic health records from the UK, linking primary care data to hospital data, death registration data, cancer registry data, deprivation data and mental health services data. Extracting and managing CPRD data is computationally demanding and requires a significant amount of work, particularly when using R. The rcprd package simplifies the extraction and processing of CPRD data to build datasets ready for statistical analysis. Raw CPRD data is provided in thousands of .txt files, which makes querying it cumbersome and inefficient. rcprd saves the relevant information into an SQLite database stored on the hard drive, which can then be queried efficiently to extract the required information about individuals. rcprd follows a four-stage process: 1) define a cohort; 2) read in medical/prescription data and save it into an SQLite database; 3) query this SQLite database for specific codes and tests to create variables for each individual in the cohort; 4) combine the extracted variables into a dataset ready for statistical analysis. Functions are available to extract common variable types (e.g., history of a condition, or time until an event occurs, relative to an index date), alongside more general functions for database queries that allow users to define their own variables for extraction. The entire process can be carried out from within R, with no knowledge of SQL required. This manuscript showcases the functionality of rcprd by working through an example using simulated CPRD Aurum data. rcprd will reduce the duplication of time and effort among those using CPRD data for research, allowing more time to be spent on other aspects of research projects.
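The four-stage workflow described above can be illustrated with a minimal sketch. This is not rcprd and does not use its API: it is a Python stand-in, using only the standard library, for the general pattern of loading many delimited text files into an on-disk SQLite database and then deriving per-patient variables from it. The file names, the three-column schema (`patid`, `medcodeid`, `obsdate`), the codes `A100` and `B200`, and all function names are hypothetical and chosen purely for illustration; real CPRD Aurum files contain many more fields.

```python
import csv
import sqlite3
import tempfile
from datetime import date
from pathlib import Path

# Stage 1: define a cohort (hypothetical patient ids mapped to index dates).
COHORT = {"1": date(2020, 1, 1), "2": date(2020, 1, 1), "3": date(2020, 1, 1)}

# Simulated raw data: tab-delimited .txt files with a simplified, made-up schema.
RAW_FILES = {
    "observation_001.txt": [
        ("1", "A100", "2018-05-01"),
        ("2", "B200", "2019-03-10"),
        ("9", "A100", "2017-01-01"),  # patient 9 is not in the cohort
    ],
    "observation_002.txt": [
        ("1", "B200", "2020-03-01"),
        ("3", "A100", "2021-06-15"),
        ("3", "B200", "2020-01-31"),
    ],
}


def write_raw_files(folder):
    """Write the simulated raw files to disk, one .txt file per entry."""
    for name, rows in RAW_FILES.items():
        with (folder / name).open("w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t")
            writer.writerow(["patid", "medcodeid", "obsdate"])
            writer.writerows(rows)


def build_database(folder, db_path, cohort):
    """Stage 2: read every raw file and store cohort rows in one SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE observation (patid TEXT, medcodeid TEXT, obsdate TEXT)"
    )
    for path in sorted(folder.glob("*.txt")):
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            rows = [
                (r["patid"], r["medcodeid"], r["obsdate"])
                for r in reader
                if r["patid"] in cohort  # keep only cohort members
            ]
        con.executemany("INSERT INTO observation VALUES (?, ?, ?)", rows)
    # An index on the code column keeps code look-ups fast on large tables.
    con.execute("CREATE INDEX idx_obs_code ON observation (medcodeid)")
    con.commit()
    return con


def query_codes(con, codes):
    """Stage 3a: return (patid, date) for every record matching the code list."""
    placeholders = ",".join("?" for _ in codes)
    sql = (
        "SELECT patid, obsdate FROM observation "
        f"WHERE medcodeid IN ({placeholders})"
    )
    return [(p, date.fromisoformat(d)) for p, d in con.execute(sql, list(codes))]


def history_of(con, cohort, codes):
    """Stage 3b: 1 if the patient has a matching record on or before index date."""
    records = query_codes(con, codes)
    return {
        p: int(any(q == p and d <= idx for q, d in records))
        for p, idx in cohort.items()
    }


def time_until(con, cohort, codes):
    """Stage 3c: days from index date to the first later record, else None."""
    records = query_codes(con, codes)
    result = {}
    for p, idx in cohort.items():
        later = [(d - idx).days for q, d in records if q == p and d > idx]
        result[p] = min(later) if later else None
    return result


def build_dataset():
    """Stage 4: combine the extracted variables into one row per patient."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        write_raw_files(folder)
        con = build_database(folder, folder / "cprd.sqlite", COHORT)
        try:
            hist = history_of(con, COHORT, ["A100"])
            tte = time_until(con, COHORT, ["B200"])
        finally:
            con.close()
    return [
        {
            "patid": p,
            "index_date": idx.isoformat(),
            "history_A100": hist[p],
            "days_to_B200": tte[p],
        }
        for p, idx in COHORT.items()
    ]


dataset = build_dataset()
```

In this sketch the raw files are read once and filtered to the cohort on the way in, so every later variable definition is a single indexed query rather than another pass over thousands of text files. That is the efficiency argument the abstract makes for storing the data in SQLite.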