Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 1620 Tremont St Suite 303, Boston, MA, 02120, USA.
Division of Epidemiology, Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT, USA.
Drug Saf. 2019 Nov;42(11):1297-1309. doi: 10.1007/s40264-019-00851-0.
Research that makes secondary use of administrative and clinical healthcare databases is increasingly influential in regulatory, reimbursement, and other healthcare decision-making. Consequently, there are numerous guidance documents on reporting for studies that use 'real-world' data captured in administrative claims and electronic health record (EHR) databases. These guidance documents are intended to improve transparency, reproducibility, and the ability to evaluate the validity and relevance of design and analysis decisions. However, existing guidance does not differentiate between structured and unstructured information contained in EHRs, registries, or other healthcare data sources. While unstructured text is convenient and readily interpretable in clinical practice, it can be difficult to use for investigating causal questions, such as comparative effectiveness and safety, until the data have been cleaned and algorithms have been applied to extract relevant information into structured fields for analysis. The goal of this paper is to increase transparency for healthcare decision makers and causal inference researchers by providing general recommendations for reporting on the steps taken to make unstructured text-based data usable for comparative effectiveness and safety research. These recommendations are designed to be used as an adjunct to existing reporting guidance. They are intended to provide sufficient context and supporting information for causal inference studies that use natural language processing- or machine learning-derived data fields, so that researchers, reviewers, and decision makers can be confident in their ability to evaluate the validity and relevance of derived measures for exposures, inclusion/exclusion criteria, covariates, and outcomes for the causal question of interest.