BMC Med Res Methodol. 2013 Aug 21;13:105. doi: 10.1186/1471-2288-13-105.
Primary care databases are a major source of data for epidemiological and health services research. However, most studies are based on coded information, ignoring information stored in free text. Using the early presentation of rheumatoid arthritis (RA) as an exemplar, our objective was to estimate the extent of data hidden within free text, using a keyword search.
We examined the electronic health records (EHRs) of 6,387 patients from the UK, aged 30 years and older, with a first coded diagnosis of RA between 2005 and 2008. We listed indicators for RA which were present in coded format and ran keyword searches for similar information held in free text. We compared the frequencies of indicator code groups and keywords from one year before to 14 days after RA diagnosis, and examined their temporal relationships.
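A windowed keyword search of this kind can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the keyword list, the `keyword_hits` function, and the sample entries are all hypothetical. Only the window (one year before to 14 days after the coded diagnosis) comes from the text above.

```python
import re
from datetime import date, timedelta

# Hypothetical keyword list; the study's actual search terms are not given here.
KEYWORDS = ["rheumatoid arthritis", "inflammatory arthritis", "synovitis"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def keyword_hits(entries, diagnosis_date, days_before=365, days_after=14):
    """Return (entry_date, keyword) pairs for free-text entries in the window.

    `entries` is an iterable of (entry_date, free_text) tuples for one patient.
    The default window runs from 365 days before to 14 days after the
    coded diagnosis date, both ends inclusive.
    """
    start = diagnosis_date - timedelta(days=days_before)
    end = diagnosis_date + timedelta(days=days_after)
    hits = []
    for entry_date, text in entries:
        if start <= entry_date <= end:
            for match in PATTERN.finditer(text):
                hits.append((entry_date, match.group(0).lower()))
    return hits


# Made-up records for a single patient, for illustration only.
entries = [
    (date(2006, 1, 10), "Letter from rheumatology: probable Synovitis of MCP joints."),
    (date(2004, 5, 2), "Query rheumatoid arthritis."),  # falls outside the window
    (date(2006, 3, 20), "Knee pain, no swelling."),     # in window, no keyword
]
hits = keyword_hits(entries, diagnosis_date=date(2006, 3, 15))
```

Plain keyword matching of this sort does not recognise negation or uncertainty (for example "no synovitis"), so a real analysis would need additional handling of such phrases.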
One or more keywords for RA were found in the free text of 29% of patients prior to the RA diagnostic code. Keywords for inflammatory arthritis diagnoses were present for 14% of patients, whereas only 11% had a diagnostic code. Codes for synovitis were found in 3% of patients, but keywords were identified in an additional 17%. In 13% of patients, evidence of a positive rheumatoid factor test was recorded in text only, without a code. No gender differences were found. Keywords generally occurred close in time to the coded diagnosis of rheumatoid arthritis. They were often found under codes indicating letters and communications.
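The "coded" versus "additional, text only" percentages reported above can be computed along the following lines. This is a sketch under assumptions: the `coded_and_text_only` function and its `has_code` / `has_keyword` fields are hypothetical names, and the ten-patient cohort is toy data, not study data.

```python
def coded_and_text_only(patients):
    """Return (percent with a code, percent with a keyword but no code).

    `patients` is a list of dicts with boolean fields `has_code` and
    `has_keyword` for a single indicator (for example synovitis).
    """
    n = len(patients)
    if n == 0:
        raise ValueError("patients must not be empty")
    coded = sum(1 for p in patients if p["has_code"])
    text_only = sum(1 for p in patients if p["has_keyword"] and not p["has_code"])
    return 100 * coded / n, 100 * text_only / n


# Toy cohort of 10 patients (made up): 1 coded, 2 with keyword only, 7 with neither.
cohort = (
    [{"has_code": True, "has_keyword": True}]
    + [{"has_code": False, "has_keyword": True}] * 2
    + [{"has_code": False, "has_keyword": False}] * 7
)
pct_coded, pct_text_only = coded_and_text_only(cohort)
```

Counting a patient as "text only" when a keyword is present and the code is absent means the two percentages never overlap, so their sum gives the share of patients with any evidence of the indicator.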
Potential cases may be missed or wrongly dated when coded data alone are used to identify patients with RA, as diagnostic suspicions are frequently confined to text. The use of EHRs to create disease registers or assess quality of care will be misleading if free text information is not taken into account. Methods to facilitate the automated processing of text need to be developed and implemented.