Institute of Primary Care, University and University Hospital Zurich, Pestalozzistr. 24, Zürich, 8091, Switzerland.
BMC Prim Care. 2024 Jul 16;25(1):257. doi: 10.1186/s12875-024-02514-1.
Diagnoses entered by general practitioners into electronic medical records have great potential for research and practice, but they are often recorded as uncoded free text, which limits their usefulness. Natural language processing (NLP) could assist in coding free-text diagnoses, but NLP models require local training data to unlock their potential. The aim of this study was to develop a framework of research-relevant diagnostic codes, to test the framework using free-text diagnoses from a Swiss primary care database and to generate training data for NLP modelling.
The framework of diagnostic codes was developed based on input from local stakeholders and on epidemiological data. After pre-testing, the framework contained 105 diagnostic codes. Two raters then independently applied these codes to randomly drawn lines of free text (LoFT) from diagnosis lists extracted from the electronic medical records of 3000 patients of 27 general practitioners. Coding frequency and mean occurrence rates were reported as n and %, and inter-rater reliability (IRR) of coding was calculated using Cohen's kappa (Κ).
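As an illustration of the reliability measure named above, the following is a minimal sketch of Cohen's kappa for two raters. It assumes that agreement on a single diagnostic code is treated as a binary decision per LoFT (1 = code assigned, 0 = not assigned); that setup, the function name `cohens_kappa` and the example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance from each rater's marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions,
    # summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally undefined, and returning 1.0 is a convention chosen here.
        return 1.0

    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings for one code over ten LoFT (1 = code assigned).
rater_1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
rater_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(rater_1, rater_2)
```

In this invented example the raters agree on 8 of 10 lines (p_o = 0.80) and chance agreement is 0.58, so kappa is 0.22 / 0.42, roughly 0.52.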
The sample consisted of 26,980 LoFT, and for 56.3% of these no code could be assigned because the text did not state a specific diagnosis. The most common diagnostic codes were 'dorsopathies' (3.9%; a code covering all types of back problems, including non-specific lower back pain, scoliosis, and others) and 'other diseases of the circulatory system' (3.1%). Raters were in almost perfect agreement (Κ ≥ 0.81) for 69 of the 105 diagnostic codes, and 28 codes showed substantial agreement (Κ between 0.61 and 0.80). Both high coding frequency and almost perfect agreement were found for 37 codes, including codes that are particularly difficult to identify from components of the electronic medical record, such as musculoskeletal conditions, cancer or tobacco use.
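The agreement bands quoted above can be applied to per-code kappa values with a small helper. This is only a sketch: the function name `agreement_band`, the rounding to two decimals (so that values between 0.80 and 0.81 fall into a defined band) and the example kappa values are assumptions for illustration, and bands below 0.61 are not described in the abstract, so they are grouped under one label here.

```python
def agreement_band(kappa):
    """Map a kappa value to the agreement bands quoted in the abstract."""
    # Assumption: the thresholds are given to two decimals, so the value
    # is rounded to two decimals before it is compared.
    k = round(kappa, 2)
    if k >= 0.81:
        return "almost perfect"
    if k >= 0.61:
        return "substantial"
    # The abstract does not name the lower bands.
    return "below substantial"


# Hypothetical per-code kappa values, not results from the study.
example_kappas = {"code_a": 0.92, "code_b": 0.70, "code_c": 0.45}
bands = {code: agreement_band(k) for code, k in example_kappas.items()}
```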
The coding framework was characterised by a subset of very frequent and highly reliable diagnostic codes, which will be the most valuable targets for training NLP models for automated disease classification based on free-text diagnoses from Swiss general practice.