Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study.

Author information

Houston Andrew, Williams Sophie, Ricketts William, Gutteridge Charles, Tackaberry Chris, Conibear John

Affiliations

Barts Life Sciences, Barts Health NHS Trust, London, UK.

Digital Environment Research Institute, Queen Mary University of London, London, UK.

Publication information

BMC Med Inform Decis Mak. 2024 Dec 4;24(1):371. doi: 10.1186/s12911-024-02790-y.

Abstract

BACKGROUND

The digitisation of healthcare records has generated vast amounts of unstructured data, presenting opportunities to improve disease diagnosis where clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach that uses natural language processing to extract clinical concepts from free text, which are then used to automatically form diagnostic criteria for lung cancer from unstructured secondary-care data.

METHODS

Patients aged 40 and above who underwent a chest x-ray (CXR) between 2016 and 2022 were included. ICD-10 codes and unstructured data covering the 12 months preceding the CXR were pulled from their electronic health records (EHRs). The unstructured data were processed using named entity recognition to extract symptoms, which were mapped to SNOMED-CT codes. Subsumption of features up the SNOMED-CT hierarchy was used to mitigate sparse features, and a frequency-based criterion, combined with univariate logarithmic probabilities, was applied to select candidate features to take forward to the model development phase. A genetic algorithm was employed to identify the most discriminating features to form the diagnostic criteria.
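
The subsumption and candidate-selection steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy hierarchy `PARENT`, the function names, the `min_count` threshold and the add-`alpha` smoothing are all assumptions made for illustration, and the concept labels are plain strings rather than real SNOMED-CT codes. The genetic algorithm stage is not covered.

```python
import math
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent); not real SNOMED-CT content.
PARENT = {
    "pneumonia": "disorder of lung",
    "bronchitis": "disorder of lung",
    "productive cough": "cough",
}


def subsume(concepts, parent=PARENT):
    """Return the given concepts plus all of their ancestors."""
    expanded = set()
    for concept in concepts:
        # Walk up the hierarchy until the root or an already-seen concept.
        while concept is not None and concept not in expanded:
            expanded.add(concept)
            concept = parent.get(concept)
    return expanded


def log_prob_ratio(patients, labels, min_count=2, alpha=1.0):
    """Score each sufficiently frequent feature by
    log(P(feature | cancer) / P(feature | no cancer)).

    min_count is an assumed frequency threshold; alpha is assumed
    add-alpha smoothing so that zero counts do not give log(0).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos, neg = defaultdict(int), defaultdict(int)
    for concepts, label in zip(patients, labels):
        for feature in subsume(concepts):
            if label:
                pos[feature] += 1
            else:
                neg[feature] += 1

    scores = {}
    for feature in set(pos) | set(neg):
        if pos[feature] + neg[feature] < min_count:
            continue  # too sparse even after subsumption
        p_pos = (pos[feature] + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg[feature] + alpha) / (n_neg + 2 * alpha)
        scores[feature] = math.log(p_pos / p_neg)
    return scores


# Four made-up patients: 1 = lung cancer within 12 months, 0 = no diagnosis.
patients = [{"pneumonia"}, {"bronchitis", "productive cough"},
            {"productive cough"}, {"back pain"}]
labels = [1, 1, 0, 0]
scores = log_prob_ratio(patients, labels)
```

In this toy data "pneumonia" and "bronchitis" each occur only once and are dropped by the frequency filter, but both roll up into "disorder of lung", which occurs twice and receives a positive score. That is the sense in which subsumption mitigates sparsity.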

RESULTS

A total of 75,002 patients were included, with 1,012 lung cancer diagnoses made within 12 months of the CXR. The best-performing model achieved an AUROC of 0.72. Results showed that an existing 'disorder of the lung', such as pneumonia, and a 'cough' increased the probability of a lung cancer diagnosis. 'Anomalies of great vessel', 'disorder of the retroperitoneal compartment' and 'context-dependent findings', such as pain, statistically reduced the risk of lung cancer, making other diagnoses more likely. When compared with existing cancer risk scores, the developed model demonstrated superior performance.
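
For readers less familiar with the metric, AUROC is the probability that a randomly chosen patient with the diagnosis receives a higher model score than a randomly chosen patient without it, with ties counted as one half. The short Python sketch below computes it directly from that definition. The function name and the example scores are illustrative assumptions, not the study's evaluation code or data.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: three of the four positive/negative pairs are ordered correctly.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of patients, so it suits small illustrations only; a rank-based formulation gives the same value more efficiently on a large cohort.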

CONCLUSIONS

The proposed methods demonstrated success in leveraging unstructured secondary-care data to derive diagnostic criteria for lung cancer, outperforming existing risk tools. These advancements show potential for enhancing patient care and outcomes. However, it is essential to address specific limitations by integrating primary care data to ensure a more thorough and unbiased development of diagnostic criteria. Moreover, the study highlights the importance of contextualising SNOMED-CT concepts into meaningful terminology that resonates with clinicians, facilitating a clearer and more tangible understanding of the criteria applied.
