De Novo Natural Language Processing Algorithm Accurately Identifies Myxofibrosarcoma From Pathology Reports.

Author information

Lindsay Sarah E, Madison Cecelia J, Ramsey Duncan C, Doung Yee-Cheen, Gundle Kenneth R

Affiliations

Department of Orthopaedics and Rehabilitation, Oregon Health & Science University, Portland, OR, USA.

Portland VA Medical Center, Portland, OR, USA.

Publication information

Clin Orthop Relat Res. 2025 Jan 1;483(1):80-87. doi: 10.1097/CORR.0000000000003270. Epub 2024 Oct 2.

Abstract

BACKGROUND

Available codes in the ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, and this can result in an underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because of the availability of all clinical results and pathology reports. In the setting of soft tissue sarcoma, natural language processing (NLP) has the potential to be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a myriad of research questions.

QUESTIONS/PURPOSES

(1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?

METHODS

All pathology reports (10.7 million) in the national VA Corporate Data Warehouse from 2003 to 2022 were identified. A word search found that reports from 403 veterans contained the term "myxofibrosarcoma." These reports were manually reviewed to develop a gold-standard cohort of 300 veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.
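To make the idea of confounder, negation, and emphasis terms concrete, the following is a minimal rule-based sketch in Python. It is not the published algorithm: the cue lists, the sentence-splitting rule, and the function name `classify_report` are all hypothetical and chosen only for illustration.

```python
import re

# All term lists below are hypothetical examples, not the study's actual lists.
TARGET = re.compile(r"\bmyxofibrosarcoma\b")
NEGATION = ("no evidence of", "negative for", "not consistent with")
CONFOUNDERS = ("differential includes", "rule out", "cannot exclude")
EMPHASIS = ("diagnosis:", "consistent with", "confirmed")


def classify_report(text: str) -> bool:
    """Return True if some sentence appears to affirm the target diagnosis.

    Each sentence that mentions the target term is checked for cue phrases
    that occur before the mention. Negation is checked first so that
    "not consistent with" is not mistaken for the emphasis cue
    "consistent with".
    """
    for sentence in re.split(r"[.;\n]+", text.lower()):
        match = TARGET.search(sentence)
        if match is None:
            continue  # sentence does not mention the target term
        before = sentence[:match.start()]
        if any(cue in before for cue in NEGATION):
            continue  # explicitly negated mention
        if any(cue in before for cue in EMPHASIS):
            return True  # affirmed mention
        if any(cue in before for cue in CONFOUNDERS):
            continue  # hedged or differential mention
        return True  # plain mention with no cue: treated as positive
    return False


if __name__ == "__main__":
    print(classify_report("Final diagnosis: myxofibrosarcoma, high grade."))
    print(classify_report("No evidence of myxofibrosarcoma."))
```

A plain word search would flag both example reports; the cue checks are what separate an affirmed diagnosis from a negated or hedged mention. Real pathology text would likely need longer cue lists and handling of cues that follow the term.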

RESULTS

The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm identified patients with myxofibrosarcoma more accurately (92% [276 of 300]) than ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.
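The five metrics named in the methods all follow from the counts of a 2x2 confusion matrix. The sketch below shows the standard formulas in Python and rechecks the percentages reported above from their stated numerators and denominators. The abstract does not report a full confusion matrix, so the counts passed to `diagnostic_metrics` are hypothetical, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard diagnostic metrics; assumes every denominator is nonzero."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }


def percent(numerator: int, denominator: int) -> int:
    """Proportion rounded to a whole percentage."""
    return round(100 * numerator / denominator)


if __name__ == "__main__":
    # Proportions as reported in the abstract.
    print(percent(81, 300), percent(276, 300), percent(219, 300), percent(300, 403))

    # Hypothetical counts, used only to exercise the formulas.
    example = ConfusionCounts(tp=90, fp=10, fn=5, tn=95)
    for name, value in diagnostic_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Note that the word-search figure (300 of 403) divides confirmed cases by all flagged patients, which has the same form as the PPV formula above.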

CONCLUSION

An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, which is an improvement over ICD-based cohort creation and simple word search. The algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) to facilitate external validation and improvement through testing in other cohorts.

LEVEL OF EVIDENCE

Level II, diagnostic study.
