Comparison of Machine-Learning Algorithms for the Prediction of Current Procedural Terminology (CPT) Codes from Pathology Reports.

Author Information

Levy Joshua, Vattikonda Nishitha, Haudenschild Christian, Christensen Brock, Vaickus Louis

Affiliations

Emerging Diagnostic and Investigative Technologies, Clinical Genomics and Advanced Technologies, Department of Pathology and Laboratory Medicine, Dartmouth Hitchcock Medical Center, Lebanon, New Hampshire, USA.

Department of Epidemiology, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire, USA.

Publication Information

J Pathol Inform. 2022 Jan 5;13:3. doi: 10.4103/jpi.jpi_52_21. eCollection 2022.

Abstract

BACKGROUND

Pathology reports serve as an auditable trail of a patient's clinical narrative, containing text pertaining to diagnosis, prognosis, and specimen processing. Recent works have utilized natural language processing (NLP) pipelines, which include rule-based or machine-learning analytics, to uncover textual patterns that inform clinical endpoints and biomarker information. Although deep learning methods have come to the forefront of NLP, there have been limited comparisons with the performance of other machine-learning methods in extracting key insights for the prediction of medical procedure information, which is used to inform reimbursement for pathology departments. In addition, the utility of combining and ranking information from multiple report subfields, as compared with exclusively using the diagnostic field, for the prediction of Current Procedural Terminology (CPT) codes and signing pathologists remains unclear.

METHODS

After preprocessing pathology reports, we utilized advanced topic modeling to identify topics that characterize a cohort of 93,039 pathology reports at the Dartmouth-Hitchcock Department of Pathology and Laboratory Medicine (DPLM). We separately compared XGBoost, SVM, and BERT (Bidirectional Encoder Representations from Transformers) methodologies for the prediction of primary CPT codes (CPT 88302, 88304, 88305, 88307, 88309) as well as 38 ancillary CPT codes, using both the diagnostic text alone and text from all subfields. We performed similar analyses for characterizing text from a group of the 20 pathologists with the most pathology report sign-outs. Finally, we uncovered important report subcomponents by using model explanation techniques.
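
The abstract does not include implementation details, but the comparison described above can be illustrated with a minimal sketch. The code below assumes a hypothetical CSV export with "diagnosis_text" and "cpt_code" columns, uses TF-IDF features with a 20-topic LDA model standing in for the study's (unspecified) topic modeling step, and compares XGBoost against a linear SVM for primary CPT code prediction; the file name, column names, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch, not the authors' implementation: TF-IDF features from report
# text, a 20-topic LDA model standing in for the study's topic modeling step,
# and an XGBoost vs. linear-SVM comparison for primary CPT code prediction.
# File name, column names, and hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

reports = pd.read_csv("pathology_reports.csv")   # hypothetical export of report text + billed CPT code
X_text, y = reports["diagnosis_text"], reports["cpt_code"]

le = LabelEncoder()                              # map CPT codes (e.g., 88302-88309) to 0..K-1 class labels
labels = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(
    X_text, labels, test_size=0.2, stratify=labels, random_state=0)

# Bag-of-words / TF-IDF representation of the preprocessed report text.
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Corpus-level topic summary (LDA here is a generic stand-in for the study's topic model).
lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(X_train_vec)

# Compare gradient-boosted trees and a linear SVM on identical features.
models = {
    "xgboost": XGBClassifier(n_estimators=300, max_depth=6,
                             learning_rate=0.1, eval_metric="mlogloss"),
    "svm": LinearSVC(C=1.0),
}
for name, clf in models.items():
    clf.fit(X_train_vec, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test_vec)))
```

A comparable BERT baseline would fine-tune a pretrained transformer on the same train/test split of the report text; that step is omitted here to keep the sketch short.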

RESULTS

We identified 20 topics that pertained to diagnostic and procedural information. Operating on diagnostic text alone, BERT outperformed XGBoost for the prediction of primary CPT codes. When utilizing all report subfields, XGBoost outperformed BERT for the prediction of primary CPT codes. Utilizing additional subfields of the pathology report increased prediction accuracy across ancillary CPT codes, and the performance gains from using additional report subfields were largest for the XGBoost model on primary CPT codes. Misclassifications of CPT codes occurred between codes of similar complexity, and misclassifications between pathologists were subspecialty related.
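
The observation that errors cluster between CPT codes of similar complexity can be checked by inspecting a model's confusion matrix. The illustrative snippet below continues from the sketch after the Methods section (reusing its fitted `models`, `X_test_vec`, `y_test`, and label encoder `le`) and lists the most frequent misclassified code pairs; it is not the authors' analysis code.

```python
# Illustrative continuation of the sketch above (reuses `models`, `X_test_vec`,
# `y_test`, and the label encoder `le`); not the authors' analysis code.
from sklearn.metrics import confusion_matrix

codes = le.classes_                                  # original CPT code labels, in encoded order
pred = models["xgboost"].predict(X_test_vec)         # predictions from the boosted-tree model
cm = confusion_matrix(y_test, pred, labels=range(len(codes)))

# Report the most frequent off-diagonal (misclassified) code pairs, e.g. 88305 vs. 88307.
off_diag = [(codes[i], codes[j], cm[i, j])
            for i in range(len(codes)) for j in range(len(codes))
            if i != j and cm[i, j] > 0]
for true_code, pred_code, count in sorted(off_diag, key=lambda t: -t[2])[:10]:
    print(f"{true_code} predicted as {pred_code}: {count} reports")
```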

CONCLUSIONS

Our approach generated CPT code predictions with an accuracy higher than previously reported. Although diagnostic text is an important source of information, additional insights may be extracted from other report subfields. Although BERT approaches performed comparably to the XGBoost approaches, they may lend valuable information to pipelines that combine image, text, and -omics information. Future resource-saving opportunities exist to help hospitals detect mis-billing, standardize report text, and estimate productivity metrics that pertain to pathologist compensation (relative value units, RVUs).

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55a9/8802304/55b39a4afc4f/gr1.jpg
