Data-Centric AI for Healthcare Fraud Detection.

Authors

Johnson Justin M, Khoshgoftaar Taghi M

Affiliation

Florida Atlantic University, Boca Raton, FL USA.

Publication details

SN Comput Sci. 2023;4(4):389. doi: 10.1007/s42979-023-01809-x. Epub 2023 May 11.

DOI:10.1007/s42979-023-01809-x

PMID:37200563

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10173919/

Abstract

Automated methods for detecting fraudulent healthcare providers have the potential to save billions of dollars in healthcare costs and improve the overall quality of patient care. This study presents a data-centric approach to improve healthcare fraud classification performance and reliability using Medicare claims data. Publicly available data from the Centers for Medicare & Medicaid Services (CMS) are used to construct nine large-scale labeled data sets for supervised learning. First, we leverage CMS data to curate the 2013-2019 Part B, Part D, and Durable Medical Equipment, Prosthetics, Orthotics, and Supplies (DMEPOS) Medicare fraud classification data sets. We provide a review of each data set and data preparation techniques to create Medicare data sets for supervised learning and we propose an improved data labeling process. Next, we enrich the original Medicare fraud data sets with up to 58 new provider summary features. Finally, we address a common model evaluation pitfall and propose an adjusted cross-validation technique that mitigates target leakage to provide reliable evaluation results. Each data set is evaluated on the Medicare fraud classification task using extreme gradient boosting and random forest learners, multiple complementary performance metrics, and 95% confidence intervals. Results show that the new enriched data sets consistently outperform the original Medicare data sets that are currently used in related works. Our results encourage the data-centric machine learning workflow and provide a strong foundation for data understanding and preparation techniques for machine learning applications in healthcare fraud.
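The abstract does not spell out how the adjusted cross-validation works. As an illustration only, one common way to limit leakage when the same provider contributes several rows (for example, one row per year) is to assign whole providers to folds, so that no provider appears in both the training and the test partition. The sketch below assumes that setup; the function name `group_k_fold` and the toy provider IDs are hypothetical and are not taken from the paper.

```python
def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    Assumption: `groups` holds one group label (e.g. a provider ID) per row.
    """
    unique = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")
    # Assign whole groups to folds round-robin, so every row of a given
    # provider lands in the same fold.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Hypothetical toy data: each entry is the provider ID of one row.
provider_ids = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]

splits = list(group_k_fold(provider_ids, n_splits=3))
for train_idx, test_idx in splits:
    train_groups = {provider_ids[i] for i in train_idx}
    test_groups = {provider_ids[i] for i in test_idx}
    # No provider is shared between the training and test partitions.
    assert train_groups.isdisjoint(test_groups)
```

With real data, scikit-learn's `GroupKFold` provides comparable group-wise splitting; whether it matches the authors' adjusted procedure cannot be determined from the abstract alone.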
