du Preez Anli, Bhattacharya Sanmitra, Beling Peter, Bowen Edward
Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, United States of America.
AI Center of Excellence, Deloitte & Touche LLP, New York, NY, United States of America.
Artif Intell Med. 2025 Feb;160:103061. doi: 10.1016/j.artmed.2024.103061. Epub 2024 Dec 28.
Identifying fraud in healthcare programs is crucial, as an estimated 3%-10% of the total healthcare expenditures are lost to fraudulent activities. This study presents a systematic literature review of machine learning techniques applied to fraud detection in health insurance claims. We aim to analyze the data and methodologies documented in the literature over the past two decades, providing insights into research challenges and opportunities.
We identified research studies on health insurance fraud detection using machine learning approaches from databases such as Google Scholar, Springer-Link journals, Elsevier, PubMed, Excerpta Medica Database (EMBASE), Scopus, the Association for Computing Machinery (ACM) Digital Library, and the Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library. We included only articles that presented experimental results of machine learning-based approaches applied to healthcare claims. From the reviewed articles, 137 were selected for the final qualitative and quantitative analyses.
In recent years, there has been a surge in publications centered on the use of machine learning to detect health insurance fraud. Among these studies, those focused on the detection of fraud committed by healthcare providers were the most prevalent, followed by those on fraud committed by patients. A wide variety of machine learning algorithms are highlighted in these studies, ranging from unsupervised (41 studies) and supervised methods (94 studies) to hybrid approaches (12 studies). While traditional machine learning approaches remain dominant in this research area, the adoption of advanced deep learning techniques is on the rise. Considering the type of healthcare claims data used, 30 studies utilized private data sources, while the rest used publicly available datasets. Data from 16 countries were utilized, with a majority coming from the United States (96 studies), followed by China (11 studies) and Australia (5 studies).
Detecting fraud in healthcare claims using machine learning presents several challenges. These include inconsistent data, the absence of data standardization and integration, privacy concerns, and a limited number of labeled fraudulent cases on which to train models. Future work should focus on enhancing transparency in data preparation, promoting the sharing of fraud investigation outcomes by authorities, and developing benchmark datasets to improve accessibility and comparability. Furthermore, innovative data sampling techniques, new feature encoding methods for training machine learning models, and the latest advancements in deep learning can significantly advance research in health insurance fraud detection.
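To make the data sampling point concrete, the following is a minimal sketch of random oversampling, one common way to compensate for a small number of labeled fraudulent cases. It is an illustration only and is not taken from any of the reviewed studies; the function name `random_oversample`, the toy claim features (billed amount, number of procedures), and the labels are all hypothetical assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is padded up to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy claims: (billed_amount, num_procedures); label 1 = fraud.
claims = [(120.0, 1), (95.0, 2), (110.0, 1), (130.0, 1), (9800.0, 14), (105.0, 2)]
labels = [0, 0, 0, 0, 1, 0]

balanced_rows, balanced_labels = random_oversample(claims, labels)
print(Counter(balanced_labels))
```

In this toy example the single fraudulent claim is repeated until both classes contain five rows. In practice, resampling of this kind is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.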