Division of Infectious Diseases, David Geffen School of Medicine at University of California, Los Angeles.
Division of General Internal Medicine, David Geffen School of Medicine at University of California, Los Angeles.
JAMA Netw Open. 2022 Aug 1;5(8):e2225593. doi: 10.1001/jamanetworkopen.2022.25593.
IMPORTANCE: Overdose is one of the leading causes of death in the US; however, surveillance data lag considerably, with a long delay between a medical examiner's determination of a death and its appearance in national surveillance reports.
OBJECTIVE: To automate the classification of substance-related deaths in medical examiner data using natural language processing (NLP) and machine learning (ML).
DESIGN, SETTING, AND PARTICIPANTS: Diagnostic study comparing different natural language processing and machine learning algorithms to identify substances related to overdose in 10 health jurisdictions in the US from January 1, 2020, to December 31, 2020. Unstructured text from 35 433 medical examiner and coroners' death records was examined.
EXPOSURES: Text from each case was manually classified according to the substance related to the death. Three feature representation methods were used and compared: term frequency-inverse document frequency (TF-IDF), global vectors for word representation (GloVe), and concept unique identifier (CUI) embeddings. Several ML algorithms were trained, and the best models were selected based on F-scores. The best models were then evaluated on a hold-out test set, and results were reported with 95% CIs.
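As an illustration of the first feature representation named above, the following is a minimal TF-IDF sketch in plain Python. It is not the study's implementation: the whitespace tokenizer, the unsmoothed weighting log(N/df), and the three example cause-of-death strings are all assumptions made for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the study's
    # preprocessing is not described in the abstract.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (sorted vocabulary, list of {term: weight} dicts, one per document)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Unsmoothed idf = log(N / df); a term found in every document gets weight 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency is the share of the document's tokens taken by the term.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return sorted(df), vectors


# Hypothetical free-text cause-of-death entries, invented for illustration.
records = [
    "acute fentanyl toxicity",
    "acute cocaine and fentanyl toxicity",
    "chronic alcohol use",
]
vocab, vecs = tfidf_vectors(records)
```

In this toy corpus, "cocaine" appears in only one record and so receives a higher weight in the second record than "fentanyl", which appears in two. Such vectors would then be passed to a classifier.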
MAIN OUTCOMES AND MEASURES: Text data from death certificates were classified as any opioid, fentanyl, alcohol, cocaine, methamphetamine, heroin, prescription opioid, and an aggregate of other substances. Diagnostic metrics and 95% CIs were calculated for each combination of feature extraction method and ML classifier.
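The diagnostic metrics mentioned above can be sketched for a single substance label as follows. The abstract does not say how the 95% CIs were computed, so the Wilson score interval used here is an assumption, as are the function names and the toy labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = substance involved)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Toy hold-out labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
sensitivity = tp / (tp + fn)
sens_lo, sens_hi = wilson_interval(tp, tp + fn)
```

Repeating this calculation for each combination of feature representation and classifier, and keeping the combination with the highest F-score on validation data, would mirror the selection step described in the abstract.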
RESULTS: Of 35 433 death records analyzed (decedent median age, 58 years [IQR, 41-72 years]; 24 449 [69%] male), the most common substances related to deaths were any opioid (5739 [16%]), fentanyl (4758 [13%]), alcohol (2866 [8%]), cocaine (2247 [6%]), methamphetamine (1876 [5%]), heroin (1613 [5%]), prescription opioids (1197 [3%]), and any benzodiazepine (1076 [3%]). The CUI embeddings had similar or better diagnostic metrics compared with word embeddings and TF-IDF for all substances except alcohol. ML classifiers had perfect or near-perfect performance in classifying deaths related to any opioid, heroin, fentanyl, prescription opioids, methamphetamine, cocaine, and alcohol. Classification of benzodiazepines was suboptimal with all 3 feature extraction methods.
CONCLUSIONS AND RELEVANCE: In this diagnostic study, NLP/ML algorithms demonstrated excellent diagnostic performance in classifying substances related to overdoses. These algorithms should be integrated into workflows to decrease the lag time in reporting overdose surveillance data.