Development and Validation of Machine Models Using Natural Language Processing to Classify Substances Involved in Overdose Deaths.

Author affiliations

Division of Infectious Diseases, David Geffen School of Medicine at University of California, Los Angeles.

Division of General Internal Medicine, David Geffen School of Medicine at University of California, Los Angeles.

Publication information

JAMA Netw Open. 2022 Aug 1;5(8):e2225593. doi: 10.1001/jamanetworkopen.2022.25593.

DOI: 10.1001/jamanetworkopen.2022.25593

PMID: 35939303

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9361079/

Abstract

IMPORTANCE

Overdose is one of the leading causes of death in the US; however, surveillance data lag considerably, with a long delay between a medical examiner's determination of death and its appearance in national surveillance reports.

OBJECTIVE

To automate the classification of deaths related to substances in medical examiner data using natural language processing (NLP) and machine learning (ML).

DESIGN, SETTING, AND PARTICIPANTS

This diagnostic study compared different NLP and ML algorithms for identifying substances related to overdose in 10 health jurisdictions in the US from January 1, 2020, to December 31, 2020. Unstructured text from 35 433 medical examiners' and coroners' death records was examined.

EXPOSURES

Text from each case was manually classified according to the substance related to the death. Three feature representation methods were used and compared: term frequency-inverse document frequency (TF-IDF), Global Vectors for Word Representation (GloVe), and concept unique identifier (CUI) embeddings. Several ML algorithms were trained, and the best models were selected based on F-scores. The best models were then evaluated on a hold-out test set, and results were reported with 95% CIs.
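To make the first of these feature representations concrete, the following is a minimal sketch of TF-IDF weighting in plain Python. It is not the study's pipeline: the tokenizer, the smoothed IDF formula, and the three cause-of-death strings are assumptions chosen for illustration, and the records are invented rather than taken from the data set.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return (idf, vectors): an IDF table and one sparse dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty records
        # Term frequency (relative count) times inverse document frequency.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return idf, vectors


# Hypothetical cause-of-death strings, invented for illustration only.
records = [
    "acute fentanyl and cocaine toxicity",
    "acute ethanol intoxication",
    "atherosclerotic cardiovascular disease",
]
idf, vectors = tfidf_vectors(records)
```

In this toy corpus, a term such as "acute" that appears in two records receives a lower IDF than "fentanyl", which appears in one, so substance names stand out in the resulting vectors. Those vectors would then be passed to whichever classifier is being compared.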

MAIN OUTCOMES AND MEASURES

Text data from death certificates were classified as any opioid, fentanyl, alcohol, cocaine, methamphetamine, heroin, prescription opioid, and an aggregate of other substances. Diagnostic metrics and 95% CIs were calculated for each combination of feature extraction method and machine learning classifier.
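The diagnostic metrics mentioned above can be illustrated with a short sketch for one binary label (for example, "fentanyl related" versus not). The abstract does not state how the 95% CIs were computed; the Wilson score interval used here is an assumption, and the labels are invented toy data.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fp, fn, tn


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no information: the widest possible interval
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, and PPV with intervals, plus the F1 score."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": (ratio(tp, tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (ratio(tn, tn + fp), wilson_interval(tn, tn + fp)),
        "ppv": (ratio(tp, tp + fp), wilson_interval(tp, tp + fp)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy labels (1 = substance related to the death), invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
```

Running the same function once per substance and per feature-classifier combination would yield a table of the kind the abstract describes, with the F1 score available for model selection.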

RESULTS

Of 35 433 death records analyzed (decedent median age, 58 years [IQR, 41-72 years]; 24 449 [69%] were male), the most common substances related to deaths included any opioid (5739 [16%]), fentanyl (4758 [13%]), alcohol (2866 [8%]), cocaine (2247 [6%]), methamphetamine (1876 [5%]), heroin (1613 [5%]), prescription opioids (1197 [3%]), and any benzodiazepine (1076 [3%]). The CUI embeddings had similar or better diagnostic metrics compared with word embeddings and TF-IDF for all substances except alcohol. ML classifiers had perfect or near-perfect performance in classifying deaths related to any opioid, heroin, fentanyl, prescription opioids, methamphetamine, cocaine, and alcohol. Classification of benzodiazepines was suboptimal using all 3 feature extraction methods.
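As a quick consistency check, the reported percentages can be recomputed from the counts given above. The sketch below uses only numbers stated in this abstract and rounds to the nearest whole percent, which is assumed to be the convention used.

```python
TOTAL_RECORDS = 35433

# (count, reported percentage) pairs copied from the Results paragraph.
reported = {
    "male": (24449, 69),
    "any opioid": (5739, 16),
    "fentanyl": (4758, 13),
    "alcohol": (2866, 8),
    "cocaine": (2247, 6),
    "methamphetamine": (1876, 5),
    "heroin": (1613, 5),
    "prescription opioids": (1197, 3),
    "any benzodiazepine": (1076, 3),
}

# Recompute each share of the total and round to a whole percent.
recomputed = {
    name: round(100 * count / TOTAL_RECORDS)
    for name, (count, _) in reported.items()
}

# Collect any category whose recomputed value differs from the reported one.
mismatches = {
    name: (reported[name][1], recomputed[name])
    for name in reported
    if reported[name][1] != recomputed[name]
}
```

The substance categories overlap (a fentanyl-related death also counts under any opioid), so the percentages are not expected to sum to 100.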

CONCLUSIONS AND RELEVANCE

In this diagnostic study, NLP/ML algorithms demonstrated excellent diagnostic performance at classifying substances related to overdoses. These algorithms should be integrated into workflows to decrease the lag time in reporting overdose surveillance data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9b5/9361079/27ce4e28b31e/jamanetwopen-e2225593-g001.jpg
