Development of a Machine Learning Model Using Multiple, Heterogeneous Data Sources to Estimate Weekly US Suicide Fatalities.

Author information

Department of Computer Science and Engineering, Incheon National University, Incheon, South Korea.

Office of Strategy and Innovation, National Center for Injury Prevention and Control, Centers for Disease Control and Prevention, Atlanta, Georgia.

Publication information

JAMA Netw Open. 2020 Dec 1;3(12):e2030932. doi: 10.1001/jamanetworkopen.2020.30932.

IMPORTANCE

Suicide is a leading cause of death in the US. However, official national statistics on suicide rates are delayed by 1 to 2 years, hampering evidence-based public health planning and decision-making.

OBJECTIVE

To estimate weekly suicide fatalities in the US in near real time.

DESIGN, SETTING, AND PARTICIPANTS

This cross-sectional national study used a machine learning pipeline to combine signals from several streams of real-time information to estimate weekly suicide fatalities in the US in near real time. This 2-phase approach first fits optimal machine learning models to each individual data stream and subsequently combines predictions made from each data stream via an artificial neural network. National-level US administrative data on suicide deaths, health services, and economic, meteorological, and online data were variously obtained from 2014 to 2017. Data were analyzed from January 1, 2014, to December 31, 2017.
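The 2-phase idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the per-stream models, the network architecture, or the training procedure. The sketch assumes ridge regression as a stand-in for each per-stream model, a small one-hidden-layer network as the combiner, and fully synthetic data; all function names and parameters (`fit_ridge`, `fit_combiner`, `run_demo`, `hidden`, `lr`) are hypothetical.

```python
import numpy as np

def add_intercept(X):
    """Append a column of ones so the linear model has an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge(X, y, alpha=1.0):
    """Phase 1 stand-in (assumption): closed-form ridge regression for one stream."""
    Xb = add_intercept(X)
    return np.linalg.solve(Xb.T @ Xb + alpha * np.eye(Xb.shape[1]), Xb.T @ y)

def predict_ridge(w, X):
    return add_intercept(X) @ w

def fit_combiner(P, y, hidden=4, lr=0.02, epochs=5000, seed=0):
    """Phase 2: one-hidden-layer tanh network trained by full-batch gradient
    descent on half the mean squared error. P holds one column per stream."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (P.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        H = np.tanh(P @ W1 + b1)                # hidden activations
        err = (H @ W2 + b2) - y                 # prediction error
        dH = np.outer(err, W2) * (1.0 - H**2)   # backpropagate through tanh
        W2 -= lr * (H.T @ err) / n
        b2 -= lr * err.mean()
        W1 -= lr * (P.T @ dH) / n
        b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2

def predict_combiner(params, P):
    W1, b1, W2, b2 = params
    return np.tanh(P @ W1 + b1) @ W2 + b2

def run_demo(seed=0):
    """Synthetic example: 3 noisy streams that share one seasonal signal."""
    rng = np.random.default_rng(seed)
    weeks = np.arange(156)                      # three years of weekly data
    signal = np.sin(2 * np.pi * weeks / 52)
    # Synthetic target in roughly unit scale; real counts would need scaling.
    y = signal + 0.1 * rng.normal(size=weeks.size)
    streams = [
        np.outer(signal, rng.uniform(0.5, 1.5, size=2))
        + 0.3 * rng.normal(size=(weeks.size, 2))
        for _ in range(3)
    ]
    train, test = slice(0, 104), slice(104, 156)
    # Phase 1: one model per stream, fitted on the training weeks only.
    models = [fit_ridge(X[train], y[train]) for X in streams]
    P = np.column_stack([predict_ridge(w, X) for w, X in zip(models, streams)])
    # Phase 2: the network combines the per-stream predictions.
    params = fit_combiner(P[train], y[train])
    return predict_combiner(params, P[test]), y[test]

estimates, actual = run_demo()
print("held-out Pearson r:", np.corrcoef(estimates, actual)[0, 1])
```

For brevity, the combiner here is trained on in-sample phase 1 predictions; a more careful design would probably train it on out-of-sample predictions so that it does not inherit phase 1 overfitting.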

EXPOSURES

Longitudinal data on suicide-related exposures were obtained from multiple, heterogeneous streams: emergency department visits for suicide ideation and attempts collected via the National Syndromic Surveillance Program (2015-2017); calls to the National Suicide Prevention Lifeline (2014-2017); calls to US poison control centers for intentional self-harm (2014-2017); consumer price index and seasonality-adjusted unemployment rate, hourly earnings, home price index, and 3-month and 10-year yield curves from the Federal Reserve Economic Data (2014-2017); weekly daylight hours (2014-2017); Google and YouTube search trends related to suicide (2014-2017); and public posts on suicide on Reddit (2 314 533 posts), Twitter (9 327 472 tweets; 2015-2017), and Tumblr (1 670 378 posts; 2014-2017).

MAIN OUTCOMES AND MEASURES

Weekly estimates of suicide fatalities in the US were obtained through a machine learning pipeline that integrated the above data sources. Estimates were compared statistically with actual fatalities recorded by the National Vital Statistics System.

RESULTS

Combining information from multiple data streams, the machine learning method yielded estimates of weekly suicide deaths with high correlation to actual counts and trends (Pearson correlation, 0.811; P < .001), while estimating annual suicide rates with low error (0.55%).
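As a rough illustration of the two reported measures, the sketch below computes a Pearson correlation between weekly estimates and actual counts, and a percentage error on the yearly total. The abstract does not give the exact error formula, so the definition used here (absolute difference of annual totals divided by the actual total) is an assumption, and the sample numbers are invented for illustration and are not taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def annual_percent_error(weekly_estimates, weekly_actual):
    """Assumed definition: absolute error of the summed weekly estimates,
    expressed as a percentage of the summed actual counts."""
    total_est = sum(weekly_estimates)
    total_act = sum(weekly_actual)
    return abs(total_est - total_act) / total_act * 100.0

# Invented weekly counts, for illustration only.
actual = [900, 950, 1000, 980]
estimated = [910, 940, 1010, 975]

print("Pearson r:", pearson_r(estimated, actual))
print("Annual error (%):", annual_percent_error(estimated, actual))
```

The example shows that the two measures can diverge: weekly errors that offset each other can give a small annual error even when the week-by-week correlation is imperfect.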

CONCLUSIONS AND RELEVANCE

The proposed ensemble machine learning framework reduces the error for annual suicide rate estimation to less than one-tenth of that of current forecasting approaches that use only historical information on suicide deaths. These findings establish a novel approach for tracking suicide fatalities in near real time and provide the potential for an effective public health response such as supporting budgetary decisions or deploying interventions.
