Dataset on fatal road traffic crash attributes extracted via natural language processing of online media articles in India.

Author information

Ashutosh Ashutosh, Chand Sai

Affiliations

Transportation Research and Injury Prevention Centre (TRIP Centre), Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.

Publication information

Data Brief. 2025 Apr 23;60:111578. doi: 10.1016/j.dib.2025.111578. eCollection 2025 Jun.

Abstract

Road traffic crashes are among the leading causes of death globally and carry substantial social and economic costs. Online media is a key source of public information on road safety, so understanding how crashes are reported is crucial for detecting potential reporting biases and enhancing safety awareness. To address the lack of high-quality, media-reported fatal crash data, fatal crash reports for 2022-2023 were extracted from The Times of India, a prominent Indian news outlet. The resulting dataset covers 2898 fatal crashes, 6584 fatalities and 7812 injuries, with 16 detailed crash attributes. It was developed using web scraping and natural language processing (NLP) techniques. Automated tools such as Selenium and BeautifulSoup were used to extract raw data from the news source, and NLP algorithms were then applied to identify key crash attributes, including crash date, location, vehicles involved and number of fatalities. This study provides a replicable framework for constructing robust datasets from media sources, enabling multidisciplinary research on transportation safety, media reporting and public perception of crashes. The dataset is expected to serve as a valuable resource for analysing how the media shapes road safety narratives and for identifying high-fatality crash locations.
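
The abstract does not say which NLP algorithms were used to pull attributes out of the scraped text. As a rough illustration of the attribute-extraction step only, the following is a minimal rule-based sketch that reads fatality and injury counts from a sentence using Python's standard `re` module. The keyword lists, the `WORD_NUMBERS` map and the function names are assumptions made for this example and are not taken from the dataset's pipeline.

```python
import re

# Hypothetical map of spelled-out counts; real reports may need a wider range.
WORD_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# A count is 1-3 digits (so four-digit years are skipped) or a spelled-out word.
NUM = r"\b(\d{1,3}|" + "|".join(WORD_NUMBERS) + r")\b"

# A count, up to three filler words, then a fatality or injury keyword.
# The keyword lists are illustrative assumptions, not an exhaustive vocabulary.
FATAL_RE = re.compile(
    NUM + r"(?:\s+\w+){0,3}?\s+(?:killed|died|dead)\b", re.IGNORECASE
)
INJURY_RE = re.compile(
    NUM + r"(?:\s+\w+){0,3}?\s+(?:injured|hurt)\b", re.IGNORECASE
)


def to_int(token):
    """Convert a matched count ('5' or 'three') to an integer."""
    token = token.lower()
    return int(token) if token.isdigit() else WORD_NUMBERS[token]


def extract_counts(text):
    """Return the first fatality and injury counts found in text, or None."""
    fatal = FATAL_RE.search(text)
    injured = INJURY_RE.search(text)
    return {
        "fatalities": to_int(fatal.group(1)) if fatal else None,
        "injuries": to_int(injured.group(1)) if injured else None,
    }


if __name__ == "__main__":
    sample = "Three people were killed and 5 others injured when a truck hit a car."
    print(extract_counts(sample))
```

A pattern-based approach like this is easy to audit but brittle: it misses phrasings outside its keyword lists and counts above ten written as words, which is one reason a fuller pipeline would likely combine such rules with more general NLP tooling.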

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef32/12098169/673da64eb25c/gr1.jpg
