Ashutosh Ashutosh, Chand Sai
Transportation Research and Injury Prevention Centre (TRIP Centre), Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.
Data Brief. 2025 Apr 23;60:111578. doi: 10.1016/j.dib.2025.111578. eCollection 2025 Jun.
Road traffic crashes are among the leading causes of death globally and have substantial social and economic impacts. Online media is a key source of public information on road safety, so understanding how crashes are reported is crucial for detecting potential reporting biases and enhancing safety awareness. To address the lack of high-quality, media-reported fatal crash data, fatal crash reports for 2022-2023 were extracted from The Times of India, a prominent Indian news outlet. The resulting dataset comprises 2898 fatal crashes, 6584 fatalities and 7812 injuries, and records 16 detailed crash attributes. It was developed using web scraping and natural language processing (NLP) techniques: automated tools such as Selenium and BeautifulSoup were employed to extract raw data from the news source, and NLP algorithms were then applied to identify key crash attributes, including crash date, location, vehicles involved and number of fatalities. This study provides a replicable framework for constructing robust datasets from media sources, enabling multidisciplinary research on transportation safety, media reporting and public perception of crashes. The dataset is expected to serve as a valuable resource for analysing how the media shapes road safety narratives and for identifying high-fatality crash locations.
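To make the attribute-identification step concrete, the following is a minimal, rule-based sketch of pulling fatality counts, injury counts and vehicle types out of a single report sentence. It is an illustration only and not the authors' pipeline: the abstract does not specify which NLP algorithms were used, and the function name `extract_attributes`, the keyword lists and the regular expressions are all assumptions introduced here. The scraping step with Selenium and BeautifulSoup is omitted, so the input is assumed to be plain text that has already been extracted.

```python
import re

# Hypothetical lookup for spelled-out counts; real reports would need a wider range.
_NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Hypothetical vehicle keywords, chosen only for illustration.
_VEHICLES = ("car", "truck", "bus", "motorcycle", "bike", "tractor", "van")

# A count is either digits or one of the number words above.
_COUNT = r"\b(\d+|" + "|".join(_NUMBER_WORDS) + r")\b"

# A count followed, within a few words, by a fatality or injury verb.
_FATAL = re.compile(
    _COUNT + r"\s+(?:\w+\s+){0,3}?(?:killed|dead|die|died|dies)\b", re.I
)
_INJURED = re.compile(_COUNT + r"\s+(?:\w+\s+){0,3}?(?:injured|hurt)\b", re.I)
_VEHICLE = re.compile(r"\b(" + "|".join(_VEHICLES) + r")s?\b", re.I)


def _to_int(token):
    """Convert a matched count ('5' or 'Three') to an integer."""
    token = token.lower()
    return int(token) if token.isdigit() else _NUMBER_WORDS[token]


def extract_attributes(text):
    """Return a dict of crash attributes found in one report sentence."""
    fatal = _FATAL.search(text)
    injured = _INJURED.search(text)

    # Keep vehicle types in order of first appearance, without duplicates.
    vehicles = []
    for match in _VEHICLE.finditer(text):
        name = match.group(1).lower()
        if name not in vehicles:
            vehicles.append(name)

    return {
        "fatalities": _to_int(fatal.group(1)) if fatal else None,
        "injuries": _to_int(injured.group(1)) if injured else None,
        "vehicles": vehicles,
    }


if __name__ == "__main__":
    # Made-up headline used purely as sample input.
    print(extract_attributes("Three killed, 5 injured as truck hits car in Pune"))
```

Keyword rules like these are easy to inspect but brittle. A production pipeline would more plausibly combine them with named-entity recognition for dates and locations, which this sketch does not attempt.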