This project explores a dataset of traffic accidents to identify the primary factors contributing to crash severity. By analysing road infrastructure, and environmental conditions, we aim to uncover insights into why certain accidents result in higher severity levels (1–4). In doing so we can predict which conditions result in more severe crashes and what can be done to improve road safety.
- README
- Project board
- Raw data | Clean data
- ETL Notebook
- Visualisation & Statistical analysis Notebook
- Machine Learning Notebook
- Power BI dashboard
- Conclusion and discussion
- Project Overview
- Dataset Content
- Business Requirements
- Hypothesis Testing and Validation
- Project plan
- Rationale to map business requirements
- Data Analysis Methods Used
- Deployment
- Dashboard design
- Main Data Analysis Libraries
- Conclusion and discussion
- Limitations in the data
- Unfixed bugs
- Development Roadmap
- Credits
- Acknowledgements
| Category | Field | Description | Data Type |
|---|---|---|---|
| Target | severity |
The severity of the accident (1 to 4), indicating the impact on traffic. | int64 |
| Temporal | time |
Full timestamp of the accident occurrence. | datetime |
year |
The calendar year of the crash. | int64 |
|
month |
The month of the crash (January - December). | category |
|
day |
The day of the month. | int64 |
|
hour |
The hour of the day (0-23) for rush-hour analysis. | int32 |
|
sunrise_sunset |
Indicates whether the accident occurred during the Day or Night. | object |
|
| Location | start_lat |
The GPS latitude of the accident start point. | float64 |
start_lng |
The GPS longitude of the accident start point. | float64 |
|
city |
The city where the accident occurred. | object |
|
county |
The county where the accident occurred. | object |
|
state |
The US state where the accident occurred. | object |
|
| Weather | weather |
Generalised weather category. | object |
weather_condition |
Detailed meteorological description (e.g., Fair, Cloudy, Rain). | object |
|
temperature(f) |
The ambient temperature in Fahrenheit. | float64 |
|
humidity(%) |
The percentage of humidity in the air. | float64 |
|
visibility(mi) |
Distance visible in miles at the time of the crash. | float64 |
|
precipitation(in) |
Amount of rain or snow recorded in inches. | float64 |
|
wind_speed(mph) |
The speed of the wind at the time of the event. | float64 |
|
| Road Features | crossing |
Presence of a pedestrian crossing nearby (True/False). | bool |
junction |
Indicates if the accident occurred at or near a junction (True/False). | bool |
|
traffic_signal |
Indicates if a traffic signal was present (True/False). | bool |
|
| Details | description |
A brief textual description of the accident event. | object |
As a local council in America we want to study crash data across the country to determine which road features (junctions, crossings, or signals) are most frequently associated with high-severity crashes. As well as studying the time of day, or year when crashes are taking place to see whether we can put in any preventative measures. Are accidents happening more at night? Are accidents around crossings more frequent than those not near crossings? Does bad weather cause the increase in number of severity of crashes that we think it does?
Hypothesis 1: More crashes, of a higher severity take place at night, than they do during the day regardless of the weather.
H₀ - There will be no relationship between severity of crashes and time of day
H₁ - The highest proportion of severe will be at night proving there is a relationship between time of day and intensity of crash
This will be tested using: A chi-squared test to understand the statistical analysis followed by plotting the data.
Result: We reject the Null Hypothesis. There is a significant relationship between time of day and severity. The chi-squared test supports this return a pval of 0.0001210779, well below the 0.05 threshold. The visual plots also support this, whilst there are more crashes during the day and the majority of accidents are given a severity rating of two the largest change in percentage between day and night is category 4 crashes increasing by 106%.
Hypothesis 2: Accident severity changes depending on whether an accident occurs at a junction.
-
Null Hypothesis (H0):
Accident severity does not differ between junction and non-junction locations. -
Alternative Hypothesis (H1):
Accident severity differs between junction and non-junction locations.
- Dataset:
us_accidents_cleaned.csv - Key variables:
severityjunction(True / False)
The data was pre-cleaned as part of the shared ETL process.
- Loaded the cleaned dataset in a Jupyter Notebook.
- Grouped accident records by junction status.
- Calculated average accident severity for:
- Junction locations
- Non-junction locations
- Created visualizations to compare severity levels.
- Interpreted the visual results to evaluate the hypothesis.
Bar charts were used to compare the average accident severity between junction and non-junction locations.
These visualizations helped identify patterns and differences without applying complex statistical models.
- Accidents occurring at junctions show a different average severity compared to non-junction locations.
- The visual comparison suggests that junction presence is associated with changes in accident severity.
Based on the exploratory analysis and visual evidence, accident severity appears to vary depending on whether an accident occurs at a junction.
This supports the alternative hypothesis (H1) and suggests that junctions play an important role in accident risk and outcomes.
- Junction areas should receive additional safety measures, such as improved signage and traffic control.
- Further analysis could include:
- Weather conditions
- Time of day
- Traffic volume
- Applying statistical tests or machine learning models could strengthen the findings in future work.
Hypothesis 3: The Midwest shows moderate accident severity compared to those in the coastal regions
High-Level Steps, The dataset was collected from Kaggle and reviewed to understand its structure and variables., Data cleaning and preparation were performed, including handling missing values and creating day and night categories. Exploratory Data Analysis (EDA) was conducted to identify patterns in accident severity., Hypotheses were tested using statistical methods (chi-squared test)., Results were interpreted to generate insights and recommendations.,
Data Management, Data was sourced from Kaggle and managed in a structured manner throughout the project., Cleaning and processing ensured data consistency and accuracy., Data was grouped and analysed based on key variables such as time of day and junction presence., Visualisations were used to support interpretation and present findings clearly.,
Research Methodologies, Exploratory Data Analysis (EDA) was used to understand patterns in the data., Statistical analysis (chi-squared test) was applied to validate the hypotheses., Data visualisation was performed using Power BI to clearly communicate insights.
The analysis followed a the best practice that we have learnt throughout the course. By examining the dataset of over 4,800 US traffic accidents, the initial stage focused on understanding the distribution of crash severity and identifying critical environmental and infrastructure variables. Data management was central to the process; raw data was cleaned and transformed using Pandas to ensure that temporal features—such as peak hours and seasonal months—were accurately extracted. Categorical variables and boolean flags for road features were encoded to facilitate both statistical correlation and machine learning compatibility, ensuring a high level of data integrity from collection through to interpretation.
The research methodology was designed to provide a multi-dimensional view of road safety. Quantitative analysis was employed to pinpoint how fluctuations in visibility, precipitation, and temperature correlate with accident frequency. This was complemented by a comparative infrastructure study, which isolated the impact of junctions and traffic signals on the resulting severity of a crash. To provide actionable insights for stakeholders, the project utilised supervised machine learning—specifically Decision Tree Classifiers—to rank feature importance. This "white-box" approach was chosen to ensure the findings remained transparent and interpretable for local councils and urban planners. Finally, geospatial mapping was integrated into the workflow to identify physical "black spots," transforming tabular data into a visual narrative that highlights high-risk locations across the United States.
To ensure the analysis provided actionable value, the following requirements were established to map data insights to real-world local council applications:
- Infrastructure Safety: Identifying high-risk road features to prioritised engineering audits.
- Environmental Policy: Determining visibility and precipitation thresholds for dynamic speed limit warnings.
The project utilised multiple visualisation strategies:
- Statistical (Seaborn/Matplotlib): Used to identify correlations between environmental factors and crash severity.
- Interactive (Plotly and Power BI dashboards): Implemented to allow stakeholders to explore geographic "black spots" across the United States.
- Machine Learning: We were able to create a model which was able to judge the severity of potential crashes based on weather conditions and road infrastructure.
- Textual (WordCloud): Employed to analyse accident descriptions for recurring human factors or situational themes.
The analysis was structured around the CRISP-DM (Cross Industry Standard Process for Data Mining) framework, ensuring a systematic transition from business understanding to data preparation and modelling. We started the analysis by inspecting, cleaning and transforming it. Once the dataset was loaded, an initial assessment revealed that there were ~7.7 million records. This was far too many for our project and was also over the github data limit. We opted to take a sample of 5000 records from the original set, this would be enough to work with even taking into account the fact that we may lose some records in the cleaning phase. We handled missing values and performed data type conversions, extracting day, month, year from the date/time to allow for better analysis and visualisation. Boolean road features such as junction, crossing and traffic_signal were kept as they were, while environmental metrics like temperature and visibility were checked for outliers. This stage was critical in transforming the raw data into a structured format suitable for high-fidelity statistical exploration. Throughout this preparation phase, we utilised an Agile approach, favouring small iterations with constant evaluation to refine our variables. For example, we initially found the weather_condition column too fragmented to provide clear insights; so we chose to simplify it into 5 categorical values from the original 44.
Following the preparation of the data, we transitioned into machine learning and predictive modelling. Using Scikit-Learn, we developed a pipeline that integrated feature scaling and categorical encoding. The data was scaled using the StandardScaler to ensure that all features were on a comparable scale, which prevents features with larger numerical ranges from dominating the training process. A decision tree classifier was implemented to investigate the drivers of crash severity. A decision tree classifier was chosen for its transparency, easy to interpret the outputs and it can handle both numerical and categorical features. A decision tree is suitable for identifying the most influential variables in predicting adoption outcomes. By splitting the data into training and testing sets, we were able to validate the model's performance through accuracy scores and confusion matrices, ensuring the insights were statistically sound before moving to the visualisation stage. his is an initial model, with further testing a more tuned model or different classifier will provide more accurate predictions.
The final phase of the project focused on data visualisation using a multi-library approach to communicate findings to different stakeholders. Seaborn and Matplotlib were utilised for statistical plots, such as correlation heatmaps and distribution charts, which quantified the relationship between weather conditions and accident severity. To provide a more intuitive experience for local councils, we used Plotly to create interactive charts. Thanks to power BI we were also able to make a couple of intervative dashboards where we mapped accident coordinates across the United States, allowing for a dynamic exploration of accident "hotspots." This combination of static statistical evidence and interactive mapping ensures that the results are both scientifically rigorous and accessible for urban planning decisions.
Power BI dashboards can be found here
US Car Accidents
The single-page dashboard integrates multiple interactive and analytical components:
Filters and Controls Year, Month, Day, Time Checkboxes: Selectable filters for years 2016-2023 Search Bar: Natural language query input with suggestions
Data Visualisations and Widgets
Rows: Weather conditions (Cloudy, Clear) Columns: Precise Monthly accident counts Widget Type: Interactive table with sorting and filtering
Widget Type: Key influencers visual Insight: “Clear” weather increases average severity by 0.21 Purpose: Highlights statistically significant factors affecting accident severity
Widget Type: Interactive map with scalable blue circular markers Data Source: OSM and TomTom via Microsoft Azure Function: Visuals severity distribution geographically
Communicating Insights Search Bar with Query Suggestions: Enables SQL-like or natural language queries for deeper data exploration.
Key Influencers Analysis: Uses statistical modelling to show causal relationships.
Detailed Monthly Breakdown Table: Offers granular data for time-series analysis and modelling.
Interactive Map: Visually intuitive representation of severity by state. Checkbox Filters: Simple UI for narrowing down data by time period. Weather Condition Table: Uses plain language and colour-coded cells to highlight trends.
The project dashboard includes multiple visual components created by the team to analyse accident severity from different perspectives. These include:
Accident Severity Distribution Visuals showing the overall distribution of accident severity levels.
Severity by Junction Presence Bar charts comparing average accident severity at junctions versus non-junction locations.
Time-Based Analysis Visuals showing accident counts by hour of the day. Comparisons between day and night accidents.
Weather and Time Interaction Charts illustrating how accident frequency varies by weather conditions and time of day.
Geographical and Influencer Analysis (Dashboard)
- Interactive Power BI dashboard including:
- Filters for year and time
- Tables summarising accident metrics
- Key influencer visuals
- Map showing accident severity by state
Dashboard Planning and Iteration The dashboard design evolved during the project. Initial plans focused on basic charts, but were later refined to include interactive elements such as filters, maps, and key influencer visuals. These changes were made to improve clarity and allow deeper exploration of the data as new insights emerged.
Communication of Data Insights For non-technical audiences, insights were communicated using clear visuals, summaries, and interactive dashboards without requiring statistical knowledge. For technical audiences, detailed charts, statistical results, and breakdowns by variables such as time, weather, and junction presence were provided to support deeper analysis.
Dashboard Design for Different Audiences The dashboard was designed to simplify complex accident data by using: Clear titles and labels Visual comparisons instead of raw tables Interactive filters to customise views
This approach ensures accessibility for non-technical users while still supporting detailed analysis for technical users.
| Category | Library | Purpose |
|---|---|---|
| Data Handling | pandas |
Data cleaning, manipulation, and structural analysis. |
numpy |
Numerical operations and array handling. | |
| Visualisation | seaborn |
Statistical data plotting and heatmaps. |
matplotlib |
Core plotting engine for custom static charts. | |
plotly |
Interactive charts and geospatial Mapbox visualisations. | |
wordcloud |
Text analysis for accident descriptions. | |
| Preprocessing | sklearn.preprocessing |
Feature scaling and categorical encoding (OneHot/Label). |
sklearn.compose |
Managing transformations across different column types. | |
| Machine Learning | sklearn.model_selection |
Data splitting and cross-validation. |
sklearn.tree |
Decision Tree classifier implementation. | |
sklearn.feature_selection |
Identifying key variables affecting accident severity. | |
sklearn.pipeline |
Building reproducible end-to-end ML workflows. | |
sklearn.metrics |
Performance metrics (Accuracy, F1-Score, Confusion Matrix). |
This analysis of US traffic accidents provides a comprehensive overview of the factors that contribute to road incidents. By having data from all across America, environmental conditions, several significant patterns have been identified:
-
Seasonal Trends: The data highlights specific high-risk windows, particularly during morning and evening rush hours. Furthermore, seasonal fluctuations suggest a rise in incidents during the winter months, due to reduced daylight and adverse weather.
-
Environmental Impact: Weather conditions causing reduced visibility show a clear correlation with increased accident severity. This underscores the critical need for improved infrastructure and real-time driver warnings during inclement weather.
-
Geographic Hotspots: Certain states and cities have disproportionately high accident rates. The findings suggest that urban density and the complexity of road networks in these regions are primary contributors to the frequency of collisions.
-
Severity Factors: Through the examination of road features, such as junctions and traffic signals, it is evident that specific physical layouts are more prone to serious accidents, providing a roadmap for potential urban planning improvements.
In summary, while driver behaviour is a key variable, environmental and structural factors play a fundamental role in road safety. This project serves as a foundation for data-driven strategies aimed at reducing traffic fatalities and improving emergency response efficiency across the United States.
Predictive modelling using a Decision Tree Classifier shows that the most important features are: Start location, temperature, year, humidity, day, time, wind and traffic crossing.
The model could be improved with further adjusting hyperparameters. Perhaps a different type of Classifier might improve the classification scores.
Given a set of conditions the model can predict the severity of a traffic accident (provided no unforeseen variables are used).
Overall accuracy: ~0.67 Not too good but accuracy is misleading because the classes of severity are highly imbalanced.
Classification Report:
- Class 2 dominates, so the model mostly predicts class 2.
- Classes 1, 3, 4 are severely under-predicted → recall is very low.
- Macro F1 (~0.30) shows that minority classes are failing.
Confusion matrix:
- Most misclassified samples of class 3 and 4 are predicted as class 2.
- There was a lack of distribution amongst severity of the crashes, over 80% had a rating of 2
- We were unable to plot a heatmap due to the fact that all entries are crashes so we had to switch our focus to what causes the severity of each crash.
There were no bugs encountered throughout this project
With the Decision Tree ML model different hyperparameters have been tested. Next it could prove beneficial to try a Random Forest Classifier to improve the accuracy score. The severity target variable is imbalanced which is making it hard for the model to predict accurately for targets other than a severity of 2. To delve deeper into the data the three different twilight columns could be investigated to see if the reducing light levels has a noticeable effect on traffic accidents.
There were no ethical considerations required. All the data in this dataset was anonymised, no personal details have been included.
Data sourced from Kaggle https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
Please note the following two papers:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
Provenance: “This dataset was collected in real-time using multiple Traffic APIs. It contains accident data collected from February 2016 to March 2023 for the Contiguous United States. This dataset is being distributed solely for research purposes under the Creative Commons Attribution-Noncommercial-ShareAlike license (CC BY-NC-SA 4.0).”
- Beth, Hidaia, Michael and Steven for creating this project
- Code institute course facilitator Emma Lamont
- Code institute tutors and technical coaches






