Data Science Capstone Project
#Import the pandas library as pd.
import pandas as pd
#Import the Numpy library as np.
import numpy as np
#Print the following statement: Hello Capstone Project Course!
print('Hello Capstone Project Course!')
Introduction | Business Problem
The Seattle government aims to prevent avoidable car accidents by alerting drivers to be more careful in critical situations. In most cases, inattention while driving, drug and alcohol abuse, and speeding are the main causes of accidents, and these can be reduced by enforcing regulations. Beyond these causes, weather, visibility, and road conditions are major factors outside a driver's control; their effects can be mitigated by revealing hidden patterns in the data and issuing warnings to the local government, police, and drivers on the affected roads. The target audience of the project is the local Seattle government, police, rescue groups, and, last but not least, car insurance companies. The model and its results will give this audience guidance for making informed decisions to reduce the number of accidents and injuries in the city.
Data
The data was collected by the Seattle Police Department and Accident Traffic Records Department from 2004 to present. The data consists of 37 independent variables and 194,673 rows. The dependent variable, SEVERITYCODE, contains numbers that correspond to different levels of accident severity. Additionally, COLLISIONTYPE describes the type of crash. Other attributes that I will consider, though they are not crucial, are WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL.
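Before relying on these attributes, it can help to check how many values are missing in each column and how the severity classes are split. The following is a minimal sketch on a small made-up frame; the rows are invented for illustration and only the column names come from the dataset described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collision data; these rows are invented for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2],
    "WEATHER": ["Clear", "Raining", np.nan, "Clear", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Dry", np.nan, "Wet"],
    "INATTENTIONIND": ["Y", np.nan, np.nan, np.nan, "Y"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Share of rows in each severity class.
severity_share = toy["SEVERITYCODE"].value_counts(normalize=True)

print(missing_per_column)
print(severity_share)
```

On the real data, the same two calls could be run on `df` once it is loaded.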
Methodology
I will use data visualization to look for patterns connecting WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL to the severity of car crashes. Afterwards, I will run a statistical analysis of which accidents are more likely to result in injury.
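One simple way to compare how likely injury is across categories is to compute, for each category, the number of crashes and the share that resulted in injury. The sketch below does this on invented rows; the helper column `injury` and the output names `crashes` and `injury_rate` are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Invented rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "UNDERINFL": ["0", "0", "0", "0", "1", "1"],
    "SEVERITYCODE": [1, 1, 1, 2, 2, 1],
})

# Flag injury crashes (severity code 2), then aggregate per category.
toy["injury"] = toy["SEVERITYCODE"] == 2
summary = toy.groupby("UNDERINFL")["injury"].agg(crashes="count", injury_rate="mean")

print(summary)
```

Keeping the crash count next to the rate makes it easier to see when a high rate rests on only a handful of crashes.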
Evaluation
SEVERITYCODE is either 1 (no injury) or 2 (injury). I want to create graphs of COLLISIONTYPE, WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL compared against their SEVERITYDESC.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
# Grab CSV File
!wget -O mydata.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
# Load CSV to dataframe
df = pd.read_csv('mydata.csv', index_col=0)
df.head(10)
df.info()
df["SEVERITYDESC"].value_counts().to_frame()
df['COLLISIONTYPE'].value_counts().to_frame()
df['WEATHER'].value_counts().to_frame()
df['ROADCOND'].value_counts().to_frame()
df['LIGHTCOND'].value_counts().to_frame()
df['INATTENTIONIND'].value_counts().to_frame()
df['UNDERINFL'].value_counts().to_frame()
# Creating Filtered Dataframe
#dfs = df.drop('SEVERITYCODE.1', axis=1, inplace=True)
df_Conditions = df.filter(['SEVERITYCODE.1', 'COLLISIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'INATTENTIONIND', 'UNDERINFL'], axis = 1)
df_Conditions.head()
# Replacing Y and N with 1 and 0 (assigning the result back avoids a chained in-place modification)
df_Conditions['UNDERINFL'] = df_Conditions['UNDERINFL'].replace({"N" : "0", "Y" : "1"})
df_Conditions.tail(10)
import matplotlib.pyplot as plt
df_Severity = df_Conditions["SEVERITYCODE.1"].value_counts().to_frame()
df_Severity
Next, I chose the columns to use from the dataframe I created: SEVERITYCODE (read from the SEVERITYCODE.1 column in the code), which assigns each crash a value of 1 for no injury or 2 for injury; COLLISIONTYPE, which describes the type of crash; WEATHER, which describes the weather at the time of the crash; ROADCOND, which describes the condition of the road at the time of the crash; LIGHTCOND, which describes the light conditions at the time of the crash; INATTENTIONIND, which indicates whether the driver was distracted; and UNDERINFL, which indicates whether the driver was under the influence.
ax = df_Severity.plot.bar(y = "SEVERITYCODE.1", rot = 0)
from sklearn.utils import resample
df_Conditions.rename(columns={'SEVERITYCODE.1': 'SEVCODE'}, inplace=True)
df_Conditions_maj = df_Conditions[df_Conditions.SEVCODE == 1]
df_Conditions_min = df_Conditions[df_Conditions.SEVCODE == 2]
df_Conditions_maj_dsample = resample(df_Conditions_maj,
                                     replace = False,
                                     n_samples = 58188,
                                     random_state = 123)
df_Conditions_balanced = pd.concat([df_Conditions_maj_dsample, df_Conditions_min])
df_Conditions_balanced.SEVCODE.value_counts()
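The downsampling step above hard-codes the sample size. If that number is meant to equal the count of injury rows, as the balancing suggests, it could instead be derived from the data. The sketch below shows the idea on invented rows and swaps in pandas' own `DataFrame.sample` in place of `sklearn.utils.resample`; the variable names are hypothetical.

```python
import pandas as pd

# Invented rows for illustration: 8 no-injury crashes and 3 injury crashes.
toy = pd.DataFrame({
    "SEVCODE": [1] * 8 + [2] * 3,
    "WEATHER": ["Clear"] * 6 + ["Raining"] * 5,
})

majority = toy[toy["SEVCODE"] == 1]
minority = toy[toy["SEVCODE"] == 2]

# Draw as many majority rows as there are minority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=123)

balanced = pd.concat([majority_down, minority])
print(balanced["SEVCODE"].value_counts())
```

Deriving the size from `len(minority)` keeps the classes balanced even if the underlying file is updated.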
# Types of collisions and how many resulted in injury
df_Conditions_balanced.groupby(['COLLISIONTYPE'])['SEVCODE'].value_counts(normalize = True)
# Weather conditions at the time of collision and how severe they are
df_Conditions_balanced.groupby(['WEATHER'])['SEVCODE'].value_counts(normalize = True)
# Road conditions at time of collision and how severe they are
df_Conditions_balanced.groupby(['ROADCOND'])['SEVCODE'].value_counts(normalize = True)
# light conditions at time of collision and how severe they are
df_Conditions_balanced.groupby(['LIGHTCOND'])['SEVCODE'].value_counts(normalize = True)
# Filling in the INATTENTIONIND column with 1s and 0s
# Missing entries are real NaN values, not the string "NaN", so they are handled by fillna;
# filling with the string "0" keeps the column's values the same type as "1".
df_Conditions_balanced['INATTENTIONIND'] = df_Conditions_balanced['INATTENTIONIND'].replace({"Y" : "1"}).fillna("0")
df_Conditions_balanced['INATTENTIONIND'].head(50)
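A detail worth keeping in mind when encoding a flag column like this one: pandas reads empty cells as real missing values, so a mapping keyed on the string "NaN" leaves them untouched, and filling with the integer 0 afterwards would mix integers with the string "1". The sketch below illustrates both points on an invented Series.

```python
import numpy as np
import pandas as pd

# Invented flag values for illustration; np.nan marks a missing entry.
flags = pd.Series(["Y", np.nan, "Y", np.nan])

# The string key "NaN" does not match real missing values, so they stay missing.
after_replace = flags.replace({"NaN": "0", "Y": "1"})
print(after_replace.isna().sum())  # 2

# Mapping "Y" and then filling with the string "0" keeps one consistent type.
encoded = flags.replace({"Y": "1"}).fillna("0")
print(encoded.tolist())  # ['1', '0', '1', '0']
```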
# Inattention at time of collision and how severe they are
df_Conditions_balanced.groupby(['INATTENTIONIND'])['SEVCODE'].value_counts(normalize = True)
# Under influence at time of collision and how severe they are
df_Conditions_balanced.groupby(['UNDERINFL'])['SEVCODE'].value_counts(normalize = True)
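The grouped `value_counts(normalize = True)` calls above return one long Series per attribute. As an alternative view, `pd.crosstab` with `normalize="index"` gives the same row-wise shares as a wide table with one column per severity code. This is a sketch on invented rows, not the notebook's actual output.

```python
import pandas as pd

# Invented rows for illustration; only the column names mirror the notebook.
toy = pd.DataFrame({
    "ROADCOND": ["Dry", "Dry", "Dry", "Wet", "Wet", "Ice"],
    "SEVCODE": [1, 1, 2, 2, 2, 1],
})

# Each row sums to 1: the share of severity codes within one road condition.
shares = pd.crosstab(toy["ROADCOND"], toy["SEVCODE"], normalize="index")
print(shares)
```

A table in this shape can also be passed straight to `.plot(kind = 'bar')` to draw the severity shares side by side.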
df_Conditions_balanced.head()
df_Conditions_balanced['COLLISIONTYPE'].value_counts().plot(kind = 'bar')
df_Conditions_balanced['WEATHER'].value_counts().plot(kind = 'bar')
df_Conditions_balanced['ROADCOND'].value_counts().plot(kind = 'bar')
df_Conditions_balanced['LIGHTCOND'].value_counts().plot(kind = 'bar')
df_Conditions_balanced['INATTENTIONIND'].value_counts().plot(kind = 'bar')
df_Conditions_balanced['UNDERINFL'].value_counts().plot(kind = 'bar')
Conclusion
As the bar graphs above show, most crashes happen in clear, dry, daylight conditions. Most of these crashes do not result in injury, and the majority involve sober drivers. More people drive on days that are clear, sunny, and bright, so it makes sense that more crashes occur under those conditions. Crashes that happen when a driver is under the influence or not paying attention appear to be more serious and are more likely to result in injury.
The results suggest that city officials should educate drivers to exercise caution even when driving in so-called 'ideal' conditions.