Data Science Capstone Project

Applied Data Science Capstone

This notebook will be used for the Applied Data Science Capstone Project.

In [3]:
#Import the pandas library as pd.
import pandas as pd
In [4]:
#Import the Numpy library as np.
import numpy as np
In [5]:
#Print the following statement: Hello Capstone Project Course!
print('Hello Capstone Project Course!')
Hello Capstone Project Course!

Introduction | Business Problem

The Seattle government aims to prevent avoidable car accidents by alerting drivers to be more careful in critical situations. In most cases, the main causes of accidents are inattention while driving, drug or alcohol use, and speeding, all of which can be reduced by enforcing regulations. Besides these, weather, visibility, and road conditions are major factors outside a driver's control; their impact can be reduced by revealing hidden patterns in the data and issuing warnings to the local government, police, and drivers on the affected roads. The target audience of the project is the local Seattle government, police, rescue groups, and, last but not least, car insurance companies. The model and its results are intended to give this audience advice for making informed decisions that reduce the number of accidents and injuries in the city.

Data

The data was collected by the Seattle Police Department and Accident Traffic Records Department from 2004 to present. The data consists of 37 independent variables and 194,673 rows. The dependent variable, “SEVERITYCODE”, contains numbers that correspond to different levels of accident severity. Additionally, COLLISIONTYPE describes the type of crash. Other attributes that I will consider, but that are not crucial, are WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL.

Methodology

I will use data visualization to look for patterns connecting WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL to the severity of car crashes. Afterwards, I will run a statistical analysis of which accidents are more likely to result in injury.
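
The notebook does not say which statistical test is meant. One common choice for checking whether a categorical factor and injury severity are related is the Pearson chi-square test of independence. The sketch below is only an illustration under that assumption: the function name `chi_square_statistic` and the counts in `toy_table` are made up and do not come from the collision data.

```python
# Minimal sketch: Pearson chi-square statistic for a table of counts, written by hand.
# The counts below are invented for illustration and are NOT from the dataset.

def chi_square_statistic(observed):
    """Return the Pearson chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Rows: factor absent / factor present; columns: no injury / injury (hypothetical).
toy_table = [[10, 20], [20, 10]]
chi2 = chi_square_statistic(toy_table)
print(round(chi2, 3))
```

For a 2x2 table the statistic has one degree of freedom, so a value above roughly 3.84 would be significant at the 5% level.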

Evaluation

SEVERITYCODE is either 1 (no injury) or 2 (injury). I want to create graphs of COLLISIONTYPE, WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL compared to their SEVERITYDESC.
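
One way to compare a categorical factor against severity is a row-normalized cross-tabulation, which can then be drawn as a bar chart. The following is a minimal sketch on a toy frame; the rows are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Toy records (hypothetical, not taken from Data-Collisions.csv).
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Clear", "Clear", "Raining", "Raining", "Raining"],
    "SEVERITYCODE": [1, 1, 2, 1, 2, 2],
})

# Share of each severity code within each weather category (each row sums to 1).
severity_share = pd.crosstab(toy["WEATHER"], toy["SEVERITYCODE"], normalize="index")
print(severity_share)
```

Calling `severity_share.plot(kind="bar")` on the result would draw one group of bars per weather category.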

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
In [8]:
# Grab CSV File
!wget -O mydata.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
--2020-09-25 20:12:44--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘mydata.csv’

100%[======================================>] 73,917,638  44.7MB/s   in 1.6s   

2020-09-25 20:12:45 (44.7 MB/s) - ‘mydata.csv’ saved [73917638/73917638]

In [9]:
# Load CSV into a DataFrame; index_col=0 makes the first CSV column (SEVERITYCODE) the index
df = pd.read_csv('mydata.csv', index_col=0)
df.head(10)
Out[9]:
X Y OBJECTID INCKEY COLDETKEY REPORTNO STATUS ADDRTYPE INTKEY LOCATION ... ROADCOND LIGHTCOND PEDROWNOTGRNT SDOTCOLNUM SPEEDING ST_COLCODE ST_COLDESC SEGLANEKEY CROSSWALKKEY HITPARKEDCAR
SEVERITYCODE
2 -122.323148 47.703140 1 1307 1307 3502005 Matched Intersection 37475.0 5TH AVE NE AND NE 103RD ST ... Wet Daylight NaN NaN NaN 10 Entering at angle 0 0 N
1 -122.347294 47.647172 2 52200 52200 2607959 Matched Block NaN AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N ... Wet Dark - Street Lights On NaN 6354039.0 NaN 11 From same direction - both going straight - bo... 0 0 N
1 -122.334540 47.607871 3 26700 26700 1482393 Matched Block NaN 4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST ... Dry Daylight NaN 4323031.0 NaN 32 One parked--one moving 0 0 N
1 -122.334803 47.604803 4 1144 1144 3503937 Matched Block NaN 2ND AVE BETWEEN MARION ST AND MADISON ST ... Dry Daylight NaN NaN NaN 23 From same direction - all others 0 0 N
2 -122.306426 47.545739 5 17700 17700 1807429 Matched Intersection 34387.0 SWIFT AVE S AND SWIFT AV OFF RP ... Wet Daylight NaN 4028032.0 NaN 10 Entering at angle 0 0 N
1 -122.387598 47.690575 6 320840 322340 E919477 Matched Intersection 36974.0 24TH AVE NW AND NW 85TH ST ... Dry Daylight NaN NaN NaN 10 Entering at angle 0 0 N
1 -122.338485 47.618534 7 83300 83300 3282542 Matched Intersection 29510.0 DENNY WAY AND WESTLAKE AVE ... Wet Daylight NaN 8344002.0 NaN 10 Entering at angle 0 0 N
2 -122.320780 47.614076 9 330897 332397 EA30304 Matched Intersection 29745.0 BROADWAY AND E PIKE ST ... Dry Daylight NaN NaN NaN 5 Vehicle Strikes Pedalcyclist 6855 0 N
1 -122.335930 47.611904 10 63400 63400 2071243 Matched Block NaN PINE ST BETWEEN 5TH AVE AND 6TH AVE ... Dry Daylight NaN 6166014.0 NaN 32 One parked--one moving 0 0 N
2 -122.384700 47.528475 12 58600 58600 2072105 Matched Intersection 34679.0 41ST AVE SW AND SW THISTLE ST ... Dry Daylight NaN 6079001.0 NaN 10 Entering at angle 0 0 N

10 rows × 37 columns

In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 194673 entries, 2 to 1
Data columns (total 37 columns):
X                 189339 non-null float64
Y                 189339 non-null float64
OBJECTID          194673 non-null int64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
SDOT_COLDESC      194673 non-null object
INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     4667 non-null object
SDOTCOLNUM        114936 non-null float64
SPEEDING          9333 non-null object
ST_COLCODE        194655 non-null object
ST_COLDESC        189769 non-null object
SEGLANEKEY        194673 non-null int64
CROSSWALKKEY      194673 non-null int64
HITPARKEDCAR      194673 non-null object
dtypes: float64(4), int64(11), object(22)
memory usage: 56.4+ MB
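
The non-null counts above show that some columns are mostly empty; INATTENTIONIND, for example, is filled in only 29,805 of 194,673 rows. A quick way to rank columns by their share of missing values is sketched below on a toy frame with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking a flag column that is only filled when "Y".
toy = pd.DataFrame({
    "INATTENTIONIND": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Clear"],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```
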
In [11]:
df["SEVERITYDESC"].value_counts().to_frame()
Out[11]:
SEVERITYDESC
Property Damage Only Collision 136485
Injury Collision 58188
In [12]:
df['COLLISIONTYPE'].value_counts().to_frame()
Out[12]:
COLLISIONTYPE
Parked Car 47987
Angles 34674
Rear Ended 34090
Other 23703
Sideswipe 18609
Left Turn 13703
Pedestrian 6608
Cycles 5415
Right Turn 2956
Head On 2024
In [13]:
df['WEATHER'].value_counts().to_frame()
Out[13]:
WEATHER
Clear 111135
Raining 33145
Overcast 27714
Unknown 15091
Snowing 907
Other 832
Fog/Smog/Smoke 569
Sleet/Hail/Freezing Rain 113
Blowing Sand/Dirt 56
Severe Crosswind 25
Partly Cloudy 5
In [14]:
df['ROADCOND'].value_counts().to_frame()
Out[14]:
ROADCOND
Dry 124510
Wet 47474
Unknown 15078
Ice 1209
Snow/Slush 1004
Other 132
Standing Water 115
Sand/Mud/Dirt 75
Oil 64
In [15]:
df['LIGHTCOND'].value_counts().to_frame()
Out[15]:
LIGHTCOND
Daylight 116137
Dark - Street Lights On 48507
Unknown 13473
Dusk 5902
Dawn 2502
Dark - No Street Lights 1537
Dark - Street Lights Off 1199
Other 235
Dark - Unknown Lighting 11
In [16]:
df['INATTENTIONIND'].value_counts().to_frame()
Out[16]:
INATTENTIONIND
Y 29805
In [17]:
df['UNDERINFL'].value_counts().to_frame()
Out[17]:
UNDERINFL
N 100274
0 80394
Y 5126
1 3995
In [18]:
# Creating Filtered Dataframe
#dfs = df.drop('SEVERITYCODE.1', axis=1, inplace=True)
df_Conditions = df.filter(['SEVERITYCODE.1', 'COLLISIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'INATTENTIONIND', 'UNDERINFL'], axis = 1) 
df_Conditions.head()
Out[18]:
SEVERITYCODE.1 COLLISIONTYPE WEATHER ROADCOND LIGHTCOND INATTENTIONIND UNDERINFL
SEVERITYCODE
2 2 Angles Overcast Wet Daylight NaN N
1 1 Sideswipe Raining Wet Dark - Street Lights On NaN 0
1 1 Parked Car Overcast Dry Daylight NaN 0
1 1 Other Clear Dry Daylight NaN N
2 2 Angles Raining Wet Daylight NaN 0
In [19]:
# Replacing Y and N with "1" and "0" so UNDERINFL uses a single encoding
df_Conditions['UNDERINFL'] = df_Conditions['UNDERINFL'].replace({"N" : "0", "Y" : "1"})
df_Conditions.tail(10)
Out[19]:
SEVERITYCODE.1 COLLISIONTYPE WEATHER ROADCOND LIGHTCOND INATTENTIONIND UNDERINFL
SEVERITYCODE
2 2 Angles Raining Wet Daylight Y 0
1 1 Angles Clear Dry Daylight NaN 0
1 1 Angles Clear Dry Daylight NaN 0
2 2 Angles Clear Wet Daylight NaN 0
1 1 Other Raining Wet Dark - Street Lights On NaN 1
2 2 Head On Clear Dry Daylight NaN 0
1 1 Rear Ended Raining Wet Daylight Y 0
2 2 Left Turn Clear Dry Daylight NaN 0
2 2 Cycles Clear Dry Dusk NaN 0
1 1 Rear Ended Clear Wet Daylight NaN 0
In [20]:
import matplotlib.pyplot as plt
In [21]:
df_Severity = df_Conditions["SEVERITYCODE.1"].value_counts().to_frame()
df_Severity
Out[21]:
SEVERITYCODE.1
1 136485
2 58188

Next, I chose the columns to use from the dataframe that I created. The columns that I chose were SEVERITYCODE (stored in the SEVERITYCODE.1 column), which is 1 for no injury and 2 for injury; COLLISIONTYPE, the type of crash; WEATHER, the weather at the time of the crash; ROADCOND, the condition of the road at the time of the crash; LIGHTCOND, the light conditions at the time of the crash; INATTENTIONIND, whether the driver was distracted; and UNDERINFL, whether the driver was under the influence.

In [22]:
ax = df_Severity.plot.bar(y = "SEVERITYCODE.1", rot = 0)
In [23]:
from sklearn.utils import resample
In [24]:
df_Conditions.rename(columns={'SEVERITYCODE.1': 'SEVCODE'}, inplace=True)
df_Conditions_maj = df_Conditions[df_Conditions.SEVCODE == 1]
df_Conditions_min = df_Conditions[df_Conditions.SEVCODE == 2]

df_Conditions_maj_dsample = resample(df_Conditions_maj,
                                    replace = False,
                                    n_samples = 58188,
                                    random_state = 123)

df_Conditions_balanced = pd.concat([df_Conditions_maj_dsample, df_Conditions_min])

df_Conditions_balanced.SEVCODE.value_counts()
Out[24]:
2    58188
1    58188
Name: SEVCODE, dtype: int64
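
The cell above balances the classes by downsampling the majority class to a hard-coded 58,188 rows. As an alternative sketch, the target size can be derived from the data and every class sampled in one step with `groupby(...).sample(...)`, which assumes pandas 1.1 or newer. The toy frame and its `row_id` column are hypothetical.

```python
import pandas as pd

# Toy imbalanced frame (hypothetical): six rows of class 1, two rows of class 2.
toy = pd.DataFrame({"SEVCODE": [1, 1, 1, 1, 1, 1, 2, 2], "row_id": range(8)})

# Size of the smallest class; every class is sampled down to this size.
n_min = int(toy["SEVCODE"].value_counts().min())

# Draw n_min rows without replacement from each class.
balanced = toy.groupby("SEVCODE").sample(n=n_min, random_state=123)
print(balanced["SEVCODE"].value_counts())
```
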
In [25]:
# Types of collisions and how many resulted in injury
df_Conditions_balanced.groupby(['COLLISIONTYPE'])['SEVCODE'].value_counts(normalize = True) 
Out[25]:
COLLISIONTYPE  SEVCODE
Angles         2          0.603954
               1          0.396046
Cycles         2          0.947095
               1          0.052905
Head On        2          0.631427
               1          0.368573
Left Turn      2          0.605460
               1          0.394540
Other          1          0.552824
               2          0.447176
Parked Car     1          0.878829
               2          0.121171
Pedestrian     2          0.952962
               1          0.047038
Rear Ended     2          0.640543
               1          0.359457
Right Turn     1          0.621269
               2          0.378731
Sideswipe      1          0.733489
               2          0.266511
Name: SEVCODE, dtype: float64
In [26]:
# Weather conditions at the time of collision and how severe they are
df_Conditions_balanced.groupby(['WEATHER'])['SEVCODE'].value_counts(normalize = True) 
Out[26]:
WEATHER                   SEVCODE
Blowing Sand/Dirt         1          0.500000
                          2          0.500000
Clear                     2          0.527478
                          1          0.472522
Fog/Smog/Smoke            2          0.526761
                          1          0.473239
Other                     1          0.714286
                          2          0.285714
Overcast                  2          0.519484
                          1          0.480516
Partly Cloudy             2          0.750000
                          1          0.250000
Raining                   2          0.542946
                          1          0.457054
Severe Crosswind          2          0.538462
                          1          0.461538
Sleet/Hail/Freezing Rain  1          0.555556
                          2          0.444444
Snowing                   1          0.639241
                          2          0.360759
Unknown                   1          0.880893
                          2          0.119107
Name: SEVCODE, dtype: float64
In [27]:
# Road conditions at time of collision and how severe they are
df_Conditions_balanced.groupby(['ROADCOND'])['SEVCODE'].value_counts(normalize = True) 
Out[27]:
ROADCOND        SEVCODE
Dry             2          0.527158
                1          0.472842
Ice             1          0.597345
                2          0.402655
Oil             2          0.615385
                1          0.384615
Other           2          0.518072
                1          0.481928
Sand/Mud/Dirt   1          0.510638
                2          0.489362
Snow/Slush      1          0.683712
                2          0.316288
Standing Water  2          0.526316
                1          0.473684
Unknown         1          0.889950
                2          0.110050
Wet             2          0.536359
                1          0.463641
Name: SEVCODE, dtype: float64
In [28]:
# light conditions at time of collision and how severe they are
df_Conditions_balanced.groupby(['LIGHTCOND'])['SEVCODE'].value_counts(normalize = True) 
Out[28]:
LIGHTCOND                 SEVCODE
Dark - No Street Lights   1          0.614319
                          2          0.385681
Dark - Street Lights Off  1          0.550498
                          2          0.449502
Dark - Street Lights On   1          0.503141
                          2          0.496859
Dark - Unknown Lighting   2          0.571429
                          1          0.428571
Dawn                      2          0.536808
                          1          0.463192
Daylight                  2          0.539054
                          1          0.460946
Dusk                      2          0.541203
                          1          0.458797
Other                     1          0.590551
                          2          0.409449
Unknown                   1          0.900198
                          2          0.099802
Name: SEVCODE, dtype: float64
In [29]:
# Filling in the INATTENTIONIND column with "1" (was "Y") and "0" (was missing)
df_Conditions_balanced['INATTENTIONIND'] = df_Conditions_balanced['INATTENTIONIND'].replace({"Y" : "1"}).fillna("0")
df_Conditions_balanced['INATTENTIONIND'].head(50)
Out[29]:
SEVERITYCODE
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    1
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    1
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    1
1    0
1    0
1    0
1    1
1    0
Name: INATTENTIONIND, dtype: object
In [30]:
# Inattention at time of collision and how severe they are
df_Conditions_balanced.groupby(['INATTENTIONIND'])['SEVCODE'].value_counts(normalize = True) 
Out[30]:
INATTENTIONIND  SEVCODE
0               1          0.510924
                2          0.489076
1               2          0.557211
                1          0.442789
Name: SEVCODE, dtype: float64
In [31]:
# Under influence at time of collision and how severe they are
df_Conditions_balanced.groupby(['UNDERINFL'])['SEVCODE'].value_counts(normalize = True) 
Out[31]:
UNDERINFL  SEVCODE
0          1          0.502326
           2          0.497674
1          2          0.594856
           1          0.405144
Name: SEVCODE, dtype: float64
In [32]:
df_Conditions_balanced.head()
Out[32]:
SEVCODE COLLISIONTYPE WEATHER ROADCOND LIGHTCOND INATTENTIONIND UNDERINFL
SEVERITYCODE
1 1 Angles Raining Wet Dark - Street Lights On 0 0
1 1 Angles Clear Dry Daylight 0 0
1 1 Angles Unknown Unknown Unknown 0 0
1 1 Sideswipe Clear Dry Daylight 0 0
1 1 Head On Clear Dry Daylight 0 0
In [33]:
df_Conditions_balanced['COLLISIONTYPE'].value_counts().plot(kind = 'bar')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f238a5c0198>
In [34]:
df_Conditions_balanced['WEATHER'].value_counts().plot(kind = 'bar')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f238a657240>
In [35]:
df_Conditions_balanced['ROADCOND'].value_counts().plot(kind = 'bar')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f238a5335c0>
In [36]:
df_Conditions_balanced['LIGHTCOND'].value_counts().plot(kind = 'bar')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f238a40d2b0>
In [37]:
df_Conditions_balanced['INATTENTIONIND'].value_counts().plot(kind = 'bar')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f238a38dfd0>
In [38]:
df_Conditions_balanced['UNDERINFL'].value_counts().plot(kind = 'bar')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f238a363080>

Conclusion

As the above bar graphs show, most crashes happen in clear, dry, and daylight conditions. Across the full dataset, most crashes do not result in injury (136,485 property-damage-only collisions versus 58,188 injury collisions), and the majority involve sober drivers. Presumably more people drive on days that are clear, sunny, and bright, so it makes sense that there are more car crashes under those conditions. Crashes that happen when a driver is under the influence or not paying attention appear to be more serious and are more likely to result in injury: in the balanced sample, about 59% of under-the-influence crashes and 56% of inattention crashes involved injury, compared with about 50% and 49% otherwise.

The results suggest that city officials should educate drivers to exercise caution even when driving in so-called 'ideal' conditions.
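
To put a number on how much more often flagged crashes involve injury, the injury shares printed in Out[30] and Out[31] can be compared directly. The sketch below copies those four values by hand. Because they come from the balanced sample, the gaps and ratios describe that sample and are not citywide injury rates.

```python
# Injury shares (SEVCODE == 2) copied from Out[30] and Out[31] above.
injury_share = {
    "INATTENTIONIND": {"0": 0.489076, "1": 0.557211},
    "UNDERINFL": {"0": 0.497674, "1": 0.594856},
}

ratios = {}
for factor, share in injury_share.items():
    difference = share["1"] - share["0"]  # gap in injury share, as a fraction
    ratio = share["1"] / share["0"]       # injury share with the flag relative to without
    ratios[factor] = ratio
    print(f"{factor}: +{difference * 100:.1f} points, ratio {ratio:.2f}")
```
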
