This notebook will be used for the:

Applied Data Science Capstone Project

#Import the pandas library as pd.
import pandas as pd

#Import the Numpy library as np.
import numpy as np

#Print the following statement: Hello Capstone Project Course!
print('Hello Capstone Project Course!')

Hello Capstone Project Course!

Introduction | Business Problem

The Seattle government aims to prevent avoidable car accidents by alerting drivers to be more careful in critical situations. In most cases, inattention while driving, drug or alcohol use, and speeding are the main causes of accidents, and these can be reduced by enforcing regulations. Beyond these, weather, visibility, and road conditions are major factors outside a driver's control; their effects can be reduced by revealing hidden patterns in the data and issuing warnings to the local government, police, and drivers on the affected roads. The target audience of the project is the local Seattle government, police, rescue groups, and, last but not least, car insurance companies. The model and its results are intended to give this audience guidance for making informed decisions that reduce the number of accidents and injuries in the city.

Data

The data was collected by the Seattle Police Department and Accident Traffic Records Department from 2004 to present. The data consists of 37 independent variables and 194,673 rows. The dependent variable, SEVERITYCODE, contains numbers that correspond to different levels of accident severity. Additionally, COLLISIONTYPE describes the type of crash. Other attributes that I will consider, though they are not crucial, are WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL.

Methodology

I will look for patterns in the accidents using data visualization, to see how WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL relate to the severity of car crashes. Afterwards, I will run a statistical analysis of which accidents are more likely to result in injury.

Evaluation

SEVERITYCODE is either 1 (no injury) or 2 (injury). I want to create graphs of COLLISIONTYPE, WEATHER, ROADCOND, LIGHTCOND, INATTENTIONIND, and UNDERINFL compared against SEVERITYDESC.
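One way to compare a factor against severity is a row-normalized cross-tabulation, which can then be drawn as a stacked bar chart. The sketch below is only an illustration: the `sample` frame holds made-up rows, and `severity_share` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Tiny made-up sample; the column names mirror the dataset, the values do not.
sample = pd.DataFrame({
    "WEATHER": ["Clear", "Clear", "Raining", "Raining", "Clear", "Raining"],
    "SEVERITYDESC": [
        "Injury Collision",
        "Property Damage Only Collision",
        "Injury Collision",
        "Injury Collision",
        "Property Damage Only Collision",
        "Property Damage Only Collision",
    ],
})

def severity_share(frame, factor, target="SEVERITYDESC"):
    """Return the share of each severity level within every value of `factor`."""
    # normalize="index" makes each row sum to 1, so categories stay comparable
    # even when one has far more collisions than another.
    return pd.crosstab(frame[factor], frame[target], normalize="index")

shares = severity_share(sample, "WEATHER")
print(shares.round(3))
```

Calling `shares.plot(kind="bar", stacked=True)` on the result would likely give the kind of graph described above, with one bar per category split by severity.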

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import warnings
warnings.filterwarnings('ignore')

# Grab CSV File
!wget -O mydata.csv https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv

--2020-09-25 20:12:44--  https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv
Resolving s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)... 67.228.254.196
Connecting to s3.us.cloud-object-storage.appdomain.cloud (s3.us.cloud-object-storage.appdomain.cloud)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73917638 (70M) [text/csv]
Saving to: ‘mydata.csv’

100%[======================================>] 73,917,638  44.7MB/s   in 1.6s   

2020-09-25 20:12:45 (44.7 MB/s) - ‘mydata.csv’ saved [73917638/73917638]

# Load CSV to dataframe
df = pd.read_csv('mydata.csv', index_col=0)
df.head(10)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194673 entries, 2 to 1
Data columns (total 37 columns):
X                 189339 non-null float64
Y                 189339 non-null float64
OBJECTID          194673 non-null int64
INCKEY            194673 non-null int64
COLDETKEY         194673 non-null int64
REPORTNO          194673 non-null object
STATUS            194673 non-null object
ADDRTYPE          192747 non-null object
INTKEY            65070 non-null float64
LOCATION          191996 non-null object
EXCEPTRSNCODE     84811 non-null object
EXCEPTRSNDESC     5638 non-null object
SEVERITYCODE.1    194673 non-null int64
SEVERITYDESC      194673 non-null object
COLLISIONTYPE     189769 non-null object
PERSONCOUNT       194673 non-null int64
PEDCOUNT          194673 non-null int64
PEDCYLCOUNT       194673 non-null int64
VEHCOUNT          194673 non-null int64
INCDATE           194673 non-null object
INCDTTM           194673 non-null object
JUNCTIONTYPE      188344 non-null object
SDOT_COLCODE      194673 non-null int64
SDOT_COLDESC      194673 non-null object
INATTENTIONIND    29805 non-null object
UNDERINFL         189789 non-null object
WEATHER           189592 non-null object
ROADCOND          189661 non-null object
LIGHTCOND         189503 non-null object
PEDROWNOTGRNT     4667 non-null object
SDOTCOLNUM        114936 non-null float64
SPEEDING          9333 non-null object
ST_COLCODE        194655 non-null object
ST_COLDESC        189769 non-null object
SEGLANEKEY        194673 non-null int64
CROSSWALKKEY      194673 non-null int64
HITPARKEDCAR      194673 non-null object
dtypes: float64(4), int64(11), object(22)
memory usage: 56.4+ MB
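The non-null counts above show that some columns are mostly empty; INATTENTIONIND, for example, has 29805 non-null values out of 194673 rows, so roughly 85% of it is missing. A small sketch for ranking columns by their missing share follows. The `demo` frame is made up and `missing_share` is a hypothetical helper name.

```python
import pandas as pd

# Made-up frame with the same kind of gaps seen in the info() listing.
demo = pd.DataFrame({
    "WEATHER": ["Clear", None, "Raining", "Clear"],
    "INATTENTIONIND": ["Y", None, None, None],
    "PERSONCOUNT": [2, 3, 1, 2],
})

def missing_share(frame):
    """Fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

report = missing_share(demo)
print(report)
```

On the real data the same call would be `missing_share(df)`.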

df["SEVERITYDESC"].value_counts().to_frame()

df['COLLISIONTYPE'].value_counts().to_frame()

df['WEATHER'].value_counts().to_frame()

df['ROADCOND'].value_counts().to_frame()

df['LIGHTCOND'].value_counts().to_frame()

df['INATTENTIONIND'].value_counts().to_frame()

df['UNDERINFL'].value_counts().to_frame()

# Creating Filtered Dataframe
#dfs = df.drop('SEVERITYCODE.1', axis=1, inplace=True)
df_Conditions = df.filter(['SEVERITYCODE.1', 'COLLISIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'INATTENTIONIND', 'UNDERINFL'], axis = 1) 
df_Conditions.head()

# Replacing Y and N with 1 and 0 (assigning the result back instead of using a chained inplace call)
df_Conditions['UNDERINFL'] = df_Conditions['UNDERINFL'].replace({"N" : "0", "Y" : "1"})
df_Conditions.tail(10)

import matplotlib.pyplot as plt

df_Severity = df_Conditions["SEVERITYCODE.1"].value_counts().to_frame()
df_Severity

Next, I chose the columns to use from the dataframe I had created: SEVERITYCODE, which is 1 for a crash with no injury and 2 for a crash with injury; COLLISIONTYPE, the type of crash; WEATHER, the weather at the time of the crash; ROADCOND, the condition of the road at the time of the crash; LIGHTCOND, the light conditions at the time of the crash; INATTENTIONIND, whether the driver was distracted; and UNDERINFL, whether the driver was under the influence.

ax = df_Severity.plot.bar(y = "SEVERITYCODE.1", rot = 0)

from sklearn.utils import resample

df_Conditions.rename(columns={'SEVERITYCODE.1': 'SEVCODE'}, inplace=True)
df_Conditions_maj = df_Conditions[df_Conditions.SEVCODE == 1]
df_Conditions_min = df_Conditions[df_Conditions.SEVCODE == 2]

df_Conditions_maj_dsample = resample(df_Conditions_maj,
                                    replace = False,
                                    n_samples = 58188,
                                    random_state = 123)

df_Conditions_balanced = pd.concat([df_Conditions_maj_dsample, df_Conditions_min])

df_Conditions_balanced.SEVCODE.value_counts()

2    58188
1    58188
Name: SEVCODE, dtype: int64
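The downsampling above hard-codes 58188 as the sample size. As a sketch of an alternative, the size could be read from the data so the step still works if the file changes. This version uses pandas' `DataFrame.sample` rather than `sklearn.utils.resample`; the `toy` frame is made up and `downsample_majority` is a hypothetical helper name.

```python
import pandas as pd

# Made-up imbalanced frame: six rows of class 1, two rows of class 2.
toy = pd.DataFrame({
    "SEVCODE": [1, 1, 1, 1, 1, 1, 2, 2],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast",
                "Clear", "Raining", "Clear", "Raining"],
})

def downsample_majority(frame, target="SEVCODE", random_state=123):
    """Sample every class down to the size of the smallest class."""
    smallest = frame[target].value_counts().min()
    parts = [
        group.sample(n=smallest, replace=False, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts)

balanced = downsample_majority(toy)
print(balanced["SEVCODE"].value_counts())
```

Sampling without replacement keeps every minority row exactly once and discards the surplus majority rows, which matches the intent of the cell above.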

# Types of collisions and how many resulted in injury
df_Conditions_balanced.groupby(['COLLISIONTYPE'])['SEVCODE'].value_counts(normalize = True)

COLLISIONTYPE  SEVCODE
Angles         2          0.603954
               1          0.396046
Cycles         2          0.947095
               1          0.052905
Head On        2          0.631427
               1          0.368573
Left Turn      2          0.605460
               1          0.394540
Other          1          0.552824
               2          0.447176
Parked Car     1          0.878829
               2          0.121171
Pedestrian     2          0.952962
               1          0.047038
Rear Ended     2          0.640543
               1          0.359457
Right Turn     1          0.621269
               2          0.378731
Sideswipe      1          0.733489
               2          0.266511
Name: SEVCODE, dtype: float64

# Weather conditions at the time of collision and how severe they are
df_Conditions_balanced.groupby(['WEATHER'])['SEVCODE'].value_counts(normalize = True)

WEATHER                   SEVCODE
Blowing Sand/Dirt         1          0.500000
                          2          0.500000
Clear                     2          0.527478
                          1          0.472522
Fog/Smog/Smoke            2          0.526761
                          1          0.473239
Other                     1          0.714286
                          2          0.285714
Overcast                  2          0.519484
                          1          0.480516
Partly Cloudy             2          0.750000
                          1          0.250000
Raining                   2          0.542946
                          1          0.457054
Severe Crosswind          2          0.538462
                          1          0.461538
Sleet/Hail/Freezing Rain  1          0.555556
                          2          0.444444
Snowing                   1          0.639241
                          2          0.360759
Unknown                   1          0.880893
                          2          0.119107
Name: SEVCODE, dtype: float64
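Some of the proportions above may rest on very few rows; an exact 0.500000 / 0.500000 or 0.750000 / 0.250000 split is what a handful of collisions would produce. A sketch that reports the row count next to each injury share is shown below. The data is made up, and both `injury_rate_with_counts` and the `min_rows` threshold are illustrative assumptions.

```python
import pandas as pd

# Made-up rows; "Partly Cloudy" is deliberately rare.
toy = pd.DataFrame({
    "WEATHER": ["Clear"] * 6 + ["Raining"] * 4 + ["Partly Cloudy"] * 2,
    "SEVCODE": [1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2],
})

def injury_rate_with_counts(frame, factor, target="SEVCODE", min_rows=3):
    """Injury share (SEVCODE == 2) per category, with the number of rows behind it."""
    summary = (
        frame.assign(injury=frame[target].eq(2))
        .groupby(factor)["injury"]
        .agg(rows="size", injury_rate="mean")
    )
    # Flag categories with too few rows to read much into the rate.
    summary["enough_rows"] = summary["rows"] >= min_rows
    return summary.sort_values("rows", ascending=False)

summary = injury_rate_with_counts(toy, "WEATHER")
print(summary)
```

Reading the `rows` column first makes it easier to tell a real difference between categories from noise in a rare one.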

# Road conditions at time of collision and how severe they are
df_Conditions_balanced.groupby(['ROADCOND'])['SEVCODE'].value_counts(normalize = True)

ROADCOND        SEVCODE
Dry             2          0.527158
                1          0.472842
Ice             1          0.597345
                2          0.402655
Oil             2          0.615385
                1          0.384615
Other           2          0.518072
                1          0.481928
Sand/Mud/Dirt   1          0.510638
                2          0.489362
Snow/Slush      1          0.683712
                2          0.316288
Standing Water  2          0.526316
                1          0.473684
Unknown         1          0.889950
                2          0.110050
Wet             2          0.536359
                1          0.463641
Name: SEVCODE, dtype: float64

# light conditions at time of collision and how severe they are
df_Conditions_balanced.groupby(['LIGHTCOND'])['SEVCODE'].value_counts(normalize = True)

LIGHTCOND                 SEVCODE
Dark - No Street Lights   1          0.614319
                          2          0.385681
Dark - Street Lights Off  1          0.550498
                          2          0.449502
Dark - Street Lights On   1          0.503141
                          2          0.496859
Dark - Unknown Lighting   2          0.571429
                          1          0.428571
Dawn                      2          0.536808
                          1          0.463192
Daylight                  2          0.539054
                          1          0.460946
Dusk                      2          0.541203
                          1          0.458797
Other                     1          0.590551
                          2          0.409449
Unknown                   1          0.900198
                          2          0.099802
Name: SEVCODE, dtype: float64

# Filling in the INATTENTIONIND column with 1s and 0s
# Missing entries are real NaN values, not the string "NaN", so they are filled after mapping "Y"
df_Conditions_balanced['INATTENTIONIND'] = df_Conditions_balanced['INATTENTIONIND'].replace({"Y" : "1"}).fillna("0")
df_Conditions_balanced['INATTENTIONIND'].head(50)

SEVERITYCODE
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    1
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    1
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    0
1    1
1    0
1    0
1    0
1    1
1    0
Name: INATTENTIONIND, dtype: object
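A note on missing values in this column: pandas stores a missing entry as a floating-point NaN, which a dictionary key of "NaN" (a string) does not match. The sketch below, on a made-up series, shows the difference and one possible way to get an integer 0/1 column; it is an illustration, not the only way to encode the flag.

```python
import numpy as np
import pandas as pd

# Made-up column shaped like INATTENTIONIND: "Y" or missing.
flags = pd.Series(["Y", np.nan, np.nan, "Y", np.nan], name="INATTENTIONIND")

# The key "NaN" is a string, so it does not match the missing entries.
unchanged = flags.replace({"NaN": "0"})
print(unchanged.isna().sum())  # still 3 missing values

# Mapping "Y" to 1 and then filling the gaps gives a clean integer column.
encoded = flags.map({"Y": 1}).fillna(0).astype(int)
print(encoded.tolist())
```

Keeping the column as integers rather than a mix of strings and numbers avoids surprises when grouping or feeding it to a model.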

# Inattention at time of collision and how severe they are
df_Conditions_balanced.groupby(['INATTENTIONIND'])['SEVCODE'].value_counts(normalize = True)

INATTENTIONIND  SEVCODE
0               1          0.510924
                2          0.489076
1               2          0.557211
                1          0.442789
Name: SEVCODE, dtype: float64

# Under influence at time of collision and how severe they are
df_Conditions_balanced.groupby(['UNDERINFL'])['SEVCODE'].value_counts(normalize = True)

UNDERINFL  SEVCODE
0          1          0.502326
           2          0.497674
1          2          0.594856
           1          0.405144
Name: SEVCODE, dtype: float64
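To judge whether a gap like the one above is larger than chance would produce, one option is Pearson's chi-square test of independence on the table of counts. The sketch below computes the statistic by hand with NumPy, without a continuity correction; the counts are made up for illustration and `chi_square_statistic` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the row and column variables were independent.
    expected = row_totals @ col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof

# Made-up counts: rows are UNDERINFL (0, 1), columns are SEVCODE (1, 2).
counts = pd.DataFrame([[50, 50], [20, 30]], index=[0, 1], columns=[1, 2])
stat, dof = chi_square_statistic(counts)
print(round(stat, 3), dof)
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05. On the real data the table could be built with `pd.crosstab(df_Conditions_balanced['UNDERINFL'], df_Conditions_balanced['SEVCODE'])`.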

df_Conditions_balanced.head()
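The imports at the top of the notebook include `train_test_split`, `KNeighborsClassifier`, `accuracy_score`, and `confusion_matrix`, which are not used in the cells shown. Below is a minimal sketch of how they might be applied to a frame shaped like `df_Conditions_balanced`. The data is randomly generated, so the accuracy it prints says nothing about the real collisions, and the choice of k = 5 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up balanced sample; only the column names follow the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "WEATHER": rng.choice(["Clear", "Raining", "Overcast"], size=n),
    "ROADCOND": rng.choice(["Dry", "Wet"], size=n),
    "UNDERINFL": rng.choice(["0", "1"], size=n),
    "SEVCODE": np.repeat([1, 2], n // 2),
})

# One-hot encode the categorical predictors so distances are defined.
X = pd.get_dummies(toy.drop(columns="SEVCODE"), dtype=float)
y = toy["SEVCODE"]

# Hold out a quarter of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

# k = 5 is an arbitrary starting point, not a tuned value.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions, labels=[1, 2])
print(accuracy)
print(matrix)
```

Because the classes were balanced beforehand, an accuracy near 0.5 would mean the predictors carry little information, which is the expected result on random data like this.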

df_Conditions_balanced['COLLISIONTYPE'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f238a5c0198>

df_Conditions_balanced['WEATHER'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f238a657240>

df_Conditions_balanced['ROADCOND'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f238a5335c0>

df_Conditions_balanced['LIGHTCOND'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f238a40d2b0>

df_Conditions_balanced['INATTENTIONIND'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f238a38dfd0>

df_Conditions_balanced['UNDERINFL'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f238a363080>

Conclusion

As the above bar graphs show, most crashes happen in clear, dry, and daylight conditions. Most of these crashes do not result in injury, and the majority involve drivers who are sober. More people drive on days that are clear, sunny, and bright, so it makes sense that there are more car crashes under those conditions. Crashes that happen when a driver is under the influence or not paying attention appear to be more serious and are more likely to result in injury: in the balanced sample, about 59% and 56% of those crashes involved injury, compared with about 50% and 49% otherwise.
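The size of that effect can be read off the groupby outputs earlier in the notebook. The short sketch below copies the injury shares from those outputs and computes the absolute gap and the ratio for each flag; `compare` is a hypothetical helper name. Note that these shares come from the balanced sample, so they are not city-wide injury rates.

```python
# Injury shares (SEVCODE == 2) copied from the groupby outputs earlier in the notebook.
injury_share = {
    "INATTENTIONIND": {"flag_set": 0.557211, "flag_not_set": 0.489076},
    "UNDERINFL": {"flag_set": 0.594856, "flag_not_set": 0.497674},
}

def compare(shares):
    """Return the absolute gap and the ratio between the two injury shares."""
    gap = shares["flag_set"] - shares["flag_not_set"]
    ratio = shares["flag_set"] / shares["flag_not_set"]
    return gap, ratio

for factor, shares in injury_share.items():
    gap, ratio = compare(shares)
    print(f"{factor}: gap = {gap:.3f}, ratio = {ratio:.2f}")
```

This gives a gap of about 0.068 (ratio 1.14) for inattention and about 0.097 (ratio 1.20) for driving under the influence.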

The results suggest that city officials should educate drivers to exercise caution when driving in so-called 'ideal' conditions.

	X	Y	OBJECTID	INCKEY	COLDETKEY	REPORTNO	STATUS	ADDRTYPE	INTKEY	LOCATION	...	ROADCOND	LIGHTCOND	PEDROWNOTGRNT	SDOTCOLNUM	SPEEDING	ST_COLCODE	ST_COLDESC	SEGLANEKEY	CROSSWALKKEY	HITPARKEDCAR
SEVERITYCODE
2	-122.323148	47.703140	1	1307	1307	3502005	Matched	Intersection	37475.0	5TH AVE NE AND NE 103RD ST	...	Wet	Daylight	NaN	NaN	NaN	10	Entering at angle	0	0	N
1	-122.347294	47.647172	2	52200	52200	2607959	Matched	Block	NaN	AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N	...	Wet	Dark - Street Lights On	NaN	6354039.0	NaN	11	From same direction - both going straight - bo...	0	0	N
1	-122.334540	47.607871	3	26700	26700	1482393	Matched	Block	NaN	4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST	...	Dry	Daylight	NaN	4323031.0	NaN	32	One parked--one moving	0	0	N
1	-122.334803	47.604803	4	1144	1144	3503937	Matched	Block	NaN	2ND AVE BETWEEN MARION ST AND MADISON ST	...	Dry	Daylight	NaN	NaN	NaN	23	From same direction - all others	0	0	N
2	-122.306426	47.545739	5	17700	17700	1807429	Matched	Intersection	34387.0	SWIFT AVE S AND SWIFT AV OFF RP	...	Wet	Daylight	NaN	4028032.0	NaN	10	Entering at angle	0	0	N
1	-122.387598	47.690575	6	320840	322340	E919477	Matched	Intersection	36974.0	24TH AVE NW AND NW 85TH ST	...	Dry	Daylight	NaN	NaN	NaN	10	Entering at angle	0	0	N
1	-122.338485	47.618534	7	83300	83300	3282542	Matched	Intersection	29510.0	DENNY WAY AND WESTLAKE AVE	...	Wet	Daylight	NaN	8344002.0	NaN	10	Entering at angle	0	0	N
2	-122.320780	47.614076	9	330897	332397	EA30304	Matched	Intersection	29745.0	BROADWAY AND E PIKE ST	...	Dry	Daylight	NaN	NaN	NaN	5	Vehicle Strikes Pedalcyclist	6855	0	N
1	-122.335930	47.611904	10	63400	63400	2071243	Matched	Block	NaN	PINE ST BETWEEN 5TH AVE AND 6TH AVE	...	Dry	Daylight	NaN	6166014.0	NaN	32	One parked--one moving	0	0	N
2	-122.384700	47.528475	12	58600	58600	2072105	Matched	Intersection	34679.0	41ST AVE SW AND SW THISTLE ST	...	Dry	Daylight	NaN	6079001.0	NaN	10	Entering at angle	0	0	N

	SEVERITYDESC
Property Damage Only Collision	136485
Injury Collision	58188

	UNDERINFL
N	100274
0	80394
Y	5126
1	3995
