Python: Analyze policing activity with pandas
This blog post continues my journey of learning data science
with Python through DataCamp. Detailed illustrations and runs of the
following scenarios are available in my Kaggle workspace and on GitHub. For the original course, refer to DataCamp.
Preparing the data for analysis
Examining the dataset
# Import the pandas library as pd
import pandas as pd
# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv("police.csv")
# Examine the head of the DataFrame
print(ri.head())
# Count the number of missing values in each column
print(ri.isnull().sum())
Dropping columns
# Examine the shape of the DataFrame
print(ri.shape)
# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)
# Examine the shape of the DataFrame (again)
print(ri.shape)
Dropping rows
# Count the number of missing values in each column
print(ri.isnull().sum())
# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)
# Count the number of missing values in each column (again)
print(ri.isnull().sum())
# Examine the shape of the DataFrame
print(ri.shape)
Fixing a data type
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())
# Check the data type of 'is_arrested'
print(ri.is_arrested.dtype)
# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')
# Check the data type of 'is_arrested'
print(ri.is_arrested.dtype)
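One caveat with this conversion: `astype('bool')` relies on Python truthiness, so any missing value in an object column becomes `True`. It is worth confirming that the column has no missing values before converting. A minimal sketch on made-up values (not the police dataset):

```python
import numpy as np
import pandas as pd

# Toy object column with a missing value (made-up data, not from police.csv)
toy = pd.Series([True, False, np.nan], dtype="object")

# astype('bool') uses Python truthiness, so NaN silently becomes True
converted = toy.astype("bool")
print(converted.tolist())

# Safer: count missing values first and handle them before converting
n_missing = toy.isnull().sum()
clean = toy.dropna().astype("bool")
print(n_missing, clean.tolist())
```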
Creating a datetime index
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time,sep=' ')
# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)
# Examine the data types of the DataFrame
print(ri.dtypes)
Setting the index
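The code for this step is missing from the post. Since later sections rely on `ri.index.hour` and `resample`, the index is presumably set from `stop_datetime`. A sketch of that step on a toy frame (the sample rows and the name `ri_toy` are made up for illustration):

```python
import pandas as pd

# Toy stand-in for 'ri' (made-up rows; the real data comes from police.csv)
ri_toy = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23"],
    "stop_time": ["12:55", "23:15"],
    "violation": ["Equipment", "Speeding"],
})

# Build the datetime column the same way as in the previous step
combined = ri_toy.stop_date.str.cat(ri_toy.stop_time, sep=" ")
ri_toy["stop_datetime"] = pd.to_datetime(combined)

# Set 'stop_datetime' as the index (assumed to be what this step did)
ri_toy.set_index("stop_datetime", inplace=True)

# The index is now a DatetimeIndex, so attributes like .hour are available
print(ri_toy.index)
print(ri_toy.index.hour.tolist())
```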
Exploring the relationship between gender and policing
Examining traffic violations
# Count the unique values in 'violation'
print(ri.violation.value_counts())
# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))
Comparing violations by gender
# Create a DataFrame of female drivers
female = ri[ri['driver_gender']=='F']
# Create a DataFrame of male drivers
male = ri[ri['driver_gender']=='M']
# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))
# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))
Comparing speeding outcomes by gender
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation=='Speeding')]
# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation=='Speeding')]
# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))
# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))
Calculating search rate
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtype)
# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))
# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())
The search rate is about 3.8%. Next, we examine whether the search rate varies by driver gender.
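Taking the mean of a Boolean column gives the proportion of `True` values, which is why the two calculations above agree. A small sketch on made-up values shows the equivalence:

```python
import pandas as pd

# Toy Boolean column (made-up values): 1 search out of 4 stops
searched_toy = pd.Series([False, True, False, False])

# Manual proportion: number of True values divided by the number of rows
manual = searched_toy.sum() / len(searched_toy)

# mean() treats True as 1 and False as 0, so it returns the same proportion
rate = searched_toy.mean()
print(manual, rate)
```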
Comparing search rates by gender
# Calculate the search rate for female drivers
print(ri[ri.driver_gender =='F'].search_conducted.mean())
# Calculate the search rate for male drivers
print(ri[ri.driver_gender =='M'].search_conducted.mean())
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())
The result above shows that male drivers are searched more than twice as often as female drivers. Why might this be?
Adding a second factor to the analysis
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender','violation']).search_conducted.mean())
For all types of violations, the search rate is higher for males than for females. This disproves the hypothesis that the gap arises only because males and females tend to commit different violations.
Counting protective frisks
# Count the 'search_type' values
print(ri.search_type.value_counts())
# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)
# Check the data type of 'frisk'
print(ri.frisk.dtype)
# Take the sum of 'frisk'
print(ri.frisk.sum())
It looks like there were 303 drivers who were frisked. Next, we will examine whether gender affects who is frisked.
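The `na=False` argument matters because `search_type` is missing whenever no search took place. A sketch on made-up values; without `na=False` the missing row typically stays missing, though the exact behaviour can vary by pandas version and string dtype:

```python
import numpy as np
import pandas as pd

# Toy 'search_type' values (made up); NaN means no search was conducted
search_type = pd.Series([
    "Protective Frisk",
    np.nan,
    "Incident to Arrest,Protective Frisk",
    "Probable Cause",
])

# Without na=False, the NaN row usually propagates as a missing result
raw = search_type.str.contains("Protective Frisk")
print(raw.tolist())

# With na=False, missing values count as "no match" and the dtype is bool
frisk = search_type.str.contains("Protective Frisk", na=False)
print(frisk.dtype, frisk.sum())
```

Because the match is a substring test, combined labels such as "Incident to Arrest,Protective Frisk" are counted as well.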
Comparing frisk rates by gender
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]
# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())
# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())
Visual exploratory data analysis
Calculating the hourly arrest rate
# Calculate the overall arrest rate
print(ri.is_arrested.mean())
# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())
# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
Plotting the hourly arrest rate
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()
# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')
# Display the plot
plt.show()
The arrest rate has a significant spike overnight, and then dips in the early morning hours.
Plotting drug-related stops
# Calculate the annual rate of drug-related stops
# ('A' is the year-end alias; recent pandas versions prefer 'YE')
print(ri.drugs_related_stop.resample('A').mean())
# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()
# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()
# Display the plot
plt.show()
Comparing drug and search rates
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()
# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate,annual_search_rate], axis='columns')
# Create subplots from 'annual'
annual.plot(subplots=True)
# Display the subplots
plt.show()
Tallying violations by district
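The plots in this section use `k_zones`, which the post never defines. A plausible reconstruction is a frequency table built with `pd.crosstab` and then sliced by label. This assumes the data has a `district` column with labels such as `Zone K1` through `Zone K3`; both the column name and the labels are assumptions here, and the toy rows are made up:

```python
import pandas as pd

# Toy stand-in for 'ri' (made-up rows; 'district' and its labels are assumed)
ri_toy = pd.DataFrame({
    "district": ["Zone K1", "Zone K2", "Zone K3", "Zone X1", "Zone K1"],
    "violation": ["Speeding", "Equipment", "Speeding", "Other", "Speeding"],
})

# Frequency table: one row per district, one column per violation
all_zones = pd.crosstab(ri_toy.district, ri_toy.violation)

# Label-based slice keeps only the rows from 'Zone K1' through 'Zone K3'
k_zones = all_zones.loc["Zone K1":"Zone K3"]
print(k_zones)
```

Label slices with `.loc` include both endpoints, so `Zone K3` is part of the result.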
# Create a bar plot of 'k_zones'
k_zones.plot(kind='bar')
# Display the plot
plt.show()
# Create a stacked bar plot of 'k_zones'
k_zones.plot(kind='bar',stacked=True)
# Display the plot
plt.show()
Converting stop durations to numbers
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())
# Create a dictionary that maps strings to integers
mapping = {'0-15 Min':8,'16-30 Min':23,'30+ Min':45}
# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
print(ri.groupby('violation_raw').stop_minutes.mean())
# Save the resulting Series as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()
# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh')
# Display the plot
plt.show()
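When mapping with a dictionary, any label that is not a key becomes `NaN` instead of raising an error, which is why printing the unique values afterwards is a useful check. A sketch on made-up values, including one deliberately unmapped label:

```python
import pandas as pd

# Toy 'stop_duration' values (made up), including one label the mapping lacks
stop_duration = pd.Series(["0-15 Min", "16-30 Min", "30+ Min", "2"])

# Same mapping as above: each range is replaced by a representative number of minutes
mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

# Labels missing from the dictionary become NaN rather than raising an error
stop_minutes = stop_duration.map(mapping)
print(stop_minutes.tolist())

# Count unmapped rows to confirm whether the mapping covered every label
print(stop_minutes.isnull().sum())
```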
Analyzing the effect of weather on policing
Plotting the temperature
# Read 'weather.csv' into a DataFrame named 'weather'
weather=pd.read_csv("weather.csv")
# Describe the temperature columns
print(weather[['TMIN','TAVG','TMAX']].describe())
# Create a box plot of the temperature columns
weather[['TMIN','TAVG','TMAX']].plot(kind='box')
# Display the plot
plt.show()
Preparing the DataFrames
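The code for this step is missing. The merge that follows needs a `weather_rating` DataFrame with `DATE` and `rating` columns, and it needs `stop_datetime` available as a regular column of `ri`. A sketch under those assumptions; how `rating` was derived is not shown in this post, so the toy values below are placeholders, and `ri_toy` and `weather_toy` are hypothetical names:

```python
import pandas as pd

# Toy stand-ins (made-up rows); the real frames come from police.csv and weather.csv
ri_toy = pd.DataFrame(
    {"stop_date": ["2005-01-04", "2005-01-05"], "is_arrested": [False, True]},
    index=pd.to_datetime(["2005-01-04 12:55", "2005-01-05 08:10"]),
)
ri_toy.index.name = "stop_datetime"
weather_toy = pd.DataFrame({
    "DATE": ["2005-01-04", "2005-01-05"],
    "TAVG": [44, 36],
    "rating": ["good", "bad"],  # placeholder ratings, not derived from real data
})

# Move 'stop_datetime' from the index back into a column so it survives the merge
ri_toy.reset_index(inplace=True)

# Keep only the columns needed for the merge
weather_rating = weather_toy[["DATE", "rating"]]

print(ri_toy.columns.tolist())
print(weather_rating.shape)
```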
Merging the DataFrames
# Examine the shape of 'ri'
print(ri.shape)
# Merge 'ri' and 'weather_rating' using a left join
ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left')
# Examine the shape of 'ri_weather'
print(ri_weather.shape)
# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_datetime', inplace=True)
Comparing arrest rates by weather rating
# Calculate the overall arrest rate
print(ri_weather.is_arrested.mean())
# Calculate the arrest rate for each 'rating'
print(ri_weather.groupby('rating').is_arrested.mean())
# Calculate the arrest rate for each 'violation' and 'rating'
print(ri_weather.groupby(['violation','rating']).is_arrested.mean())
Selecting from a multi-indexed Series
# Save the output of the groupby operation from the last exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()
# Print the 'arrest_rate' Series
print(arrest_rate)
# Print the arrest rate for moving violations in bad weather
print(arrest_rate.loc['Moving violation','bad'])
# Print the arrest rates for speeding violations in all three weather conditions
print(arrest_rate.loc['Speeding'])
Reshaping the arrest rate data
# Unstack the 'arrest_rate' Series into a DataFrame
print(arrest_rate.unstack())
# Create the same DataFrame using a pivot table
print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested'))
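The two approaches match because `pivot_table` aggregates with the mean by default, which is the same calculation as the grouped mean followed by `unstack`. A sketch on a made-up stand-in for `ri_weather`:

```python
import pandas as pd

# Toy stand-in for 'ri_weather' (made-up rows)
toy = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Other", "Other"],
    "rating": ["good", "good", "bad", "good", "bad"],
    "is_arrested": [False, True, True, False, False],
})

# Route 1: group by both keys, take the mean, then unstack the inner level
via_unstack = toy.groupby(["violation", "rating"]).is_arrested.mean().unstack()

# Route 2: pivot_table, whose default aggregation is the mean
via_pivot = toy.pivot_table(index="violation", columns="rating", values="is_arrested")

# Compare the two tables cell by cell
print(via_unstack)
print((via_unstack.values == via_pivot.values).all())
```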