Reducing violent and property crime in your neighborhood has a long-term impact on the livelihood of the people you love and the community you serve. Here is a review of the work I did for MyCityAtPeace, a for-profit organization using a specialized hands-on approach to tackling these issues.
Project needs analysis and implementation plan
If you do a quick Google search, you’ll find a lot of useful information on various crime statistics: robbery rate, burglary rate, aggravated assault and much more. However, it is very difficult to find free information on violent and property crime in the US at the state and town level.
So here is how I went about gathering the information:
- I consulted with an executive-level stakeholder to determine what they needed to deploy strategic and operational resources efficiently.
- Nine socioeconomic variables were required (the population, unemployment, and real estate items each count as two: the value and its growth):
  - Degree of violent crime
  - Degree of property crime
  - Average population of town and growth
  - Average unemployment of town and growth
  - Median real estate value of town and growth
  - Percentage cost of living compared to the US average
- The code base was developed in Python and run on Google Cloud Platform.
- The data was then cleansed using Python and visualized in Tableau.
The code and Tableau worksheet are available on my GitHub repository.
Code structure and implementation
Data on these socioeconomic variables isn’t freely available, at least not at the level required. If you’re planning on scraping data, it is a lot easier if you choose a website that has most of the information you need. Why? Because each website has a unique HTML structure, so sticking to a single site saves you from developing a separate script to tailor the extraction process for each one. Luckily for me, all the data I needed could be found on BestPlaces.
Here is the high-level process of the script design:
- Required dependencies were imported, with the usual suspects: pandas, numpy, re, requests and BeautifulSoup.
- State-level zip codes were imported into a pandas DataFrame.
- The DataFrame was cleansed of NaNs and initialized to include columns for two URLs used to facilitate the collection process.
- The scraping process iterated through the zip codes, making a request to the base URL with the zip code appended as per the structure defined on the website. Here is an example of demographic data for Cambridge. Take note of the URL structure.
- The data was exported to a CSV file.
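The steps above can be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the URL templates, the column names, and the `fetch` and `parse` callables are assumptions made for the example, and the real BestPlaces URL structure and page parsing are not reproduced here.

```python
import pandas as pd

# Hypothetical URL templates: placeholders, not the real site's structure.
URL_TEMPLATES = {
    "crime_url": "https://example.com/crime/zip-code/{zip}",
    "economy_url": "https://example.com/economy/zip-code/{zip}",
}


def build_url_frame(zip_codes):
    """Drop missing zip codes and add one URL column per template."""
    df = pd.DataFrame({"zip": zip_codes}).dropna(subset=["zip"])
    # Store zip codes as zero-padded 5-character strings, e.g. "02139".
    df["zip"] = df["zip"].astype(int).astype(str).str.zfill(5)
    for column, template in URL_TEMPLATES.items():
        df[column] = [template.format(zip=z) for z in df["zip"]]
    return df.reset_index(drop=True)


def fetch_html(url):
    """Download one page; requests is imported here so the rest runs offline."""
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def scrape(df, fetch, parse, url_column="crime_url"):
    """Fetch every URL in df[url_column] and collect the parsed fields."""
    rows = []
    for zip_code, url in zip(df["zip"], df[url_column]):
        try:
            rows.append({"zip": zip_code, **parse(fetch(url))})
        except Exception as err:  # record the failure and keep going
            rows.append({"zip": zip_code, "error": str(err)})
    return pd.DataFrame(rows)
```

With a real `parse` function (for example one built on BeautifulSoup that returns a dict of fields), a call such as `scrape(df, fetch_html, parse).to_csv("output.csv", index=False)` would cover the export step. Passing `fetch` and `parse` in as arguments also makes it easy to try the loop on a handful of zip codes before a full run.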
Google Cloud Implementation
If you run the code on your own computer, it will take about 9 hours to finish. I honestly didn’t have that kind of time to wait at a fixed Wi-Fi spot (if I moved the computer and the connection dropped, the scraping process stopped too).
So I ran the process on Google’s Compute Engine, part of its Cloud Platform (GCP). The cost of running the code on the service ended up being about $3. There are tons of tutorials out there to help you get started, but here is the overview:
- Created a project on GCP.
- Created a standard “VM Instance” with the following setup:
  - Zone: us-central1-a
  - Machine Type: n1-standard-1 (1 vCPU, 3.75 GB memory)
  - OS: Ubuntu 16.04
- SSH’d into the virtual machine (VM), updated it, and installed the required dependencies, along with the Python codebase and folder structure.
- Then ran the code on 3 URLs to make sure it was producing the correct output. If it doesn’t run as planned, it might have something to do with the way you’re referencing input and output files. The file path syntax is different on Linux.
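One way to sidestep path differences between a laptop and a Linux VM is to build file paths with `pathlib` instead of hard-coding separators. The sketch below assumes that path separators were the kind of problem meant above; the folder and file names are hypothetical.

```python
from pathlib import Path


def project_paths(base_dir):
    """Build input and output paths that work on Windows, macOS, and Linux."""
    base = Path(base_dir)
    # The "/" operator joins path parts with the right separator for the OS.
    input_file = base / "data" / "zip_codes.csv"    # hypothetical input file
    output_file = base / "output" / "scraped.csv"   # hypothetical output file
    return input_file, output_file
```

Both returned paths can be passed straight to `pd.read_csv` and `DataFrame.to_csv`, which accept `Path` objects.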
Finally, perhaps the most rewarding part of this process is making sure the code keeps running after you close the SSH terminal. Here is the tutorial I used to do it. I went to bed, slept for about 4 hours, and the scraping process was done.
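The tutorial itself is not reproduced here, but the general idea can be sketched in Python: start the scraper in its own session, with its output sent to a log file, so that it is not tied to the SSH terminal. This is only one possible approach (tools such as `nohup`, `screen`, or `tmux` are common alternatives), it is POSIX-only, and the function name is an assumption made for the example.

```python
import subprocess


def launch_detached(command, log_path):
    """Start command in a new session so closing the terminal does not stop it."""
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,   # no input from the terminal
        stdout=log_file,            # progress goes to the log file instead
        stderr=subprocess.STDOUT,   # errors end up in the same log
        start_new_session=True,     # POSIX only: detach from the SSH session
    )
    log_file.close()  # the child process keeps its own copy of the handle
    return process
```

A call such as `launch_detached(["python3", "scraper.py"], "scrape.log")`, where both file names are placeholders, would return immediately and leave the scraper running in the background.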
The data was imported into Tableau, which automatically identified state as the geographic variable, allowing for “maps” visualization. Here are some interesting findings.
From the images above, we see a higher prevalence of property crime in the US. Where property crime has higher variability around certain hotspots, violent crime seems to (mostly) be restricted to the hotspots themselves.
From the images above, we see the distribution of the ratio of US violent crime to property crime over two demographics: average US cost of living and population growth. By using a ratio, you can isolate states where one type of crime has more of an influence than the other.
- We see a higher prevalence of violent crime around the outskirts of the US.
- We see 6 states that have a higher prevalence of violent crime and positive population growth.
- We see states around the western region of the US having a higher prevalence of violent crime and a higher than average cost of living.
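The ratio behind these maps can be computed along the following lines. This is a sketch under assumptions: the column names are hypothetical, averaging the town-level values per state before dividing is one choice among several, and the numbers in the toy table are made-up placeholders, not results from the scraped data.

```python
import pandas as pd


def crime_ratio_by_state(df):
    """Average both crime measures per state, then take violent / property."""
    by_state = df.groupby("state")[["violent_crime", "property_crime"]].mean()
    by_state["violent_to_property"] = (
        by_state["violent_crime"] / by_state["property_crime"]
    )
    # States where violent crime dominates come first.
    return by_state.sort_values("violent_to_property", ascending=False)


# Made-up placeholder rows, one per town, for illustration only.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "violent_crime": [40, 60, 10, 30],
    "property_crime": [25, 25, 40, 40],
})
print(crime_ratio_by_state(toy))
```

A ratio above 1 marks a state where the violent crime measure outweighs the property crime measure, and a ratio below 1 marks the reverse.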