I work at a pretty cool company where a lot of really talented folks are pretty accessible. Recently, while banging my head against a dataset upload error in R, I meandered up to our Chief Data Scientist to lament my n00b-iness. He just so happened to be temporarily stuck himself, pounding his fists about an RStudio error that was giving him fits.
Witnessing that in the wilds of the work of a bona fide data scientist gave me such a great sense of relief. It was true: even the pros get stuck on seemingly simple data science tasks!
During the Stanford SCI 01 course this past Spring our instructor, Mohammad, preached constantly about how often even he runs into errors while working through problems with various datasets. (But when you’re sitting in class watching a data science pro code, debug, and re-code in real time as students hurl questions and challenges his way, that proclamation is tough to swallow.) Diligently working through those errors in the context of a project is exactly where the learning happens. It’s one of the many small stepping stones to becoming a data scientist.
Below I share the project my team and I worked on and presented as our final presentation. Before doing so, let me pause to say:
- Data Science is hard
- Data Science is still a relatively nascent field requiring a whole host of interdisciplinary skills
- I am not (yet) a data scientist
We spent 8 intense weeks diving deep into Correlation Analysis, Predictive Modeling, Multiple Linear Regression, Prediction Accuracy, Predictive Modeling Flow, Feature Selection, Classification, Distance Measures, Clustering, Web Scraping, Association Rule Mining, and much more! Even with such a rich syllabus and deep coverage, we still have a ways to go…
If you spend any time reading up on how to become a data scientist you’ll quickly find at least two things. One of those things is: DO DATA SCIENCE PROJECTS.
And that is exactly what this class was all about. Each week we had lab work. And the culmination of the course was presenting our team project. I was lucky enough to team up with two wicked smart women, Jamie Castro and Nay Mintin. In the slides below I’ve tried to focus primarily on my contributions to our project. I’ve included 2-3 slides from Jamie and Nay’s work – those are labeled in the notes of the slides. Below I provide some additional context about the course and the project. But, first, here is an updated version of what our final presentation looked like:
For the final project each team had to select a dataset to work with. We had been equipped with all of the tools necessary to do linear regression analyses and k-means clustering, and we had gotten practice and feedback on leveraging ggplot2 and plot.ly. Armed with this technology, our team selected a school shooting dataset.
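To give a flavor of that toolkit: here is a minimal sketch of linear regression, k-means, and a ggplot2 visualization. It uses R’s built-in mtcars data as a stand-in – our actual dataset and column names were different, so everything below is illustrative only.

```r
# Built-in mtcars stands in for our real dataset; columns are illustrative.

# Linear regression: model fuel efficiency (mpg) as a function of weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$r.squared  # proportion of variance the model explains

# k-means clustering on the same two (scaled) features
set.seed(42)            # kmeans picks random starting centers
km <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 3)

# Color the points by cluster with ggplot2, if it is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(mtcars, aes(x = wt, y = mpg, color = factor(km$cluster))) +
    geom_point() +
    labs(color = "cluster")
}
```

Scaling the features before kmeans() matters because the algorithm is distance-based; a feature with a bigger numeric range would otherwise dominate the clustering.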
One of the first things we quickly realized was: not all datasets are created equal. We likely could have chosen a cleaner (easier?) dataset to work with. Alas, once we chose our bed, we had to sleep in it. The tough thing about the dataset we selected is that it required a ton of cleanup before we could even make use of it. This is one of the dirty little secrets of data science: there are selection biases, data and tooling limitations, and a plethora of other indirect inputs that affect a model’s outputs. During our presentation we didn’t have much time to discuss this. But the cleanup was probably the most intensive and time-consuming part of the entire project – inclusive of all the coding errors!
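The cleanup was mostly unglamorous work of the following kind. This is a hedged sketch with made-up toy rows – the column names (`date`, `weapon_type`) and values are hypothetical, not the real schema:

```r
# Toy, messy rows standing in for the raw dataset (illustrative only)
raw <- data.frame(
  date        = c("2018-02-14", "2018-05-18", "unknown"),
  weapon_type = c(" Handgun", "handgun ", "Rifle"),
  stringsAsFactors = FALSE
)

clean <- raw
# Normalize free-text categories: trim stray whitespace, lower-case
clean$weapon_type <- tolower(trimws(clean$weapon_type))
# Parse dates; unparseable entries become NA rather than silently wrong
clean$date <- as.Date(clean$date, format = "%Y-%m-%d")
# Drop rows missing a field the analysis depends on
clean <- clean[!is.na(clean$date), ]
```

Multiply that by dozens of columns and inconsistent encodings and you get a sense of where the project hours actually went.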
Throughout the class we worked entirely in R, which is hands down the friendliest of the data science programming languages. The number of open source packages and capabilities available in the toolbox that is R is pretty astounding. Our instructor’s approach of requiring lab work and sheer repetition of tasks helped us grasp the syntax, libraries, etc. relatively quickly.
The Final Project:
Our team name was 4Madison. The reason we chose this name – my daughter’s name – is that in the days prior to the start of class there was a school shooting scare near her school. It ended up being an empty threat from a person with some mental instabilities. In addition to freaking me the hell out, it was the impetus for our group project:
Could we predict the next school shooting?
This was an audacious – if not totally macabre – grand tour question for us to pursue. But pursue it we did. And in the google slides below you can flip through some of our findings. The approach we took was to divide and conquer the dataset. My focus was on three features: weapon_type, weapon_source, and state. Nay focused on time and date and school demographic data. And Jamie focused on state legislation.
So as not to bury the lede, our predictive profile of what the next school shooting will consist of is this:
The next school shooting is likely to happen on a [Tuesday] at around [11:00am] using a [handgun] that the shooter will have procured from their [parents].
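A profile like that can be read straight off a dataset by taking the most frequent value (the mode) of each categorical feature. Here is a minimal sketch of the idea using invented toy rows – the real bracketed values above came from the full dataset, and these hypothetical column names and counts are for illustration only:

```r
# Toy stand-in rows; the real columns/counts come from the actual dataset
shootings <- data.frame(
  day_of_week   = c("Tuesday", "Friday", "Tuesday", "Monday"),
  weapon_type   = c("handgun", "handgun", "rifle", "handgun"),
  weapon_source = c("parents", "parents", "unknown", "other")
)

# Most frequent value of a categorical column
mode_of <- function(x) names(which.max(table(x)))

# Apply the mode to every column to get the "most likely" profile
profile <- sapply(shootings, mode_of)
```

A mode-per-feature profile ignores interactions between features (the most common day and the most common weapon need not co-occur), which is one reason there remains plenty of modeling work to do on this question.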
You can find our code in the slides and in some of the slide notes.
While there remains a ton of work to do on this dataset, there are more comprehensive, higher-quality school shooting dataset analyses out there (e.g. this one). If I could do it over again, I likely would not have selected the Washington Post scrubbed version of the school shooting data to work with.
It’s a pretty sad dataset to work with. While it was a great learning experience, I’ll be turning my attention to a new dataset with new challenges.