Kaggle and Data Science

November 24, 2014

Reflections on data science after a “data science” competition.

When my dad started out as a computer engineer at CNA, programming was very new. Doing a calculation for an actuary took several months or more and a massive team of people. The process of mining data involved writing calculations in FORTRAN, compiling code onto punch cards, and carefully carting these down the hall to the computer. With any luck, the cart wouldn't tip over and spill the cards of the program everywhere. To debug the program, my father would write separate 'patches', more stacks of cards to be inserted somewhere in the bigger stack of cards. A study required many roles that no longer exist. Simple questions could take months of work and dozens of people carefully manipulating giant machines.

Modern technology lowers the barriers to trying new calculations, which is important because we rarely know the exact calculation we need before we’ve tried a few things. It also tightens the feedback loop of an analysis from months down to an afternoon. As an example, when Alinea announced a contest to estimate the number of customers they serve in a year, we tried a few ways of computationally estimating sales traffic and very nearly won the contest, all from an afternoon’s work. Because calculations are fast and require little overhead, today’s data scientists can step back from highly technical computational issues and instead specialize in identifying the right problems to solve in the first place. Instead of implementing, debugging, and patching FORTRAN batch processes, modern data scientists can stand on the shoulders of giants and think about the problem area itself.

I recently participated in a Kaggle data science competition and was surprised to see that “the home of data science” was actually a throwback to a bygone era of data analysis. The 2014 Schizophrenia Classification Challenge, put on by the IEEE International Workshop on Machine Learning for Signal Processing, was an optimization problem where the goal was to classify whether patients were diagnosed with schizophrenia based on MRI data.

My colleague Laurie Skelly, a neuroscientist who actually studied and might have even helped collect some of the data used in the contest, was interested in and knew about the problem. In other words, she was actually interested in asking questions of the data. How were the measurements taken? In what context was the data taken? How had the data been processed? Why were we trying to predict schizophrenia as a binary label when there are a variety of subtypes and severities? There are lots of relevant questions here about why the problem was posed this way, with particular evaluation metrics already selected. In the competition, there was no context. There was no problem identification. There was no brainstorming about whether this was the right problem to solve in the first place. There was just some data and a placeholder black box into which each participant had to place their code, not unlike how my father loaded his punched cards into the mainframe.

I entered the competition because I wanted to practice using some classification algorithms that I had been experimenting with for a client. Participating in the competition allowed me to practice my workflow and try out new things in a low-pressure environment. It was a great way to spend the afternoon. After experimenting with a few textbook algorithms, I ended up using a support vector machine with no feature selection and took care not to overfit the model to the data. The code, including my intermediate work, is up on GitHub.

After a few hours of work with some open source tools, I was really surprised to discover that I placed in the top 3% (#humblebrag). The top score had a rating of 92%. My rating was 89%. Are these differences substantial or negligible? Is 89% sufficient for the task at hand, so that remaining resources should be devoted to other things? The answers to these questions are critical to real-life problem solving, but this discussion is sadly outside the scope of “data science” as it has come to be defined by Kaggle.
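To make the first of those questions concrete, here is a rough sketch of how one might start to ask whether a gap like 92% versus 89% is more than noise. It rests on assumptions that are mine, not the competition's: it treats each rating as a simple proportion of test cases scored correctly, it treats the two scores as independent (they were really computed on the same test set), and the test set sizes are hypothetical, since the real one is not given here. The function names are made up for illustration.

```python
import math


def score_standard_error(score: float, n_cases: int) -> float:
    """Binomial standard error of a proportion-style score on n_cases test cases."""
    return math.sqrt(score * (1.0 - score) / n_cases)


def difference_in_standard_errors(score_a: float, score_b: float, n_cases: int) -> float:
    """Gap between two scores, measured in standard errors of the difference.

    Assumes the two scores are independent, which is a simplification.
    """
    se_a = score_standard_error(score_a, n_cases)
    se_b = score_standard_error(score_b, n_cases)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return (score_a - score_b) / se_diff


if __name__ == "__main__":
    # Hypothetical test set sizes; the real size is not stated in this post.
    for n_cases in (100, 1_000, 10_000):
        gap = difference_in_standard_errors(0.92, 0.89, n_cases)
        print(f"n = {n_cases:>6}: gap is {gap:.2f} standard errors")
```

Under these assumptions the same three-point gap sits well inside the noise on a test set of a hundred cases and well outside it on ten thousand. Even then, the arithmetic only says whether the gap is real, not whether it matters for the task at hand.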

Knowing how to fit and test models is an important part of data science, but not all models are valuable. Data scientists need to be able to see the difference. The “home of data science” houses models, not science.