Creative thinking and data science
October 24, 2012
There has been a lot of chatter over the past few years about the next industrial revolution—“big data” will transform how companies do business and redefine the competitive landscape across industries. Executives across the world, heeding this advice, have been investing heavily in “big data” and “data scientists”. Oftentimes, however, the heavy investment in “big data” hasn’t been living up to its billing. Why is this the case? Why are companies not able to maximize the value from their data when data is, by and large, regarded as extremely valuable?
Perhaps it is a lack of tools; is the technology there to handle these data sets and supply the analytic horsepower needed to solve our issues? The short answer is yes: see exhibits Hadoop, R, Python, and D3.
Perhaps it is a lack of talent; are there enough technically trained individuals who are capable of solving data science problems? With a single entity (albeit IBM) hiring 4,000 statisticians a few years ago, it seems clear that there are enough qualified people out there to do the job.
We have the tools, the people and, as we mentioned earlier, we have company buy-in, so what is the real reason? In our opinion, data scientists are spending too much time improving the technical aspect of data science rather than exploring new opportunities where data can be used as a valuable resource. To understand this concept better, let's take a look at a generic 3-step view of the data science process:
- Identify the problem
- Develop a solution
- Communicate the solution
Most data scientists follow the above steps to solve “big data” issues and spend most of their time on step 2, developing a solution. One prime example is the current Kaggle competition for the $3 million Heritage Health Prize, which has been running for the past 2 years. In this exercise, participants develop a model to predict which patients will be admitted to a hospital in the next year using historical claims data. Each proposed solution is scored and compared against the competitors' until a winner is selected based on the best score. Going back to our steps, participants are given the data and the problem and then spend two years developing a solution. Communicating results is omitted from the process entirely, something we’ll touch on shortly.
Taking a closer look at the leaderboard data from this competition, we can find some interesting insights. The majority of the total score improvement achieved throughout the competition occurred within the first two months of submissions. This means that over the remaining 16-month submission period, the score only improved by tiny fractions of a point. Let’s take a step back: would it be better to spend those 16 months finding the next decimal point of accuracy, or to spend that time looking for other big gains like the ones the competition achieved in the first couple of months? The answer of course is the latter, but unfortunately data scientists frequently work on the former.
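To make the diminishing-returns argument concrete, here is a minimal Python sketch of how one might measure the share of total improvement reached by a given month. The function name `share_of_improvement` is our own, the sketch assumes an error-style score where lower is better, and the monthly values are made up for illustration; they are not the real Heritage Health Prize leaderboard numbers.

```python
from typing import Sequence


def share_of_improvement(best_scores: Sequence[float], cutoff: int) -> float:
    """Fraction of the total improvement reached by period `cutoff`.

    best_scores[i] is the best leaderboard error after period i
    (assumption: lower is better); best_scores[0] is the starting baseline.
    """
    if not 0 <= cutoff < len(best_scores):
        raise ValueError("cutoff must be a valid index into best_scores")
    total = best_scores[0] - best_scores[-1]
    if total <= 0:
        raise ValueError("no overall improvement to apportion")
    return (best_scores[0] - best_scores[cutoff]) / total


# Hypothetical monthly best errors, invented for illustration only.
hypothetical = [0.50, 0.48, 0.47, 0.469, 0.468, 0.4675]

# Share of the overall gain already captured after the first two months.
early = share_of_improvement(hypothetical, cutoff=2)
print(f"Share of total improvement after two months: {early:.0%}")
```

With numbers shaped like these, nearly all of the gain lands early, and the remaining months fight over the last few percent. Running the same calculation on a real leaderboard history would show whether that pattern actually holds.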
The other issue with the competition that is incongruous with good problem-solving practices is that, as mentioned earlier, it completely ignores the communication of results to the “users” (in practice, these would be managers listening to and acting upon the insights of the data science team, or the actual users of a data-driven product). Just as a chain is only as strong as its weakest link, data analysis, no matter how sophisticated, is only as useful as what the end user actually understands. The communication part is so important that we propose redefining the data science process as follows:
By rapidly iterating between communication and problem identification, we continuously reframe the problem and make quick progress, like the Heritage Health Prize competition did in its first two months. Only when we are sufficiently satisfied do we then approach a solution and let the technical wizardry fly. As a quote often attributed to Einstein goes, “If I had an hour to solve a problem I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
This process of solving problems is nothing new; designers have been solving problems using this method for decades. We implore the data science community to learn from designers, so that we can collectively solve the important problems, from unraveling the genetic code to solving societal problems, and save the decimal point war for specialized experts.