Building a data science team
February 9, 2012
Spurred on by a couple of articles on building data science teams, we thought we would provide our perspective on the issue. We may have a slightly different take because our data science team is not just a part of our company; it is our company.
Data science is akin to design
With a nerdy name like data science (itself a term that has multiple understood meanings), you might expect a team of data scientists to be the bespectacled nerds in the back room: socially awkward, jargon-spewing, and furiously clacking away on their keyboards. This image could not be further from the truth. An effective data science team looks a lot like an effective design team: brainstorming creative ideas, making prototypes, receiving feedback, telling stories, and deeply understanding the needs of others. Building great design teams is an elusive goal, but a vast amount has been written about it, and that material is invaluable if you are building a data science team.
Just as with design, it is critical to have an environment that supports intensely collaborative work. Data science is astoundingly fun and creative; don’t stifle the process with boring cubes, structured meetings, and needless bureaucracy. Data scientists need room to think and program, both together in groups and alone quietly. Provide your data scientists with space for pair programming, unstructured brainstorming, inspired programming sessions at a cafe, and solo tweak-out sessions.
Communicating results is more important than optimizing algorithms
Getting one extra decimal place of predictive power is too often the focus of data science competitions. The perception that “prediction is everything” often leads data scientists to develop complex statistical models that are impossible to communicate to end users, who are invariably not data scientists. Often, simple mechanistic models (popular in physics and engineering) have as much predictive power as generic statistical approaches (popular in machine learning and statistics), or more, and have two enormous advantages: (i) they are easier to translate into intuitive visualizations, and (ii) the adjustable parameters are actually something that non-analysts can understand, interpret, and adapt as their use case changes. As a result, simple models are more valuable for the people who actually make decisions. The “downside” of this approach is that the data science team actually needs to think and learn about the process they are modeling, which leads us to the next point:
Value breadth, and the ability to “deep dive”
Important problems are rarely confined to one discipline. Your data science team doesn’t need to know everything from all disciplines (logistic regression, unit operations, statistical process control, Bayesian networks, bootstrap hypothesis testing, particle filters, phylogenetics, field theory, social network analysis, nonlinear differential equations; this list could get really long really fast), but your team does need to know of these things: how to ask the right questions, and where to look to find the answers. Build a team that not only comes from diverse technical backgrounds, but also has the passion to find the best way to solve a problem, learn what they need to know quickly, and apply their new knowledge effectively.
Don’t be handcuffed by tools
Python, Hadoop, Processing, D3, Raphael, R, Ruby, MySQL... these are just a fraction of the powerful open source tools that are the bread and butter of data scientists. There is no single tool that is the right fit for analytics and visualization projects all of the time. Your data science team needs to (i) know or be able to find the right tool for a particular task, (ii) learn it quickly, and (iii) integrate it into a system that informs and delights users.
When setting goals, avoid the temptation to pursue “merely interesting” problems (“Let’s figure out a faster way to compute the first billion digits of π!”) instead of “interesting and important” problems where a solution will create business value. There are way more than enough interesting and important problems to keep all of the world’s data science teams busy. Many lean startup principles, such as constant refinement and hunting for a Minimum Viable Product, apply equally well to the product development process that your data science team will be involved with.