What is a data scientist?

February 5, 2014

A while back, Josh Wills gave a talk^[1] at Airbnb about the life of a data scientist. He showed a couple of tweets that answer the question “What is a data scientist?”, and those, along with some new variations, have recently been making the rounds on Twitter again:

“‘Data Scientist’ is a Data Analyst who lives in California”

"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."

“A data scientist is a business analyst who lives in New York.”

"A data scientist is a statistician who lives in San Francisco."

"Data Science is statistics on a Mac."

Funny and memorable, but are these just tongue-in-cheek? Is ‘data science’ all hype and no substance, nothing more than small tweaks on roles that have existed for decades?

What is a data scientist?

Well, I am a data scientist. I know that because my business card says so. When I wanted to leave the career path I’d been planning in social neuroscience and do “something else,” I had no idea what a data scientist was, either. Then one day I read a job posting for a data scientist here at Datascope and discovered, from what I read, that I already was one.

Those definitions of data scientists floating around Twitter ring hollow to me, for a reason that hadn’t really hit me until this morning. I was reading a blog post^[2] where an ed tech researcher shared a couple of anecdotes about some data-driven projects he’d recently come across, and he seemed none too impressed. He doubted that the “big data” approach had many benefits, and warned that the lack of strong a priori design would result in a mountain of results to wade through and interpret.

The post contains no details of the projects, so we can’t evaluate whether we agree with his opinion of their designs. Straw man or not, however, his observations as reported are provocative. If there are useful sources that would improve the design of the projects, why pass those over? Why would someone just ignore the accumulated wisdom, leave all the design decisions to a bunch of algorithms, and cross their fingers hoping that the outcome would make sense? Just who do these Big Data Scientist people think they are, anyway?

Certain problems might be solvable with a purely data-driven approach like the one he seems to describe, using some elegant applications of subject-matter-agnostic algorithms and magically receiving an emergent solution… maybe. But even so, I would never approach a client’s data science project in that way. I know that the growing conception of the data scientist is influenced by anecdotes like these, and I think that data scientists as a group are poorly served by this image.

The perks of being an academic expat

Talking to my friends still in academia, I’m often quick to qualify that data scientists are ‘not really scientists.’ This perhaps isn’t necessary, but it reflects a fundamental difference I feel between what I used to do in research and what I do now. Our primary goal as a data science firm isn’t to advance knowledge, but to create tools for clients. We don’t document every detail of every decision we make for public scrutiny and replication, though we do participate in and contribute to the open source community. Our metrics of success aren’t peer review and citations but our clients’ satisfaction and recommendation of us to their peers.

Still, I feel that in many ways my work today isn’t different at all from what I learned to do in graduate school. The notion of the PhD as a deep expert on a narrow band of subject matter is outdated and incomplete. A career as the head of a neuroscience lab means constant skill acquisition, design challenges, problem solving, collaboration, and verbal and visual communication (all of which happen to also be core descriptors of a good data scientist).

Or neuroscientist! : ) MT @ratkins: @grapealope ;) RT @Geek_Manager: “A data scientist is a statistician who lives in San Francisco” LOL
— Rachel Kalmar (@grapealope) January 30, 2014

In neuroscience, today’s conventional wisdom has no guaranteed shelf life, the trend cycle is fast, and the toolbox is deep and continually evolving. The next project you do might require gaining rapid competence in two or three new bodies of literature, learning a new instrument, software package, and/or programming language, and becoming fluent with a new set of analytical algorithms.

Curiosity + agility = competence

My perspective is probably due in part to the program where I was trained. Our qualifying exam, rather than the fire drill of knowledge competency I’ve heard of in other programs and decades, was an exercise in efficiency and flexibility. It required us to ‘bone up’ on four topics of research that were new to us, concurrently, to digest a small literature review on each topic, and to be ready to answer questions about and design experiments for those sub-fields, both in writing and in person. We were also tasked with carrying ourselves like experts, never appealing to or apologizing for our recent total ignorance of the material (this was my official university training in what I now know as “the swagger”).

I’m not sad that my project workflow no longer includes hours or days slaving over PubMed doing a lit review, but getting immersed in the background of a problem is still a hugely important part of the process. Today our “lit review” is an ideation workshop, two days at a whiteboard with our clients, probing all the expertise they have accumulated over years of day-to-day immersion and getting a crash course in the operations our tools will need to deal with and work within.

To me this is one of the great luxuries of working with clients -- the opportunity to get rich, productive face time with subject-matter experts. When the experts are clients, they are uniquely motivated to help us know everything they know that is relevant to the problem, so we can design them the best tool possible.

I can only draw from my own experience, and I can’t say much about what data science means at companies where I don’t work. With design at the center of every project we do, I certainly do not feel like a statistician with a Mac. If the data scientists you work with fit that description, and you are wondering what all the fuss is about, come talk to us.

[1] Full disclosure: I did not watch that talk. It’s an hour long, it’s 18 months old, they are out of the free lunch that was supposed to come with it, and I have work to do today! But I did flip through the slides on SlideShare.

[2] h/t Julia Evans @bork