Importance of Writing in Data Science

Recently I have been thinking about data science and how much writing I have been doing: blogging for work, blogging about football, and blogging about random things. What has been interesting to observe is how writing and data science complement each other. I figured I would write a quick blog post for anyone in an analytic profession describing how publishing or writing down your work complements your analytic capabilities. I am counting code comments as part of “writing” because in a lot of cases my write-up starts from an outline built out of my code comments. So here are a few ways that I think it helps tremendously:

It helps you finish projects

In a lot of cases it is easy to find quick nuggets of information in data. It’s a lot harder to talk about what those nuggets mean and why they are significant. Also, does that nugget lead to another interesting nugget?

A lot of times a completed project exists only as code that few people besides you can understand. Comments in your code are important for understanding the code. Write-ups are important for understanding the findings you got from the model you used.
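
For illustration, here is a minimal Python sketch of pulling the comments out of a script to seed a write-up outline. The function name `comment_outline` and the example script are hypothetical, made up for this sketch, and it only handles `#` comments in Python source.

```python
import io
import tokenize


def comment_outline(source):
    """Return the text of every '#' comment in Python source, in order."""
    outline = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Drop the leading '#' characters and surrounding whitespace.
            outline.append(tok.string.lstrip("#").strip())
    return outline


# Hypothetical analysis script, used only to illustrate the idea.
# It is tokenized, never executed, so the names need not exist.
example_script = '''
# Load the season data
games = load_games()
# Drop games with missing scores
games = [g for g in games if g.score is not None]
# Fit a baseline model
model = fit(games)
'''

# Print the comments as a numbered outline for the write-up.
for step, text in enumerate(comment_outline(example_script), start=1):
    print(f"{step}. {text}")
```

Each comment becomes one outline item, which can then be expanded into a paragraph about what that step found and why it matters.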

It gives you a baseline to beat

There have been a lot of times where I never feel like a project is finished because I can always do x, y, or z better. While writing things up doesn’t remedy this directly, I feel like if you set regular weekly write-up deliverables for yourself, you are more likely to write down what you have. In statistics, what you have often isn’t a finished product. There are so many models to try, so many transforms to do, and so many omitted variables to add into the fold. As George Box said, “All models are wrong, but some are useful.” It’s almost as important to get a baseline out quickly as it is to get the best model out eventually. It gives you something to beat. It also gives you something to show. It is important to emphasize in the writeup that this is just a baseline and that more analysis should be done before it is taken seriously. Be aware that project managers and executives will be ready to roll with whatever you say. It’s on you to make the model better.
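
As a rough illustration of the baseline idea, here is a minimal Python sketch that scores a "predict the mean" baseline against a simple linear fit. The data are made up, the helper names `rmse` and `fit_line` are hypothetical, and the errors are computed in-sample, so treat it as a sketch of the comparison and not a proper evaluation.

```python
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted)) ** 0.5


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Made-up illustrative data, not from any real project.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Baseline: always predict the mean of y.
baseline_pred = [mean(y)] * len(y)
baseline_rmse = rmse(y, baseline_pred)

# Candidate model: a straight-line fit of y on x.
intercept, slope = fit_line(x, y)
linear_pred = [intercept + slope * xi for xi in x]
linear_rmse = rmse(y, linear_pred)

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"linear RMSE:   {linear_rmse:.3f}")
```

Writing both numbers down in the weekly write-up gives every later model a concrete figure to beat.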

It gives you a clear outline on where to go

In a lot of cases the process of writing up your findings will make you more critical in assessing model violations, as well as giving you a place to go next. There is almost always a conclusion/next steps portion to a writeup. When you write something up you spend time explaining to yourself and others why you used the methods you did. I think the old adage that you don’t know a subject until you can teach it holds here. You don’t really know the data until you have to write about it. I have never finished a writeup thinking “that’s it, I’m done.” The process usually helps me think of questions that I didn’t have while writing the code, violations of model assumptions I may have committed, or other pieces of data that I might want to add to the analysis.

It helps you communicate to anyone that relies on you for insights

This one is really important. This is also one of the biggest reasons I created this blog. Complex models are cool, but being able to explain complex models in a straightforward manner is extremely valuable. Scratch that. Being able to explain ANY statistical concept in a straightforward manner is extremely valuable. Not everyone you are going to talk to will have a master’s or PhD in statistics. Being able to clearly communicate your findings is a skill, not a given. It has been my experience that it is a skill that many data scientists put little stock in, to the detriment of their own careers. The data scientist who can explain advanced concepts to the everyman is the real hero, not the one who gets frustrated that nobody understands him or her. I have also seen scenarios where the correct numbers and verbiage were not used simply because the person who matters in the company didn’t understand what they had just been told. Writing stuff up helps you understand how they want to hear what you are trying to say. That sentence might sound weird, but it’s important. Trust me. We are outliers in our train of thought and approaches. The outlier should never simply be ignored, but investigated (corny stats joke, feel free to ignore).

It helps you understand what value it adds

I have seen countless forays into analytics that are interesting, but what really matters is whether they are valuable. You can run the most bomb model in all of models, and if it gives a 2% accuracy boost over a linear model, why should anyone care? Outside of whatever I learned from running that model in the wild, chances are nobody ever will. There are certain exceptions to this, but unless you are working on the most cutting-edge methods, I doubt most of us see much value in a 2% accuracy boost. I think a writeup helps you answer the question “Why is what I did important?”

Anyway, that’s my two cents on the matter. As always feel free to say “You don’t know anything about anything” and move on or bash me in the comments. I welcome your criticism. Maybe you think I missed or overlooked something. I welcome your insights.


One thought on “Importance of Writing in Data Science”

  1. Good thoughts. What one typically sees in articles about the most important skills is something like this:
    While this article hits all the bases, it concludes with the most important one. Care to guess what the most important language for a data scientist to master is? Yes, in fact, it is English (or fill in the blank with the lingua franca of your locale). Why? Because if you cannot communicate your work to your colleagues, bosses, customers, and yourself (such as your future self looking back on old work), then the work has less value. You were presumably hired because your technical skills were considered sufficient in one or more of the areas relevant to the work. Given that you cleared the initial technical bar, your ability to think critically, produce work, learn, and communicate is even more important.

    I often tell our engineers (and myself) that if you cannot get what is in your head onto paper then the information is of little value. Why do I say this? Because if the insight is trapped in the mind of “the one”, then it is not actionable by the one or by the larger group. Just as important, the very act of writing causes the writer to reflect on and challenge their own assumptions about things like 1) their understanding of the problem, 2) logic and approach to solution, 3) consistency, 4) alternatives, 5) refinement, etc. The engineers, my students, and even I are frequently not happy with the truth that the first draft is usually not a good final product, but we have come to believe that it is good enough, and we move on to the next thing. A delicate balance, this one, for sure. Axiom: know your stopping rule in advance.

    All too often I see analytical results that can fill hundreds of PowerPoint pages, yet when I look at an individual page (or even blocks of them) they cannot stand on their own and it is hopeless for the viewer (even with a high degree of expertise) to: 1) understand the context and content {tables of data, graphs without proper labels, legends, or explanation, beg me to “trust the expert” without necessarily understanding the expert}, 2) draw useful conclusions, 3) understand the point the author is trying to make or the decision that they think is implied by the result {frequently it is just a data dump and they have not thought about the implications}. And this is often true even if the developer is there explaining them. What is the validity of the result? Where is the insight I am supposed to come away with, and how is it relevant to a decision (or action) to be made?

    Your last comment is an excellent one for consideration and future discussion. In my world, we talk about model and simulation “fidelity”. Everyone has their own notion of what they mean by fidelity, and it is usually not the same across the group. But the majority group-think (especially among the decision makers) is that higher fidelity is better. This is a pure canard. I tend to think along the lines of effectiveness. Was your model and simulation effective at solving the problem at hand? Then it was good enough.

