Week 14 Hangover and Updated Efficiency

Week 14 Hangover

As promised, I wanted to go over what I got right and what I got wrong in my previous post, which can be found here.

Cowboys' offense versus Giants' defense

Although I got the prediction wrong, I think my reasoning was correct. What my efficiency numbers do not do well is capture the likelihood of an explosive play: gaining 5 yards on 3rd and 5 counts the same as gaining 25 yards on 3rd and 5. I am currently trying to find a way to change this, but it seems the Giants' offense is built around OBJ making hugely explosive plays a couple of times a game. Take out the 61-yard TD pass to Odell and Eli was 16/27 for 132 yards. This matchup didn't disappoint, and it seems the formula to beat the Cowboys is a crazy good defense with a serviceable offense.

Baltimore at New England

I picked this one right and got the logic correct again, but I have to admit I expected the game to be closer and the score to be lower. The game plan from New England was particularly interesting. To manufacture a run game they used a few receiver sweeps and ran outside the tackles a good bit. The final stat line is a little misleading because the Patriots got up big relatively early in the game and thus ran the ball more, but they did what I thought they might do by sticking to the passing game against this defense.

Washington’s offense versus Philadelphia’s defense

This game was weird. I got the pick wrong here, but in the first half the Eagles should have been up big. Their defense did a great job against this offense for most of the game, but their offense kept shooting itself in the foot. This game also illustrated that finding a way to predict yardage gained, and not simply the binomial 0/1 success/fail of plays, is really important.

Atlanta’s defense versus LA’s offense

I got this very wrong and couldn't be happier about it. I was elated to see the Falcons' defense destroy this Rams offense. Outside of the first drive and the last quarter, the Rams were hardly able to move the ball. I am still cautious about the Falcons' defense. I think Vic Beasley Jr. has been insanely good, but I worry about the secondary. Deion Jones has looked very rangy over the middle of the field as of late, but I still worry about the outside talent and everyone responsible for the "deep" in the deep 3 zone concept. With Trufant out, teams can really attack this team deep.

Why are the Raiders and Chiefs good?

Well, I think the short answer is that they are inefficient but thrive in the extreme parts of the game. What I mean by this is big plays and turnovers for the Raiders and Chiefs respectively. I do want to say that I have little faith in the ability of the Raiders' offensive coordinator. From the efficiency statistics in my previous post, it was clear that the Chiefs were very bad at defending the run. Setting my statistics aside and just looking at the game, the Raiders went 40% run versus 60% pass while averaging 4.7 yards per rush versus 2.9 yards per pass attempt. So what were the Raiders doing? Their QB had a jacked pinkie on his throwing hand and it was 12 degrees Fahrenheit. This is one of the scenarios where I feel like the coach believes his team is so good that he focuses on his own strength instead of the other team's weakness. Focusing on your opponent's weakness is the way of good football coaches and the Sith.

Season Pick Stats

Logic: 4/5

Outcome: 2/4 (didn’t pick the RAI/KAN game)


Updated Efficiency

These numbers reflect games through Week 14.

[Image: ovr_eff_2016wk14 — overall efficiency through Week 14]

[Image: off_eff_2016wk14 — offensive efficiency through Week 14]

[Image: def_eff_2016wk14 — defensive efficiency through Week 14]


I see you

Hey everyone, it's been a while since I posted something, so I figured I would just go ahead and share some of the stuff I am working on. For those of you interested in it from a technical standpoint, I am going to go light on the theory and code. I'm just going to show you the results and some of the interesting bits about it.

What have I been working on? Teaching computers to see like people do. It's super duper fun. What I want to focus on in this post is recognition in images AFTER I have trained a model. I am not going to cover how I created the model here. I might make a post about it later, so stay tuned, but these are some of the buzzwords for the things that I used:

  • Python – programming language
  • OpenCV – computer vision library for Python/C++
  • dlib – computer vision library for Python/C++
  • TensorFlow – deep learning framework by Google (released to the public a couple of months ago)
  • Convolutional neural networks, or convnets
  • AWS

What I used for data is the CIFAR-10 dataset. It's a fairly popular dataset for image recognition tasks and has well-established benchmarks. It is a collection of images classified into 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. What I created got a test accuracy of 89.7%, which would have been good enough for 24th place a couple of years ago in a data science competition hosted on Kaggle. Not too shabby.
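If you have never poked at CIFAR-10 before, the downloadable python version is just a handful of pickled batches. Here is a quick look (the file path is illustrative, and this is the dataset's standard format, not any of my model code):

    import pickle
    import numpy as np

    # Load one CIFAR-10 training batch (the standard pickled format).
    with open("cifar-10-batches-py/data_batch_1", "rb") as f:
        batch = pickle.load(f, encoding="bytes")

    images = batch[b"data"].reshape(-1, 3, 32, 32)  # 10,000 RGB images, 32x32, channels first
    labels = np.array(batch[b"labels"])             # integer labels 0-9
    print(images.shape, labels.shape)               # (10000, 3, 32, 32) (10000,)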

So I created this cool model, but what I want to talk about is how to use that model to find stuff in new images. This post is about using semantic segmentation instead of the naive approach to recognize objects in images. There will be pretty pictures to illustrate what I am talking about. So first, let's look at our input image:

dogandcat

Awww. So cute.

What is the naive approach?

The naive approach in this case is trying to find "interesting" data points in images to then feed to the model. There are a few ways to do this. There is a built-in function in the dlib library, but before I knew about it I actually built my own using OpenCV. I named it boundieboxies because I don't take myself, my work, or grammar srsly. Overall, I just try to give as few fucks as possible. What it does is the following (there's a rough code sketch after the list):

  1. Finds interesting points in the image. You can read up on the OpenCV documentation, but basically it looks for color gradients and uses some edge detection algorithms.
  2. Runs K-means clustering on the x, y coordinates of the points found.
  3. Creates a box around each cluster with some padding.

Boom. Easy as that.
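Here is a minimal sketch of those three steps. This is not my original boundieboxies code; it uses ORB as a stand-in for the point-finding step, and the parameter names are mine:

    import cv2
    import numpy as np

    def boundieboxies(image, n_clusters=12, pad=20):
        """Sketch of the idea: find points, cluster them, box the clusters."""
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # 1. Find interesting points (ORB is one of OpenCV's keypoint detectors).
        keypoints = cv2.ORB_create(nfeatures=2000).detect(gray)
        pts = np.float32([kp.pt for kp in keypoints])
        n_clusters = min(n_clusters, len(pts))
        # 2. K-means cluster the (x, y) coordinates of the points.
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
        _, labels, _ = cv2.kmeans(pts, n_clusters, None, criteria, 10,
                                  cv2.KMEANS_RANDOM_CENTERS)
        # 3. Build a padded box around each cluster.
        h, w = gray.shape
        boxes = []
        for k in range(n_clusters):
            cluster = pts[labels.ravel() == k]
            if len(cluster) == 0:
                continue
            x0, y0 = cluster.min(axis=0) - pad
            x1, y1 = cluster.max(axis=0) + pad
            boxes.append((max(0, int(x0)), max(0, int(y0)),
                          min(w, int(x1)), min(h, int(y1))))
        return boxes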

In case some of you are not familiar with K-means clustering: it's a clustering algorithm used to group data together. The gif below pretty much explains it better than any statistician could. There are some complexities to K-means, but I'm not going to belabor this post with them.

kmeans

Boundieboxies does a good job with foreground object detection in images and has comparable speed to the dlib library up to around 500 objects per image, which is a shit ton of objects for an image. Over 500 objects, dlib smokes boundieboxies in performance.

That's an unrealistic scenario, okay?! Okay? Okay….. okay….

okay

Okay… mine isn’t as good at scale.

So running the program and getting my "potential" objects took 0.170 seconds and found 12 of them. This is what the output looks like for boundieboxies:

Screen Shot 2016-02-26 at 2.15.21 PM

Dope. It looks like it did a good job of finding the dog and the cat with a few of these boxes. To see how boundieboxies compares to dlib's find_candidate_object_locations function, the dlib output is below; it ran in 0.143 seconds and found 10 potential objects.

Screen Shot 2016-02-26 at 2.24.28 PM

Both algorithms work well in about the same amount of time. Boundieboxies is better at foreground object detection, while the dlib function does a better job with objects that blend into the background. Both have their uses and comparable runtimes (when there aren't very many objects <cries gently>).
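For reference, calling the dlib function looks roughly like this (find_candidate_object_locations is the real dlib API; the image loading and min_size value here are just illustrative):

    import dlib
    from skimage import io

    # Selective-search-style candidate boxes from dlib.
    image = io.imread("dogandcat.jpg")
    rects = []
    dlib.find_candidate_object_locations(image, rects, min_size=500)
    print("found", len(rects), "candidate boxes")
    for r in rects[:10]:
        print(r.left(), r.top(), r.right(), r.bottom())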

So now I know what to feed my convnet model to determine what each box likely contains, and hopefully the computer will tell me that there is a doggy and a kitty in this picture. Running the different boxes through the model took 0.412 seconds, and the output was:

  • Box 1: Dog, score: 792
  • Box 2: Ship, score: 92
  • Box 3: Cat, score: 228
  • Box 4: Ship, score: 53
  • Box 5: Dog, score: 1013
  • Box 6: Cat, score: 346
  • Box 7: Dog, score: 220
  • Box 8: Dog, score: 201
  • Box 9: Airplane, score: 151
  • Box 10: Cat, score: 304
  • Box 11: Ship, score: 35
  • Box 12: Dog, score: 505

So we basically got what we wanted. The boxes were mostly recognized as dogs and cats. The ship comes in because the bed they're lying on kinda looks like a boat. Even though there are some random objects, the top scores were for dog and cat by a pretty significant margin.

So, in summary, the naive approach fed boxes into the trained convnet model to produce output. Overall it took 0.170 + 0.412 = 0.582 seconds. Nice!
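Scoring the boxes is just cropping, resizing to the 32x32 input CIFAR-10 models expect, and running the crops through the network. A rough sketch in session-style TensorFlow, where names like logits and images_ph stand in for whatever the trained graph actually calls its tensors:

    import cv2
    import numpy as np
    import tensorflow as tf

    CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                       "dog", "frog", "horse", "ship", "truck"]

    def score_boxes(sess, logits, images_ph, image, boxes):
        # Crop each candidate box and resize to the 32x32 CIFAR-10 input size.
        crops = [cv2.resize(image[y0:y1, x0:x1], (32, 32))
                 for (x0, y0, x1, y1) in boxes]
        # One batched forward pass through the trained convnet.
        scores = sess.run(logits, feed_dict={images_ph: np.float32(crops)})
        for i, s in enumerate(scores):
            print("Box %d: %s, score: %d" % (i + 1, CIFAR10_CLASSES[s.argmax()], s.max()))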

Semantic Segmentation

So now let's move on to semantic segmentation. This is something that is new (to me) and took a good bit of figuring out to get working in TensorFlow, as both TensorFlow and this method are relatively new. There are actually a lot of ways to do it, and for the most part it is used in academia, so I had to decode a white paper to get it working. The white paper is here:

http://www.cs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

Nerd Summary:

Semantic segmentation is created by transforming the fully connected layers of the network into convolutional layers (making the network "fully convolutional"), adding some skip layers into the model architecture, and upsampling the predictions so the weighted output has the same shape as the input image.
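A minimal sketch of the core trick, with illustrative shapes and variable names (this is not the post's actual code, and a real FCN also merges in the skip layers before the final upsample):

    import tensorflow as tf

    def fcn_head(features, num_classes=10):
        # features: [batch, h/32, w/32, c] coming out of the convnet trunk.
        c = features.get_shape().as_list()[-1]
        # The dense classifier layer becomes an equivalent 1x1 convolution.
        w1 = tf.Variable(tf.truncated_normal([1, 1, c, num_classes], stddev=0.01))
        coarse = tf.nn.conv2d(features, w1, strides=[1, 1, 1, 1], padding="SAME")
        # Upsample the coarse class scores 32x back to the input resolution.
        w2 = tf.Variable(tf.truncated_normal([64, 64, num_classes, num_classes],
                                             stddev=0.01))
        shape = tf.shape(features)
        out_shape = tf.stack([shape[0], shape[1] * 32, shape[2] * 32, num_classes])
        return tf.nn.conv2d_transpose(coarse, w2, out_shape,
                                      strides=[1, 32, 32, 1], padding="SAME")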

Regular Summary:

There is a cool trick that makes the model look at pieces of the picture as the picture moves through the trained model. What it gives us is classifications as well as a pixel-by-pixel probability of which class each pixel belongs to.

Picture Summary:

It does this.

figure_1.png

Pretty cool, right? It's able to mix what each pixel is with the probability of what it is. Or, as I like to say, "it goes all Predator." In addition to being more precise with more robust output, it actually runs faster too! The output above took 0.371 seconds. Once again it picks up that the bed looks like a boat, and it also thinks the dog's ear is a bird. It clearly sees the dog and cat in the image (red means a high likelihood that the pixel belongs to the given category).

That is a savings of 0.582 – 0.371 = 0.211 seconds per image. For this example that might seem very small, but say you are running this on a million images or the images have more objects than just a couple. That adds up. Also, the output holds a lot more information.

Importance of Writing in Data Science

So recently I have been thinking about data science and how much writing I have been doing lately: blogging for work, blogging about football, and blogging about random things. What has been interesting to observe is how writing and data science complement each other. I figured I would write a quick blog post for anyone in an analytic profession describing how publishing or writing down your work complements your analytic capabilities. I am counting code comments as part of "writing" because in a lot of cases my write-up starts as an outline built from my code comments. So here are a few ways that I think it helps tremendously:

It helps you finish projects

In a lot of cases it is easy to find quick nuggets of information in data. It’s a lot harder to talk about what those nuggets mean and why they are significant. Also, does that nugget lead to another interesting nugget?

A lot of times a completed project exists in the form of code that you wrote that not a lot of people can understand besides you. Comments in your code are important to understand the code. Write-ups are important to understand the findings you got from the model you used.

It gives you a baseline to beat

There have been a lot of times where I never feel like a project is finished because I can constantly do x, y, or z better. While this doesn't get remedied by writing things up directly, I feel like if you have regular weekly write-up deliverables for yourself, you are more likely to write down what you have. In statistics, what you have is rarely a finished product. There are so many models to try, so many transforms to do, and so many other omitted variables to add into the fold. As George Box said, "All models are wrong, but some are useful." It's almost as important to get a baseline out quickly as it is to get the best model out eventually. It gives you something to beat. It also gives you something to show. It's important to emphasize in the write-up that this is just a baseline and more analysis should be done before it is taken seriously. Be aware that project managers and executives will be ready to roll with whatever you say. It's on you to make the model better.

It gives you a clear outline on where to go

In a lot of cases the process of writing up your findings will make you more critical in assessing model violations, as well as giving you a place to go next. There is almost always a conclusion/next-steps portion to a write-up. When you write something up, you spend time explaining to yourself and others why you used what you used. I think the old adage that you don't know a subject until you can teach it holds here: you don't really know the data until you have to write about it. I have never finished a write-up thinking "that's it, I'm done." The process usually helps me think of questions that I didn't have while writing the code, violations of the model assumptions I may have made, or other pieces of data that I might want to add to the analysis.

It helps you communicate to anyone that relies on you for insights

This one is really important. This is also one of the biggest reasons I created this blog. Complex models are cool, but being able to explain complex models in a straightforward manner is extremely valuable. Scratch that. Being able to explain ANY statistical concept in a straightforward manner is extremely valuable. Not everyone you talk to will have a master's or PhD in statistics. Being able to clearly communicate your findings is a skill, not a given. It has been my experience that it is a skill many data scientists put little stock in, to the detriment of their own careers. The data scientist who can explain advanced concepts to the everyman is the real hero, not the one who gets frustrated that nobody understands him or her. I have also seen scenarios where the correct numbers and verbiage went unused simply because the person that matters in the company didn't understand what they were being told. Writing stuff up helps you understand how your audience wants to hear what you are trying to say. That sentence might sound weird, but it's important. Trust me. We are outliers in our train of thought and approaches. The outlier should never simply be ignored, but investigated (corny stats joke, feel free to ignore).

It helps you understand what value your work adds

I have seen countless forays into analytics that are interesting, but what really matters is whether they are valuable. You can run the most bomb model in all of models, and if it gives a 2% accuracy boost over a linear model, why should anyone care? Outside of the learning I might get from running that model in the wild, chances are nobody will ever care. There are certain exceptions to this, but unless you are working on the most cutting-edge methods, I doubt most of us see much value in a 2% accuracy boost. I think a write-up helps you answer the question "Why is what I did important?"

Anyway, that’s my two cents on the matter. As always feel free to say “You don’t know anything about anything” and move on or bash me in the comments. I welcome your criticism. Maybe you think I missed or overlooked something. I welcome your insights.