Hey everyone, it's been a while since I posted anything, so I figured I would share some of the stuff I am working on. For those of you interested in the technical side, I am going to go light on the theory and code. I'm just going to show you the results and some of the interesting bits.
What have I been working on? Teaching computers to see like people do. It's super duper fun. What I want to focus on in this post is recognition in images AFTER I have trained a model. I am not going to cover how I created the model here. I might make a post about it later so stay tuned, but here are the buzzwords for the things I used:
- Python – Programming language
- OpenCV – computer vision library for Python/C++
- dlib – computer vision library for Python/C++
- TensorFlow – Deep learning framework by Google (released to the public a couple months ago)
- Convolutional Neural Networks or convnets
For data I used the CIFAR-10 dataset. It's a fairly popular dataset for image recognition tasks and has well-established benchmarks. It is a collection of images that are classified into 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. My model got a test accuracy of 89.7%, which would have been good enough for 24th place a couple years ago in a data science competition hosted on Kaggle. Not too shabby.
So I created this cool model, but what I wanted to talk about is how to use that model to find stuff in new images. This post is about using semantic segmentation instead of the naive approach to recognize what is in an image. There will be pretty pictures to illustrate what I am talking about. So first, let's look at our input image:
Awww. So cute.
What is the naive approach?
The naive approach in this case is trying to find “interesting” regions in an image to then feed to the model. There are a few ways to do this. There is a built-in function in the dlib library, but before I knew about it I actually built my own using OpenCV. I named it boundieboxies because I don’t take myself, my work, or grammar srsly. Overall, I just try to give as few fucks as possible. What it does is the following:
- Finds interesting points that exist in the image. You can read up on this in the OpenCV documentation, but basically it looks for color gradients and uses some edge detection algorithms.
- Runs K-means clustering on the x, y coordinates of the points found.
- Creates a box around the cluster with some padding.
Boom. Easy as that.
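To make the last step concrete, here is a minimal sketch of turning clustered points into padded boxes. This is not the actual boundieboxies code: the function name `padded_boxes`, the `pad` parameter, and the sample points are all made up for illustration, and it assumes the points and their cluster labels already exist.

```python
def padded_boxes(points, labels, pad, width, height):
    """Return {cluster_label: (x_min, y_min, x_max, y_max)}, padded and clipped to the image."""
    boxes = {}
    for (x, y), label in zip(points, labels):
        if label not in boxes:
            boxes[label] = [x, y, x, y]
        else:
            b = boxes[label]
            b[0] = min(b[0], x)
            b[1] = min(b[1], y)
            b[2] = max(b[2], x)
            b[3] = max(b[3], y)
    # Grow each box by `pad` pixels, but never past the image border.
    return {
        label: (max(b[0] - pad, 0), max(b[1] - pad, 0),
                min(b[2] + pad, width - 1), min(b[3] + pad, height - 1))
        for label, b in boxes.items()
    }


# Hypothetical interest points (x, y) and the cluster each one was assigned to.
points = [(10, 12), (14, 20), (12, 16), (80, 75), (90, 70), (85, 95)]
labels = [0, 0, 0, 1, 1, 1]
boxes = padded_boxes(points, labels, pad=5, width=100, height=100)
print(boxes)
```

The padding is there so the crop handed to the model includes a little context around the object instead of cutting it off at the outermost interest point.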
In case some of you are not familiar with K-means clustering, it's a clustering algorithm used to group data together. The gif below pretty much explains it better than any statistician. There are some complexities to K-means, but I’m not going to belabor this post with them.
Boundieboxies does a good job with foreground object detection in images and has comparable speeds to the dlib library up to around 500 objects per image, which is a shit ton of objects for an image. Over 500 objects, dlib smokes boundieboxies in performance.
That's an unrealistic scenario. Okay?! Okay? Okay….. okay….
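If you prefer code to gifs, here is a bare-bones sketch of K-means on 2D points. It is plain Python for readability, not what boundieboxies actually runs, and the sample points, `k`, and the seed are made up. It just alternates two steps: assign every point to its nearest center, then move each center to the mean of its points.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2D points into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # nothing moved, so we are done
        centers = new_centers
    return labels, centers


# Two obvious blobs of hypothetical interest points.
points = [(10, 12), (14, 20), (12, 16), (80, 75), (90, 70), (85, 95)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

One of the complexities I am skipping: you have to pick `k` up front, and the result can depend on the random starting centers when the clusters are not this well separated.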
Okay… mine isn’t as good at scale.
So running the program took 0.170 seconds and found 12 “potential” objects. This is what the output looks like for boundieboxies:
Dope. It looks like it did a good job of finding the dog and the cat with a few of these boxes. To see how boundieboxies compares to the dlib find_candidate_object_locations function, the dlib output is below. It ran in 0.143 seconds and found 10 potential objects.
Both algorithms work well in about the same amount of time. Boundieboxies is better for foreground object detection and does a good job of that while the dlib function does a good job of blending objects with the background. Both have their uses and a comparable runtime (when there aren’t very many objects <cries gently>).
So now I know what to feed my convnet model to determine what each box likely is and hopefully get the computer to tell me that there is a doggy and a kitty in this picture. Running the different boxes through the model took 0.412 seconds and the output was:
- Box 1: Dog, score: 792
- Box 2: Ship, score: 92
- Box 3: Cat, score: 228
- Box 4: Ship, score: 53
- Box 5: Dog, score: 1013
- Box 6: Cat, score: 346
- Box 7: Dog, score: 220
- Box 8: Dog, score: 201
- Box 9: Airplane, score: 151
- Box 10: Cat, score: 304
- Box 11: Ship, score: 35
- Box 12: Dog, score: 505
So we basically got what we wanted. The boxes were mostly recognized as dogs and cats. The ship comes in because the bed they’re lying on kinda looks like a boat. Even with some random objects in there, the top scores were for dog and cat by a pretty significant margin.
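To check that claim against the numbers, here is a quick sketch that groups the 12 box results above by label. The only assumption is that a higher score means the model is more confident, which is how I am reading them.

```python
from collections import defaultdict

# (label, score) for boxes 1 through 12, copied from the list above.
detections = [
    ("Dog", 792), ("Ship", 92), ("Cat", 228), ("Ship", 53),
    ("Dog", 1013), ("Cat", 346), ("Dog", 220), ("Dog", 201),
    ("Airplane", 151), ("Cat", 304), ("Ship", 35), ("Dog", 505),
]

best = {}                 # highest single-box score per label
total = defaultdict(int)  # summed score per label
for label, score in detections:
    best[label] = max(best.get(label, 0), score)
    total[label] += score

# Labels ordered by their best box, strongest first.
ranked = sorted(best, key=best.get, reverse=True)
for label in ranked:
    print(f"{label}: best={best[label]}, total={total[label]}")
```

Dog and cat come out on top whether you rank by the best single box or by the summed score.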
So in summary the naive approach used boxes to feed into the trained convnet model to produce output. Overall it took 0.170 + 0.412 = 0.582 seconds. Nice!
So now let's move on to semantic segmentation. This is something that is new (to me), and it took a good bit of figuring out to get it working in TensorFlow, as both TensorFlow and this method are relatively new. There are actually a lot of ways to do it, and for the most part it is used in academia, so I had to decode a white paper to get it working. The white paper is here:
Semantic segmentation is done by converting what would be the fully connected layers of the network into convolutional layers, adding some skip layers to the model architecture, and scaling the predictions up so the weighted output has the same shape as the input image.
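The layer conversion is the least obvious part, so here is a toy sketch of the idea in plain Python. It is not my TensorFlow model: the weights, the 3x3 "image", and the function names are invented, and it leaves out the skip layers and the upscaling. The point is only that a layer which scores one fixed-size patch can be slid across a bigger image with the same weights, which gives a grid of scores instead of a single answer.

```python
def dense_on_patch(patch, weights, bias):
    """A fully connected layer on one flattened patch: one score per class."""
    flat = [v for row in patch for v in row]
    return [sum(w * v for w, v in zip(ws, flat)) + b
            for ws, b in zip(weights, bias)]


def dense_as_convolution(image, weights, bias, size):
    """Apply the same weights to every size x size window of a larger image."""
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    return [
        [dense_on_patch([r[j:j + size] for r in image[i:i + size]], weights, bias)
         for j in range(cols)]
        for i in range(rows)
    ]


# Made-up layer for 2x2 patches and 2 classes.
weights = [[1, 1, 0, 0],   # class 0 looks at the top row of the patch
           [0, 0, 1, 1]]   # class 1 looks at the bottom row
bias = [0, -5]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

score_map = dense_as_convolution(image, weights, bias, size=2)
print(score_map)  # a 2x2 grid, each cell holding one score per class
```

A real network does this with learned weights over feature maps rather than raw pixels, but the sliding is the same trick.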
There is a cool trick to make the model look at pieces of the picture as the picture moves through the trained model. What it gives us is classifications as well as pixel-by-pixel probabilities of what class it is a part of.
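Here is a rough sketch of what "pixel-by-pixel probabilities" means in practice. The 2x2 score grid and the three class names are made up, and I am assuming a softmax over the class scores at each pixel, which is the usual way to turn scores into probabilities.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "ship"]

# Hypothetical raw scores: a 2x2 "image" with one score per class per pixel.
score_map = [[[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]],
             [[0.1, 0.3, 1.5], [0.0, 4.0, 0.0]]]

# Probability of every class at every pixel.
prob_map = [[softmax(px) for px in row] for row in score_map]

# Most likely class at every pixel.
label_map = [[classes[max(range(len(p)), key=p.__getitem__)] for p in row]
             for row in prob_map]
print(label_map)
```

Coloring each pixel by the probability of a single class is what produces a heat map per category.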
This is what it produces for our input image:
Pretty cool, right? It's able to combine what each pixel is with the probability of what it is. Or as I like to say, “It goes all Predator”. In addition to being more precise with more robust output, it actually runs faster too! The output above took 0.371 seconds. Once again the bed looks like a boat, and it also thinks the dog’s ear is a bird. It clearly sees the dog and cat in the image (red means a high likelihood that that pixel belongs to the given category).
That is a savings of 0.582 – 0.371 = 0.211 seconds per image. For this example that might seem very small, but say you are running this on a million images or the images have more objects than just a couple. That adds up. Also, the output holds a lot more information.
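To put a number on "that adds up", here is the arithmetic for a million images. It assumes every image takes about as long as this one did, which is a big assumption since I only timed a single image.

```python
naive_seconds = 0.170 + 0.412   # boundieboxies + running the boxes through the convnet
segmentation_seconds = 0.371    # semantic segmentation, end to end

saved_per_image = naive_seconds - segmentation_seconds
images = 1_000_000
saved_hours = saved_per_image * images / 3600

print(f"{saved_per_image:.3f} s saved per image, "
      f"about {saved_hours:.1f} hours per million images")
```

So roughly two and a half days of compute saved per million images, on top of the richer output.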