This is a right whale. Right now, there are only 450 right whales alive in the North Atlantic Ocean. They are the rarest among all large whales.
But fear not. A month ago, a group of people came to their rescue, and those people? They were data scientists.
To track the whales and monitor their population, NOAA (the National Oceanic and Atmospheric Administration) takes aerial photos like the one above and matches them to photos taken in the past. The comparison of whales is done manually by an expert.
You see that this wastes a lot of time and money. Here, I had 3 photos to compare to, but NOAA has thousands, if not millions, of them. If NOAA could automate identifying the whales, they could use the time and money to do something else.
Last August, NOAA held a contest and invited everyone to write a program that can classify the photos. For each photo of a whale, the program had to return a set of numbers—probabilities—to describe who that whale most likely is.
For example, my program might say that for the first photo, there is a 50% chance that it shows Adam, 10% chance Brian, and 40% chance Chloe. For the next photo, it might say 20% chance Adam, 60% chance Brian, and 20% chance Chloe. And so on.
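To make the output concrete, here is a small Python sketch of those two predictions. The whale names and percentages are the ones from the example above; the dictionary layout and the helper `most_likely` are illustrative assumptions, not the contest's actual submission format.

```python
# Hypothetical predictions for two photos (names and numbers from the example above).
predictions = {
    "photo_1": {"Adam": 0.5, "Brian": 0.1, "Chloe": 0.4},
    "photo_2": {"Adam": 0.2, "Brian": 0.6, "Chloe": 0.2},
}

def most_likely(probabilities):
    """Return the whale that was given the highest probability."""
    return max(probabilities, key=probabilities.get)

for photo, probabilities in predictions.items():
    # The probabilities for one photo should add up to 1 (that is, 100%).
    assert abs(sum(probabilities.values()) - 1.0) < 1e-9
    print(photo, "->", most_likely(probabilities))
```

Note that the program does not have to commit to a single name. It reports how confident it is in every candidate, and the best guess is simply the largest number.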
It seems like classifying photos will take a lot of time (and it does), but remember, we are making a computer do this and a computer can compute things much faster than we can. The real question is, how do we tell a computer to do this?
We use data science. Data science involves using computational methods to analyze massive amounts of data and extract knowledge from the analysis. There are 4 steps in total, and for the remainder of my post today, I want to walk you through these 4 steps one by one in the context of the whale challenge.
1. Collect the data
NOAA provided over 11,000 photos of whales, and told us which whale was shown in about 40% of them. The group of photos that we know the answers to and can use to teach our program is known as a training set. We test how well our program learned to recognize the whales with the other 60%, and that group forms a test set.
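As a rough sketch of that split, suppose each photo comes with a file name and, when the whale is known, a label. The file names, the whale names, and the use of `None` for an unknown label are all made up for illustration.

```python
# Hypothetical catalogue: (file name, whale label or None if the whale is unknown).
photos = [
    ("w_001.jpg", "Adam"),
    ("w_002.jpg", None),
    ("w_003.jpg", "Chloe"),
    ("w_004.jpg", None),
    ("w_005.jpg", None),
]

# Photos with a known answer teach the program; the rest are used to test it.
training_set = [(name, whale) for name, whale in photos if whale is not None]
test_set = [name for name, whale in photos if whale is None]

print(len(training_set), "training photos,", len(test_set), "test photos")
```

With 2 labeled photos out of 5, this toy catalogue has the same 40% to 60% proportion as the contest data.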
2. Clean the data
If the data is incompatible with our program, then we need to fix it before we do any analysis. Right whales have white, calcified skin patches on their head called callosities. The pattern of callosities is believed to be unique for each whale, like our fingerprints and DNA. So one good way to distinguish the whales is to zoom in on their head and examine the pattern of their callosities.
But remember, we have 11,000 images. Our solution would not be practical if we had to clean the images manually. Instead, we use machine learning and teach our program how to clean the images. The program learns to find the whale’s head and separate the callosities from the background, such as water and normal skin.
3. Model the data
This means teaching our program to identify all 450 right whales by the pattern of their callosities. Again, we use machine learning. All top contenders in the whale contest used a method called a convolutional neural network (CNN). The idea is simple.
Consider Toastmasters, which has 15,000 clubs. Toastmasters represents the species of right whales and each club represents an individual right whale. Suppose all clubs meet every week and take a photo together. The people who show up may be different from one week to another, that is, the same club can look different in different photos. So how can we say that a given photo still represents Club X?
We can do the following: At each club meeting, we divide the people into 3 groups by their last name: (1) one from A to H, (2) one from I to Q, (3) one from R to Z. To each group, we ask the same 3 questions: (1) What is your zip code? (2) What is your age? (3) What is your income? Each person in a group can give different answers, so we combine them by taking the average or looking at the extreme value (the minimum or the maximum). Three group answers to three questions—that’s 9 numbers—and it’s these 9 numbers that represent a club on a given week.
What just happened, you ask?
- First, I grouped people (they represent pixels) by their last name. I did this to mimic spatial locality in photos, i.e. pixels that are close are likely to represent something meaningful as a group.
- Next, I chose questions that I believe can highlight each person in a club. In other words, each pixel is still important. These questions that draw out a pixel’s “potential,” if you will, are called neurons.
- Finally, I combined the answers within each group. In a CNN, combining nearby values with a weighted sum is known as convolution, and summarizing them by their average or their maximum is known as pooling. (Hence the name, convolutional neural network. Network just means a model.) By combining the answers, we can expect a club’s 9 numbers to stay about the same on two different weeks.
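The club analogy can be sketched in a few lines of Python. Everything here is invented for illustration: the attendees, their answers, and the function names `group_of` and `club_signature`. In a real CNN the "questions" are learned from the training set rather than chosen by hand, so treat this as a picture of the analogy and not of an actual network.

```python
from statistics import mean

# Hypothetical attendees at one meeting: (last name, zip code, age, income).
attendees = [
    ("Adams", 78701, 34, 52000),
    ("Garcia", 78702, 45, 61000),
    ("Kim", 78703, 29, 48000),
    ("Nguyen", 78701, 51, 75000),
    ("Smith", 78704, 38, 58000),
    ("Young", 78702, 62, 66000),
]

def group_of(last_name):
    """Assign a person to one of the 3 groups by the first letter of the last name."""
    first = last_name[0].upper()
    if first <= "H":
        return 0  # A to H
    if first <= "Q":
        return 1  # I to Q
    return 2      # R to Z

def club_signature(people, combine=mean):
    """Return 9 numbers: 3 groups x 3 questions, each combined with `combine`.

    Assumes every group has at least one person in it.
    """
    groups = [[], [], []]
    for last_name, *answers in people:
        groups[group_of(last_name)].append(answers)
    signature = []
    for members in groups:
        for question in range(3):
            signature.append(combine(person[question] for person in members))
    return signature

print(club_signature(attendees))        # combine by average
print(club_signature(attendees, max))   # combine by the maximum instead
```

If one member of a group misses a meeting, the group's average shifts only a little, which is why the 9 numbers can stay roughly stable from week to week.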
And this is the solution: We can identify a photo in the test set by comparing its 9 numbers to the numbers in the training set. If we find numbers in the training set that are similar to the photo’s 9 numbers, then we have found our whale. If not, then we have found a new whale.
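Here is a minimal sketch of that comparison. The stored numbers, the use of straight-line (Euclidean) distance, and the cutoff `threshold` are all assumptions made for the example, and each whale gets 3 numbers instead of 9 to keep it short. The contest entries did not necessarily compare photos this way.

```python
import math

# Hypothetical numbers for whales we already know from the training set.
training_set = {
    "Adam": [0.9, 0.1, 0.4],
    "Brian": [0.2, 0.8, 0.5],
    "Chloe": [0.5, 0.5, 0.9],
}

def identify(signature, known, threshold=0.3):
    """Return the closest known whale, or "new whale" if none is close enough."""
    best_name, best_distance = None, math.inf
    for name, reference in known.items():
        distance = math.dist(signature, reference)  # Euclidean distance
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= threshold else "new whale"

print(identify([0.85, 0.15, 0.4], training_set))  # close to Adam's numbers
print(identify([0.0, 0.0, 0.0], training_set))    # far from every known whale
```

The threshold decides how similar "similar" has to be. Set it too low and known whales get reported as new; set it too high and a new whale gets mistaken for an old one.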
4. Communicate the data
For each photo, the program had to return a set of probabilities to describe who the whale most likely is. NOAA used a pre-determined formula using these probabilities to score how well a program classified the photos. The closer to 0, the better.
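The formula itself is not spelled out here. As a sketch, assume it is multi-class log loss, a common way to score probability predictions in which 0 is a perfect score. The function name `log_loss`, the clipping value `eps`, and the "true" whales below are illustrative assumptions.

```python
import math

def log_loss(true_whales, predictions, eps=1e-15):
    """Average of -log(probability given to the correct whale). Lower is better."""
    total = 0.0
    for whale, probabilities in zip(true_whales, predictions):
        # Clip the probability so that log(0) can never occur.
        p = min(max(probabilities.get(whale, 0.0), eps), 1 - eps)
        total += -math.log(p)
    return total / len(true_whales)

# Hypothetical answers for the two example photos from earlier in the post.
true_whales = ["Adam", "Brian"]
predictions = [
    {"Adam": 0.5, "Brian": 0.1, "Chloe": 0.4},
    {"Adam": 0.2, "Brian": 0.6, "Chloe": 0.2},
]

print(round(log_loss(true_whales, predictions), 3))
```

Under this kind of scoring, a program is rewarded for putting high probability on the correct whale and punished heavily for being confidently wrong.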
The contest ended a month ago, so let me finish by announcing the results. 364 teams entered, and many teams scored 34 points. The number one team? 0.6!
I described the problem and statistics (e.g. 40% of images formed the training set) based on the descriptions on Kaggle and the reports of the top two teams. You can find their reports (they are very interesting) here:
All photographs of whales shown here were provided by NOAA. Please visit their site and learn what they do to save marine life.