Last time, I sort of awkwardly jumped into trying to derive Bayes Rule without explaining what it was for or why you would want it. Sorry about that!
The point of Bayes Rule is that it lets you do the probabilistic version of "abductive reasoning" [1].
Abductive reasoning: If p implies q, and q, then maybe p!
Now, introduce some uncertainty. If basketball players are likely to be tall, and you meet somebody who's tall, then maybe they play basketball. This isn't a valid inference in a binary logic setting, of course -- but we're dealing with probability here.
Now let's talk about how to calculate exactly how much "maybe" there is in "maybe they play basketball".
Now for the derivation. It's pretty straightforward, though I managed a brain-o with it during class.
P(A|B) = P(A ^ B) / P(B) // definition
P(A ^ B) = P(B ^ A) = P(B) * P(A|B) = P(A) * P(B|A) // rearrange that definition, in both directions
OK now, we want a formula for P(B|A) made of P(A|B).
P(B|A) = P(B ^ A) / P(A) // definition
P(B|A) = P(B) * P(A|B) / P(A) // expand definition of P(B^A)
P(B|A) = P(A|B) * P(B) / P(A) // ... and that's it, really. (rearrange slightly)
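(If you want to check the algebra numerically, here's a quick Python sketch -- the joint distribution is completely made up, picked only so the arithmetic is easy:)

# A tiny made-up joint distribution over two binary events A and B.
p_joint = {(True, True): 0.06, (True, False): 0.24,
           (False, True): 0.04, (False, False): 0.66}

p_a = p_joint[(True, True)] + p_joint[(True, False)]   # 0.30
p_b = p_joint[(True, True)] + p_joint[(False, True)]   # 0.10
p_a_and_b = p_joint[(True, True)]                      # 0.06

p_a_given_b = p_a_and_b / p_b   # definition: 0.6
p_b_given_a = p_a_and_b / p_a   # definition: 0.2

# Bayes Rule should give the same answer either way:
assert abs(p_b_given_a - p_a_given_b * p_b / p_a) < 1e-12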
Let's do this: we're going to build a basketball-player classifier. We want to be able to tell whether people are basketball players.
What's the probability that a given person is "tall"? Let's say 0.3 (we could measure this, go out and take samples of people or look at medical records or whatever)
What's the probability that a given person plays basketball? Let's say 0.1. (we could take a survey)
What's the probability that a given basketball player is tall? Say 0.6. We'll say that most basketball players are tall people.
OK, so now we have P(tall), P(basketball), and P(tall | basketball).
Now we want to be able to compute P(basketball | tall). Using Bayes Rule, that's just:
P(basketball | tall) = P(tall | basketball) * P(basketball) / P(tall)
So when we see a person out in the world and they're tall, the probability that they play basketball -- which we've determined WITH MATH -- is:
P(basketball | tall) = 0.6 * 0.1 / 0.3 = 0.2.
Conclusion? Just because somebody is tall doesn't mean they play basketball, at least in this fictional world.
Let's do a little bit more math here.
We want to be able to compute...
P( basketball | tall, basketballshorts, holding-a-basketball, currently-playing-basketball, reads-basketball-news ...)
We can't have measured that directly, unless we had a really, really big dataset. If we had 32 different binary features, we would need to see 2^32 -- over four billion -- different people (at a minimum!) to cover all the different combinations.
But what can Bayes Rule do for us here? We know that if we can compute:
P(features | basketball) * P(basketball) / P(features)
... then we're good!
This leaves us with the question of how to compute P(features) and also P(features | basketball).
We might not have seen the current combination of features in the training data. What can we do?
We can make the simplifying assumption that the features are all mutually independent of one another; then we can calculate their joint probability just by multiplying their individual probabilities together. This works for P(f1, f2, f3, ... | basketball) as well: if the features are all independent given the class, that's just P(f1 | basketball) * P(f2 | basketball) * ... and so on.
So we're set! Now we can compute P(basketball | features).
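(Here's a sketch of what the independence assumption buys us, in Python -- the feature names and per-feature probabilities are all invented for illustration:)

# Hypothetical per-feature probabilities, "estimated" from imaginary data.
p_feature_given_bb = {
    "tall": 0.6,
    "basketballshorts": 0.5,
    "holding-a-basketball": 0.3,
}
p_basketball = 0.1

def p_features_given_class(features, table):
    # Conditional independence given the class: the joint is just a product.
    p = 1.0
    for f in features:
        p *= table[f]
    return p

person = ["tall", "basketballshorts"]
numerator = p_features_given_class(person, p_feature_given_bb) * p_basketball
print(numerator)  # P(features | basketball) * P(basketball) = ~0.03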
How do we operationalize this? How do you use it to make a decision: "Is this person a basketball player?"
What if you want to do multi-way classification? You suspect that everybody either plays basketball, or golf, or badminton, and given a person, you want to decide which. How do you do that?
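One standard way to do it (not something we derived above, but it falls right out of the formula): P(features) is the same no matter which class we're considering, so we can skip dividing by it and just take the class with the biggest numerator. A sketch, with invented priors and likelihood tables:

# Score each class by P(features | class) * P(class); take the argmax.
# All numbers here are made up. The priors sum to 1 because we're
# assuming everybody plays exactly one of the three sports.
priors = {"basketball": 0.5, "golf": 0.3, "badminton": 0.2}
likelihoods = {
    "basketball": {"tall": 0.6, "basketballshorts": 0.5},
    "golf":       {"tall": 0.3, "basketballshorts": 0.01},
    "badminton":  {"tall": 0.3, "basketballshorts": 0.1},
}

def classify(features):
    scores = {}
    for c in priors:
        score = priors[c]
        for f in features:
            score *= likelihoods[c][f]
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["tall", "basketballshorts"]))  # basketball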
We've got the period-classification task we were talking about before... what are some good features you could use for that?
Text classification problems:
- genre / subject assignment?
- spam detection?
- authorship attribution?
- assigning age or gender?
- language identification?
- sentiment analysis? ...
One neat thing about all of these is that there are a bunch of different classification algorithms you can use, and it's often easy to try out a different one. Although we're finding that bigger and better data (and better features) typically beats a fancier learning algorithm, especially in the limit.
Train on this data set:
... and now let's evaluate this one:
newyorker, tall, basketballshorts.
What does that person play, intuitively? Now let's do the math and see what we get.
How do we cope with the situation that our model wants to assign a 0 probability to many events?
Intuitively, just because we didn't see something in the training data, we don't want to assign a 0 probability to it; it's probably not COMPLETELY INCONCEIVABLE.
What if we pretended that we had seen every (possible?) event once before? OK, let's try that. Just write down all the events that we know about from the training set, and we'll say that they happened once.
Now we go through and say that things that really happened t times happened t+1 times. Easy enough.
What do we do for the denominator? Since we ended up adding V fake observations (one for each of the V possible events, whether it happened or not), we're going to have to add V to the denominator too.
This is called Laplace Smoothing, after Pierre-Simon Laplace, who was pretty cool. Go look at his Wikipedia article.
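(Here's add-one counting as a quick Python sketch -- the events and counts are made up:)

# Made-up raw counts; "newyorker" is a possible event we never saw.
counts = {"tall": 3, "basketballshorts": 2, "newyorker": 0}
V = len(counts)               # number of possible events
N = sum(counts.values())      # number of real observations

def smoothed_prob(event):
    # Pretend every possible event happened once more than it did;
    # adding V to the denominator keeps the distribution summing to 1.
    return (counts[event] + 1) / (N + V)

print(smoothed_prob("newyorker"))  # 1/8 = 0.125, instead of 0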
Alright, let's try this again.
** re-do the example, try it again **
sportsillustrated, tall, basketballshorts.
We can't even look up sportsillustrated in our table at all!
OK, here's a thing we can do. We'll map events that we've never seen before to a special "UNSEEN" event. In NLP, this is more typically called OOV -- out of vocabulary.
One more time! Add the magical UNSEEN event, tweak the probabilities again.
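(A sketch of the UNSEEN lookup, with made-up probabilities:)

# Made-up smoothed probabilities, with one special UNSEEN entry.
probs = {"tall": 0.5, "basketballshorts": 0.375, "UNSEEN": 0.125}

def lookup(event):
    # Anything that's not in the table gets mapped to UNSEEN.
    return probs.get(event, probs["UNSEEN"])

print(lookup("sportsillustrated"))  # 0.125 -- no more table misses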
NB: you're going to have to do this for your homework. We're basically doing the homework manually right now.
What kinds of events do you think we might see that we haven't seen before? ...
Add-one smoothing is pretty ham-fisted. There's been a lot of work on better smoothing methods; the general intuition is that we use the data set we've been given to estimate how likely it is that we'll see a brand-new event in the future. The question is: how much are we going to believe the data? If you believe the data completely, you're taking what's called the MAXIMUM LIKELIHOOD ESTIMATE, or MLE -- the estimate that makes your training data most likely.
With smoothing in general, what we're doing is taking some of the probability mass (remember, to have a proper probability distribution, we only have 1.0 of it to allocate), and we're REDISTRIBUTING THE WEALTH from events that we've seen a lot, to events that we haven't seen as much. Including events that haven't happened at all.
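(You can watch the redistribution happen -- compare the MLE estimates with the add-one estimates on some made-up counts:)

# Made-up counts: one frequent event, one rare one, one unseen one.
counts = {"tall": 8, "short": 2, "newyorker": 0}
N = sum(counts.values())   # 10
V = len(counts)            # 3

for event, c in counts.items():
    mle = c / N                   # believe the data completely
    add_one = (c + 1) / (N + V)   # smoothed
    print(event, round(mle, 3), round(add_one, 3))

# tall      0.8  0.692   <- the frequent event gives up some mass...
# short     0.2  0.231   <- ...which goes to the rarer events,
# newyorker 0.0  0.077   <- including the one we never saw at all.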
Chapter 4 has a bunch of good information about smoothing. We'll talk about this more when we do some n-gram models, which is coming up.
What kinds of numbers are probabilities? They're definitely between 0 and 1.
Hey, so, what happens when you multiply probabilities together? What happens when you multiply lots and lots of probabilities together?
We get some really small numbers, really fast. Pretty soon, we could get an underflow. How to avoid that?
Well, it turns out that...
log(a * b) = log(a) + log(b) // true in general, for any positive a and b
So what we can do, and what's often done in practice when we're dealing with lots of probabilities, is work with log probabilities instead. Sometimes even with what are called positive log probabilities -- negated log probs -- so that the numbers come out positive. Because, what's the log of a number less than 1.0?
If we're dealing with positive log probs, what's the interpretation here? What does a really big positive log prob mean?
Then, how do we get the probability back, say if we need that?
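(Here's the whole thing in a short Python sketch -- the underflow, the log-space fix, and getting back out; the probabilities are made up:)

import math

probs = [0.01] * 200   # two hundred small probabilities, made up

# Multiplying them directly underflows to zero:
product = 1.0
for p in probs:
    product *= p
print(product)         # 0.0 -- but the true value is 1e-400, not zero!

# Adding logs instead stays in a comfortable numeric range:
log_product = sum(math.log(p) for p in probs)
print(log_product)     # about -921.03
print(-log_product)    # the "positive log prob" version: about 921.03

# To get a probability back, exponentiate: math.exp(log_product).
# (Here that would underflow to 0.0 again -- which is exactly why
# we stay in log space for as long as possible.)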
[1] http://humanities.uchicago.edu/faculty/goldsmith/Industrial/Probability.htm