Sometimes, a simple function doesn't seem good enough.
Here's one possible function we can draw, going from a set of players to a set of dice rolls.
There are other, similar functions we could choose by changing where the arrows end... but any single function like this one can only give us a single outcome for each player.
As a model of dice rolling, this function is pretty useless! Each time a player rolls, hopefully there's actually a 1/6 (or about 17%) chance of each number turning up, not a 100% chance of just one of them.
How can we do better?
Sampling different paths
Let's try writing our diagrams a little differently. Focus just on Max, for a second. He rolls five times: 5, 1, 2, 4, 2.
Don't worry about the order of the rolls.
Let's merge all the arrows that arrive at the same place into single arrows. An arrow's thickness represents how frequently we follow it—so the arrow that arrives at 2 is thicker than the others.
We can show the same thing using a histogram.
Now divide his roll counts by 5—the number of times he rolled. This gives a normalized histogram in terms of the fraction of total rolls that turn up each way.
Now that we've normalized the counts, the sum of all the points in the histogram is 1, or 100%.
When we've got fractions that sum to one – or percentages that sum to 100% – across a set of outcomes, we have a probability distribution.
Our distribution says: given that Max is rolling the dice, there is a 100% chance that one of the six sides will turn up. According to what we've seen, 20% of the time Max arrives at 1, 40% of the time at 2, and so on.
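The normalization step above is easy to sketch in code. Here's a minimal version, using Max's five rolls as given:

```python
from collections import Counter

# Max's five rolls, as listed above.
rolls = [5, 1, 2, 4, 2]

# Count how often each face turned up, then divide by the number of
# rolls to get the fraction of rolls that arrive at each face.
counts = Counter(rolls)
distribution = {face: counts.get(face, 0) / len(rolls) for face in range(1, 7)}

# The fractions form a probability distribution: they sum to 1
# (up to floating-point rounding).
total = sum(distribution.values())
```

Checking a couple of values: `distribution[2]` is 0.4 (2 turned up twice in five rolls) and `distribution[1]` is 0.2, matching the 40% and 20% above.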
Now split up the histogram bars, and arrange them vertically.
This is redundant: the bars show the same thing as the thickness of the matching arrows. But we'll usually show both in our diagrams anyway. They complement each other, as we'll see shortly. Note that we also draw the paths never followed, in grey, dashed lines.
Of course, five rolls isn't enough to get a good model of six-sided dice! We can't even roll once for every side. But if Max keeps rolling and rolling for a while, and we draw up all his rolls, hopefully we'll end up with something more uniform.
Whichever distribution we arrive at, at least we're now in a better position to model dice rolls when the outcome isn't fixed to one value or another.
Channels
To properly model probability, we'll use channels. Here's an abstract example:
We're not very likely to reach from , or from .
For each point in the source set, we get an entire distribution on the target, just like we did for Max's dice rolls.
At the start of this series, I said we'd be dealing with functions. Now I've said that we'll be working with channels. But don't channels break the rules of functions? Multiple arrows start at the same place!
Well, when we want to get formal about it, a channel really can be seen as a function. It's just not as simple as the functions we've seen so far.
We'll keep breaking the rules, and draw our channels with multiple arrows starting from the same source point. While it's informal and "wrong", visually it'll be nice because it shows how a channel is a collection of hypotheses about how different outcomes lead to each other.
Suppose we're asking questions about a specific disease. There's a set which has two points: does someone have the disease (), or don't they ()? And there's another set , also with two points: does that person test positive for the disease (), or not ()?
A channel that connects to describes how strongly we believe in the different ways this test can turn out.
If the person really has the disease, our channel here says the test will probably be positive—this is the large arrow labeled T pos., for true positive. And that's good, since we want the truth about whether they're infected!
But there's some small chance that the test will fail, and tell us falsely that someone with the disease is actually disease-free. That's the false negative case.
When someone doesn't actually have the disease, there are also the true negative and false positive arrows.
That diagram shows one example of a channel from to , with specific probabilities. It describes a pretty good test that usually gives accurate results. To describe a badly-designed test that usually gives a positive result, even when the disease is absent, we could write a different channel with a thick F pos. arrow.
And in a better world, maybe our tests would be perfectly accurate. That'd look like this:
We're back to where we started: a channel with all its weight on one arrow per starting point is just a simple function!
This isn't a useful model of the way disease tests actually work. Real tests aren't always reliable, just like (sadly) Max doesn't always roll ones.
Channel composition
Say we have two channels, and the target of the first channel is the same as the source of the second channel.
These can be glued together to form a composite that – like a composite of simple functions – forgets what happens in between.
On the right hand side, we go directly to without passing through . For consistency, we'll keep the same colours for distributions reached from the same starting point: here the composite is a function , and it uses the same arrow colours as the first function , because both start from .
The weighted arrows make it easy to see which individual paths are followed often (like ) and which aren't ().
To get the composite, we collapse all the paths that go from one point in the source, to one point in the target. For example, here are the paths that get collapsed to form the arrow :
Even though there's a low chance of following the path , there's still a decent chance of following the composite arrow because we could also have reached through or .
Paths get collapsed according to this rule: multiply along, then sum across.
First, multiply along: if there's a chance I'll reach from , and then a chance I'll reach from , then there's only a chance I'll reach from .
Then sum across: there are actually three paths from to . After repeating the multiplication step for each path, we sum the results together.
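The "multiply along, then sum across" rule is exactly matrix multiplication of the two channels' probability tables. Here's a sketch with invented probabilities, writing each channel as a nested dict:

```python
# Two channels as nested dicts: channel[source][target] = probability.
# The probabilities are made up for illustration.
first = {
    "a": {"x": 0.7, "y": 0.3},
    "b": {"x": 0.2, "y": 0.8},
}
second = {
    "x": {"u": 0.5, "v": 0.5},
    "y": {"u": 0.9, "v": 0.1},
}

def compose(f, g):
    """Collapse paths: multiply along each path, then sum across paths."""
    targets = next(iter(g.values())).keys()
    composite = {}
    for src, mid_dist in f.items():
        composite[src] = {
            tgt: sum(p_mid * g[mid][tgt] for mid, p_mid in mid_dist.items())
            for tgt in targets
        }
    return composite

result = compose(first, second)
```

For instance, the composite probability of going from `a` to `u` collapses two paths: `0.7 * 0.5` (through `x`) plus `0.3 * 0.9` (through `y`), giving 0.62. Each row of the composite is still a distribution, summing to 1.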
Remembering and returning?
Last time, we saw that under certain conditions, we can invert a function to get a function . Can we do the same with our channel?
It turns out the situation is not as straightforward as it was with simple functions.
If there's an chance I'll reach from , and a chance I'll reach from , what are the inverted probabilities I'll reach from , and from ?
Before we can calculate that, we'll need a little more information. Specifically, we'll need the probability that we had started from each point in in the first place. Our channel doesn't tell us that—it only tells us the probability that we would arrive at a certain point in , given that we'd started from a certain point in . Consider: if we actually start from of the time, and never from , does it make sense to say that we'd ever arrive back at , once we reverse the arrows?
We can't just flip the arrows around. For one, doesn't form a distribution over ! We could force it to be a distribution by normalizing by the total, for example . However, that would actually assume a 50:50 chance of starting from or , which is not necessarily true!
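To see the failure concretely, here's a sketch with invented probabilities (the post's actual values appear in its diagrams):

```python
# A channel from {a, b} to {x, y}; the probabilities here are
# invented for illustration.
channel = {
    "a": {"x": 0.9, "y": 0.1},
    "b": {"x": 0.4, "y": 0.6},
}

# Naively "flip the arrows": for each target point, collect the
# probabilities of all the arrows arriving there.
flipped = {
    tgt: {src: dist[tgt] for src, dist in channel.items()}
    for tgt in ("x", "y")
}

# The flipped rows are not distributions: they needn't sum to 1.
row_total = sum(flipped["x"].values())  # 0.9 + 0.4, which is not 1

# Dividing each row by its total would force it to be a distribution,
# but that silently assumes we start from a and b equally often:
# a 50:50 prior that the channel alone can't justify.
```

Each original row sums to 1 by construction, but the flipped rows here sum to 1.3 and 0.7, so flipping alone can't produce a channel in the opposite direction.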
Like with simple functions, we can run into the problem of unfollowed paths when trying to invert channels: when the probability of an arrow is zero, we never follow it – so we can't invert it.
Unlike a simple function, a channel could reach any target point from any source point. And the only time we absolutely have no information about how to go backwards is when all of the arrows arriving at a given target point have zero probability.
The channel on the right hand side is more problematic in general, because no matter where we start in , we never arrive at . However, if it turns out that we always start from , then we won't be able to invert the channel on the left, either, because it specifically can't get to from .
The solution to this problem is called Bayes' theorem, and we'll come back to it in a later post.
It will be extremely useful. We already have a channel that tells us how likely we are to test positive or negative for a disease, given that we have it or not. Bayes' theorem will let us invert it, to get a channel that tells us how likely we are to have the disease or not, given that we test positive or negative – which in practice is the information we usually need.
"Conditional probability distribution"
In the usual way probability is taught, channels are called conditional probability distributions. Their functional nature is by no means neglected, but it fades into the background much of the time.
On the other hand, it's very common for an experienced researcher to write for a value in a conditional distribution, intuitively aware that they're talking about something function-like that starts from values of and reaches distributions over .
Here, we've kept functions in the foreground since the beginning. Taking this view, channels aren't just one more tool we'll use for probability. They are the atoms of the theory – its moving parts – and almost all of our tools will be special cases of channels and their composition.
Next, we'll see two of these special cases – and the fundamentally dual roles they serve.
