16 min read
Published: 11 December 2023
Updated: 14 May 2024

Channels

Sometimes, a simple function doesn't seem good enough.

Here's one possible function we can draw, going from a set of players to a set of dice rolls.

Dice function

There are other, similar functions we could choose by changing where the arrows end... but any single function like this one can only give us a single outcome for each player.

As a model of dice rolling, this function is pretty useless! Each time a player rolls, hopefully there's actually a 1/6 (or about 17%) chance of each number turning up, not a 100% chance of just one of them.

How can we do better?

Sampling different paths

Let's try writing our diagrams a little differently. Focus just on Max, for a second. He rolls five times: 5, 1, 2, 4, 2.

Samples

Don't worry about the order of the rolls.

Let's merge all the arrows that arrive at the same place into single arrows. An arrow's thickness represents how frequently we follow it—so the arrow that arrives at 2 is thicker than the others.

Distribution from samples

We can show the same thing using a histogram.

Histogram

Now divide his roll counts by 5—the number of times he rolled. This gives a normalized histogram in terms of the fraction of total rolls that turn up each way.

Normalized histogram

Now that we've normalized the counts, the sum of all the points in the histogram is 1, or 100%.

$20\% + 40\% + 0\% + 20\% + 20\% + 0\% = 100\%$

When we've got fractions that sum to one – or percentages that sum to 100% – across a set of outcomes, we have a probability distribution.

Our distribution says: given that Max is rolling the dice, there is a 100% chance that one of the six sides will turn up. According to what we've seen, 20% of the time Max arrives at 1, 40% of the time at 2, and so on.
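Turning Max's samples into a normalized histogram takes only a few lines. Here's a minimal sketch in Python, using the five rolls listed above:

```python
from collections import Counter

rolls = [5, 1, 2, 4, 2]  # Max's five rolls from above

counts = Counter(rolls)
# Fraction of rolls landing on each side; Counter returns 0 for
# sides that never turned up, so those get probability 0.
distribution = {side: counts[side] / len(rolls) for side in range(1, 7)}

print(distribution)
# {1: 0.2, 2: 0.4, 3: 0.0, 4: 0.2, 5: 0.2, 6: 0.0}

# The fractions form a probability distribution: they sum to 1.
assert abs(sum(distribution.values()) - 1.0) < 1e-9
```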

Now split up the histogram bars, and arrange them vertically.

Distribution diagram

This is redundant: the bars show the same thing as the thickness of the matching arrows. But we'll usually show both in our diagrams anyway. They complement each other, as we'll see shortly. Note that we also draw the paths-never-followed – grey and dashed.

Of course, five rolls isn't enough to get a good model of six-sided dice! We can't even roll once for every side. But if Max keeps rolling and rolling for a while, and we draw up all his rolls, hopefully we'll end up with something more uniform.
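We can watch this happen in a quick simulation. The sketch below rolls a fair (pseudo-random) die many times and prints each side's frequency, which lands close to $\frac{1}{6}\approx 17\%$:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is reproducible
n = 60_000
rolls = [random.randint(1, 6) for _ in range(n)]

counts = Counter(rolls)
for side in range(1, 7):
    freq = counts[side] / n
    # each frequency comes out close to 1/6 ≈ 0.167
    print(f"side {side}: {freq:.3f}")
```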

Fair distribution

Whichever distribution we arrive at, at least we're now in a better position to model dice rolls when the outcome isn't fixed to one value or another.

Channels

To properly model probability, we'll use channels. Here's an abstract example:

Channel

We're not very likely to reach $\mathsf{y}_1$ from $\mathsf{x}_1$, or $\mathsf{y}_2$ from $\mathsf{x}_2$.

For each point in the source set, we get an entire distribution on the target, just like we did for Max's dice rolls.

At the start of this series, I said we'd be dealing with functions. Now I've said that we'll be working with channels. But don't channels break the rules of functions? Multiple arrows start at the same place!

Well, when we want to get formal about it, a channel really can be seen as a function. It's just not as simple as the functions we've seen so far.

We'll keep breaking the rules, and draw our channels with multiple arrows starting from the same source point. While it's informal and "wrong", visually it'll be nice because it shows how a channel is a collection of hypotheses about how different outcomes lead to each other.

Suppose we're asking questions about a specific disease. There's a set $\mathsf{D}$ which has two points: does someone have the disease ($d_+$), or don't they ($d_-$)? And there's another set $\mathsf{T}$, also with two points: does that person test positive for the disease ($t_+$), or not ($t_-$)?

A channel that connects $\mathsf{D}$ to $\mathsf{T}$ describes how strongly we believe in the different ways this test can turn out.

Disease test channel

If the person really has the disease, our channel here says the test will probably be positive—this is the large arrow labeled T pos., for true positive. And that's good, since we want the truth about whether they're infected!

But there's some small chance that the test will fail, and tell us falsely that someone with the disease is actually disease-free. That's the false negative case.

When someone doesn't actually have the disease, there are also the true negative and false positive arrows.

That diagram shows one example of a channel from $\mathsf{D}$ to $\mathsf{T}$, with specific probabilities. It describes a pretty good test that usually gives accurate results. To describe a badly-designed test that usually gives a positive result, even when the disease is absent, we could write a different channel with a thick F pos. arrow.
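In code, one natural representation of a channel is a mapping from each source point to a distribution over the target. The probabilities below are made up for illustration (the figure's exact numbers aren't given in the text), but they describe a "pretty good" test in the same spirit:

```python
import random

# A channel D -> T, written as a mapping from each source point
# to a distribution over the target. These numbers are illustrative.
test_channel = {
    "d+": {"t+": 0.95, "t-": 0.05},  # true positive / false negative
    "d-": {"t+": 0.10, "t-": 0.90},  # false positive / true negative
}

# Each source point carries an entire distribution over the target:
for dist in test_channel.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

def sample(channel, x):
    """Follow one arrow out of x, chosen with the channel's probabilities."""
    outcomes, weights = zip(*channel[x].items())
    return random.choices(outcomes, weights=weights)[0]

print(sample(test_channel, "d+"))  # usually "t+", occasionally "t-"
```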

And in a better world, maybe our tests would be perfectly accurate. That'd look like this:

Deterministic channel

We're back to where we started: a channel with all its weight on one arrow per starting point is just a simple function!

Function

This isn't a useful model of the way disease tests actually work. Real tests aren't always reliable, just like (sadly) Max doesn't always roll ones.

Channel composition

Say we have two channels, and the target of the first channel is the same as the source of the second channel.

Two channels

These can be glued together to form a composite that – like a composite of simple functions – forgets what happens in between.

Channel composition

On the right-hand side, we go directly to $\mathsf{Z}$ without passing through $\mathsf{Y}$. For consistency, we'll keep the same colours for distributions reached from the same starting point: here the composite is a channel $\mathsf{X}\to\mathsf{Z}$, and it uses the same arrow colours as the first channel $\mathsf{X}\to\mathsf{Y}$, because both start from $\mathsf{X}$.

The weighted arrows make it easy to see which individual paths are followed often (like $x_2 \mapsto y_1 \mapsto z_2$) and which aren't ($x_1 \mapsto y_1 \mapsto z_1$).

To get the composite, we collapse all the paths that go from one point in the source, to one point in the target. For example, here are the paths that get collapsed to form the arrow $x_1 \mapsto z_1$:

Composition single arrow

Even though there's a low chance of following the path $x_1 \mapsto y_1 \mapsto z_1$, there's still a decent chance of following the composite arrow $x_1 \mapsto z_1$, because we could also have reached $z_1$ through $y_2$ or $y_3$.

Paths get collapsed according to this rule: multiply along, then sum across.

First, multiply along: if there's a $10\%$ chance I'll reach $y_1$ from $x_1$, and then a $20\%$ chance I'll reach $z_1$ from $y_1$, then there's only a $10\% \times 20\% = 2\%$ chance I'll reach $z_1$ from $x_1$ along that particular path.

Then sum across: there are actually three paths from $x_1$ to $z_1$. After repeating the multiplication step for each path, we sum the results together.
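The multiply-along-then-sum-across rule is short enough to write out directly. Here's a sketch using the dict-of-distributions representation; the channels are hypothetical, chosen so that $x_1 \mapsto y_1$ has probability $10\%$ and $y_1 \mapsto z_1$ has probability $20\%$, as in the text:

```python
def compose(first, second):
    """Glue two channels: multiply along each path, then sum across paths."""
    composite = {}
    for x, dist_y in first.items():
        dist_z = {}
        for y, p_xy in dist_y.items():
            for z, p_yz in second[y].items():
                # multiply along the path x -> y -> z, and sum across
                # every path that ends at the same z
                dist_z[z] = dist_z.get(z, 0.0) + p_xy * p_yz
        composite[x] = dist_z
    return composite

# Hypothetical channels for illustration:
first = {"x1": {"y1": 0.10, "y2": 0.90}}
second = {
    "y1": {"z1": 0.20, "z2": 0.80},
    "y2": {"z1": 0.50, "z2": 0.50},
}

composite = compose(first, second)
# x1 -> z1 collects 0.10 * 0.20 = 0.02 from the path through y1,
# plus 0.90 * 0.50 = 0.45 through y2, for 0.47 in total.
print(composite["x1"]["z1"])
```

Notice that each output distribution still sums to 1: composing two channels gives back a channel.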

Remembering and returning?

Last time, we saw that under certain conditions, we can invert a function $\mathsf{X}\to\mathsf{Y}$ to get a function $\mathsf{Y}\to\mathsf{X}$. Can we do the same with our channel?

It turns out the situation is not as straightforward as it was with simple functions.

If there's a $6\%$ chance I'll reach $y_1$ from $x_1$, and a $33\%$ chance I'll reach $y_1$ from $x_2$, what are the inverted probabilities that I'll reach $x_1$ from $y_1$, and $x_2$ from $y_1$?

Before we can calculate that, we'll need a little more information. Specifically, we'll need the probability that we had started from each point in $\mathsf{X}$ in the first place. Our channel doesn't tell us that—it only tells us the probability that we would arrive at a certain point in $\mathsf{Y}$, given that we'd started from a certain point in $\mathsf{X}$. Consider: if we actually start from $x_1$ $100\%$ of the time, and never from $x_2$, does it make sense to say that we'd ever arrive back at $x_2$, once we reverse the arrows?

Channel mask

We can't just flip the arrows around. For one thing, $6\% + 33\%$ doesn't form a distribution over $\mathsf{X}$! We could force it to be a distribution by normalizing by the total, for example $\frac{6\%}{6\%+33\%}\approx 15\%$. However, that would implicitly assume a 50:50 chance of starting from $x_1$ or $x_2$, which is not necessarily true!
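To see concretely how the answer depends on where we start, here's a small numeric sketch. The $6\%$ and $33\%$ arrows come from the text; the two priors over $\mathsf{X}$ are made up for illustration:

```python
def invert_at(channel, prior, y):
    """Probability of each source point x, given that we arrived at y.

    Needs a prior over the source set: the channel alone isn't enough.
    """
    joint = {x: prior[x] * channel[x][y] for x in prior}
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}

# Only the arrows into y1 matter here; 6% and 33% are from the text.
channel = {"x1": {"y1": 0.06}, "x2": {"y1": 0.33}}

# A 50:50 prior reproduces the naive normalization, about 15% for x1...
print(invert_at(channel, {"x1": 0.5, "x2": 0.5}, "y1"))

# ...but a prior that heavily favours x1 gives a very different answer.
print(invert_at(channel, {"x1": 0.99, "x2": 0.01}, "y1"))
```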

Like with simple functions, we can run into the problem of unfollowed paths when trying to invert channels: when the probability of an arrow is zero, we never follow it – so we can't invert it.

Unlike a simple function, a channel could reach any target point from any source point. So the only time we have absolutely no information about how to go backwards is when all of the arrows arriving at a given target point have zero probability.

Zero arrows comparison

The channel on the right-hand side is more problematic in general, because no matter where we start in $\mathsf{X}$, we never arrive at $y_3$. However, if it turns out that we always start from $x_1$, then we won't be able to invert the channel on the left either, because it specifically can't get to $y_3$ from $x_1$.

The solution to this problem is called Bayes' theorem, and we'll come back to it in a later post.

It will be extremely useful. We already have a channel that tells us how likely we are to test positive or negative for a disease, given that we have it or not. Bayes' theorem will let us invert it, to get a channel that tells us how likely we are to have the disease or not, given that we test positive or negative – which in practice is the information we usually need.

"Conditional probability distribution"

In the usual way that probability is taught, channels are called conditional probability distributions. Their functional nature is by no means neglected, but it does fade into the background much of the time.

On the other hand, it's very common for an experienced researcher to write $p(y~|~x)$ for a value in a conditional distribution, intuitively aware that they're talking about something function-like that starts from values of $\mathsf{X}$ and reaches distributions over $\mathsf{Y}$.

Here, we've kept functions in the foreground since the beginning. Taking this view, channels aren't just one more tool we'll use for probability. They are the atoms of the theory – its moving parts – and almost all of our tools will be special cases of channels and their composition.

Next, we'll see two of these special cases – and the fundamentally dual roles they serve.