Bayesian Inference
Notes on the 3Blue1Brown series "Probabilities of probabilities," which provides an introduction to the core ideas of Bayesian inference.
Introduction: The Bayesian View
Bayesian inference is a framework for thinking about probability as a measure of belief in a proposition. The core idea is to update our beliefs in light of new evidence. This contrasts with the frequentist interpretation, which treats probability as the long-run frequency of an event.
Bayes' Theorem
The mathematical engine of Bayesian inference is Bayes' Theorem. It tells us how to update our belief in a hypothesis ($H$) after observing some evidence ($E$).
$$
P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}
$$
Let's break down the terms:
* $P(H|E)$: The Posterior. This is the probability of our hypothesis being true, given the evidence. This is what we want to calculate.
* $P(E|H)$: The Likelihood. This is the probability of observing the evidence, assuming our hypothesis is true.
* $P(H)$: The Prior. This is our initial belief in the hypothesis, before we've seen any evidence.
* $P(E)$: The Marginal Likelihood. This is the total probability of observing the evidence, under all possible hypotheses. It acts as a normalization constant.
In practice, we often write the theorem as:
$$
\text{Posterior} \propto \text{Likelihood} \cdot \text{Prior}
$$
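To make the terms concrete, here is a minimal sketch in Python. The numbers (a condition with 1% prevalence, a test with 95% sensitivity and a 5% false-positive rate) are my own illustrative assumptions, not from the series:

```python
# Bayes' Theorem with hypothetical numbers for a diagnostic test.
prior = 0.01            # P(H): prevalence of the condition
likelihood = 0.95       # P(E|H): sensitivity (true positive rate)
false_positive = 0.05   # P(E|not H): false positive rate

# Marginal likelihood P(E): total probability of a positive test,
# summed over both hypotheses (has condition / doesn't).
marginal = likelihood * prior + false_positive * (1 - prior)

# Posterior P(H|E): probability of the condition given a positive test.
posterior = likelihood * prior / marginal
print(f"P(H|E) = {posterior:.3f}")  # ≈ 0.161
```

Even with an accurate test, the low prior keeps the posterior well below 50%, which is exactly the kind of update Bayes' Theorem formalizes.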
From Numbers to Distributions
In many real-world problems, we are not interested in a single probability, but in a continuous range of possibilities. For example, instead of asking "is this coin fair?", we might ask "what is the probability of this coin landing heads?". This is a value that could be anywhere between 0 and 1.
In the Bayesian framework, we can represent our belief about this unknown probability with a probability distribution.
The Beta Distribution
The Beta distribution is a family of continuous probability distributions defined on the interval [0, 1]. It is a very natural choice for representing a belief about a probability. It is defined by two positive shape parameters, $\alpha$ and $\beta$.
$$
\text{Beta}(\alpha, \beta)
$$
- The mean of the distribution is $\frac{\alpha}{\alpha + \beta}$.
- The shape of the distribution can be interpreted as representing the knowledge gained from $\alpha - 1$ "successes" and $\beta - 1$ "failures".
For example, a flat prior (representing no knowledge) can be modeled with a Beta(1, 1) distribution, which is a uniform distribution.
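A quick sketch of these properties using scipy (the parameter choices are illustrative):

```python
# Exploring Beta distributions with scipy.
from scipy.stats import beta

# Beta(1, 1) is the uniform distribution: a flat prior over [0, 1].
flat = beta(1, 1)
print(flat.pdf(0.2), flat.pdf(0.8))  # 1.0 1.0 -- every value equally likely

# Beta(8, 4) encodes the knowledge of 7 "successes" and 3 "failures".
b = beta(8, 4)
print(b.mean())  # alpha / (alpha + beta) = 8/12 ≈ 0.667
```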
Conjugate Priors
The Beta distribution has a special relationship with the binomial distribution (which describes the number of successes in a fixed number of independent trials, each with the same probability of success). The Beta distribution is a conjugate prior for the binomial likelihood.
This means that if you:
1. Start with a prior belief about a probability, represented by a Beta distribution, $\text{Beta}(\alpha, \beta)$.
2. Observe new evidence in the form of $k$ successes and $n-k$ failures.
3. Then your posterior belief is also a Beta distribution, with updated parameters:
$$
\text{Beta}(\alpha + k, \beta + n - k)
$$
This makes calculations much easier and provides a very intuitive way to think about updating beliefs. You just add the number of successes to $\alpha$ and the number of failures to $\beta$.
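Because the update is just parameter addition, it fits in a couple of lines. A minimal sketch (the function name `update_beta` is my own):

```python
# Conjugate Beta-binomial update: add successes to alpha, failures to beta.
def update_beta(alpha, beta_param, successes, failures):
    return alpha + successes, beta_param + failures

# Updating in one batch or sequentially gives the same posterior:
print(update_beta(1, 1, 7, 3))                      # (8, 4)
print(update_beta(*update_beta(1, 1, 4, 1), 3, 2))  # (8, 4)
```

The second line shows a nice property of conjugate updating: it doesn't matter whether the evidence arrives all at once or one observation at a time.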
Example: Coin Flipping
Suppose you want to determine the fairness of a coin.
* Prior: You have no idea if it's fair, so you start with a uniform prior, $\text{Beta}(1, 1)$.
* Evidence: You flip the coin 10 times and get 7 heads and 3 tails.
* Posterior: Your new belief about the coin's probability of landing heads is represented by the distribution $\text{Beta}(1+7, 1+3) = \text{Beta}(8, 4)$.
The peak (mode) of this new distribution is at $\frac{\alpha - 1}{\alpha + \beta - 2} = \frac{7}{10} = 0.7$, which is our most likely estimate for the probability of heads. But the distribution also tells us about our uncertainty: there is still a real chance the coin is fair, or even biased towards tails.
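To quantify that uncertainty, a short sketch of posterior summaries with scipy:

```python
# Posterior for the coin example: Beta(8, 4) after 7 heads and 3 tails.
from scipy.stats import beta

posterior = beta(8, 4)

# Mode (the peak): (alpha - 1) / (alpha + beta - 2)
print((8 - 1) / (8 + 4 - 2))   # 0.7

# Remaining belief that the coin actually favors tails (p < 0.5):
print(posterior.cdf(0.5))      # ≈ 0.113

# A 90% credible interval for the probability of heads:
print(posterior.interval(0.9)) # roughly (0.44, 0.86)
```

So after only 10 flips, roughly 11% of our belief still sits below 0.5: the evidence leans towards a heads bias but is far from conclusive.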
Bayesian A/B Testing
This framework can be applied to A/B testing. Instead of p-values, we can directly calculate the probability that variant B is better than variant A. We can model our belief about the conversion rate of each variant with a Beta distribution, update it with new data, and then compare the posterior distributions.
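A minimal Monte Carlo sketch of that comparison, using made-up conversion counts (40/1000 visitors for A, 55/1000 for B):

```python
# Bayesian A/B test via posterior sampling; the data is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Flat Beta(1, 1) priors, updated with (conversions, non-conversions):
post_a = rng.beta(1 + 40, 1 + 960, size=100_000)
post_b = rng.beta(1 + 55, 1 + 945, size=100_000)

# P(B > A): fraction of joint posterior samples where B's rate is higher.
print((post_b > post_a).mean())  # ≈ 0.95 for this made-up data
```

The output is a direct, interpretable statement ("there is about a 95% chance B converts better than A"), which is often what the p-value is mistakenly read as.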