hypothesis testing
because they do not actually explain what a p-value is in intro statistics classes
It is my considered opinion as a mathematician that I learned the fundamentals of statistics much better from my Privacy and Fairness class than I did from AP Statistics, or than Roofon did from her college statistics class. The latter two explained how to perform a bunch of specific hypothesis tests, but only in my Privacy and Fairness class did I learn what a hypothesis test actually is. I figured it was worth writing out this knowledge so I can solidify it a little better.
Apologies to any actual statisticians in my audience.
You can probably skip this part if “a random variable is a function from a sample space Ω to some nice space like a finite set or R or whatever” is good enough for you.
A probability space is a measure space (Ω, P) such that P(Ω) = 1. The set Ω is the sample space, and contains possible samples ω. An event E ⊆ Ω is any measurable subset of Ω. We will denote the set of measurable subsets of Ω with φ(Ω) (because φ is kind of like p, and φ(Ω) is kind of like the power set of Ω, and I don’t feel like assigning every event space a whole-ass calligraphic letter). The measure P, a function P: φ(Ω) → [0, 1], is the probability measure, and determines how probable any given sample ω ∈ Ω or event E ⊆ Ω is. Sometimes we write P[] instead of P() if it looks nicer.
A measurable function is a function such that the preïmages of measurable sets are measurable.
Let 𝓧 be a measure space (technically, it only needs to be a measurable space, because we don’t need to assume it has a measure yet, just that it has like, a nice topology or some other structure that lets us talk about which subsets of it are measurable). An 𝓧-valued random variable is a measurable function X: Ω → 𝓧. Common choices of 𝓧 are finite sets (like {Heads, Tails} or {1, 2, 3, 4, 5, 6}), R, or intervals like [0, 1]. A random variable induces a measure on 𝓧, which we also write P, given by

P(A) = P({ω ∈ Ω : X(ω) ∈ A})

for any measurable A ⊆ 𝓧.
If 𝓧 is, for example, R, we often write stuff like

P(X > 3) = P({x ∈ 𝓧 : x > 3}),

or in general, P(ψ(X)) = P({x ∈ 𝓧 : ψ(x)}) for any (measurable) predicate ψ: 𝓧 → {0, 1}. This notation can be seen as sort of conflating a function with the output of that function, similar to how you might think of the formula “sin(x^2)” as referring to the function f such that f(x) = sin(x^2).
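If it helps to see these definitions executed, here is a minimal Python sketch for a finite Ω; the representation (a dict from samples to probabilities) and every name in it are mine, picked for concreteness rather than taken from any library.

```python
from fractions import Fraction

# A finite probability space: the keys of P form the sample space Omega,
# and P assigns each sample omega its probability (they must sum to 1).
P = {omega: Fraction(1, 6) for omega in range(1, 7)}  # a fair die

# A random variable X: Omega -> {"low", "high"} is just a function.
def X(omega):
    return "high" if omega > 3 else "low"

def prob(event):
    """Probability of an event, i.e. a subset of Omega (all subsets of a
    finite sample space are measurable)."""
    return sum(P[omega] for omega in event)

def induced_prob(A):
    """The induced measure on the target: P(A) = P({omega : X(omega) in A})."""
    return prob({omega for omega in P if X(omega) in A})

print(induced_prob({"high"}))  # 1/2, i.e. P(X = "high") in the notation above
```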
Suppose we have a random variable X: Ω → 𝓧, and we want to use hypothesis testing to learn about X. Often, the random variable is a sequence of individual random variables X = (X1, X2, X3, …, Xn), and so X takes on values x = (x1, x2, x3, …, xn) in 𝓧 = 𝓧1 × 𝓧2 × 𝓧3 × ⋯ × 𝓧n. In the common case that each Xi is real-valued, we have 𝓧 = R^n.
A hypothesis is a proposition about X. Propositions about X can be given by predicates on the set of 𝓧-valued random variables. That is, a hypothesis is a function

H: {random variables X: Ω → 𝓧} → {0, 1}.
The null hypothesis is some specific hypothesis H0.
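As a toy illustration of “a hypothesis is a predicate on random variables”: if we represent a {0, 1}^n-valued random variable by its exact distribution (a dict from outcomes to probabilities; the representation and names are made up for this sketch), then a hypothesis is literally a boolean-valued function of that representation.

```python
from fractions import Fraction
from itertools import product

def bernoulli_process(q, n):
    """Distribution of n independent flips, each coming up 1 with probability q."""
    return {x: q ** sum(x) * (1 - q) ** (n - sum(x))
            for x in product((0, 1), repeat=n)}

def h0(dist):
    """Null hypothesis: all outcomes equally likely, i.e. independent fair flips."""
    n = len(next(iter(dist)))
    return all(p == Fraction(1, 2) ** n for p in dist.values())

print(h0(bernoulli_process(Fraction(1, 2), 3)))  # True
print(h0(bernoulli_process(Fraction(2, 3), 3)))  # False
```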
A test-statistic is a measurable function T: 𝓧 → Y, where Y is a measurable ordered set such as R. (Usually you want a function T that’s defined regardless of the sample size n, so really its domain can be a superset of 𝓧. Also, probably a total preörder is good enough for Y? I would find it amusing to try to do hypothesis testing with only a partial order. Slightly trollish definition that might work: a one-sided test-statistic is a measurable function T: 𝓧 → R, where R is given the partial order in which 0 < x for all x, positive and negative numbers are incomparable to each other, and x < x’ if |x| < |x’| when x and x’ have the same sign. The corresponding two-sided test-statistic uses the order where x ≤ x’ whenever |x| ≤ |x’|, regardless of sign.) We say that x is a more extreme outcome than x’ (according to T) when T(x) > T(x’).
The p-value of a sample x under a test-statistic T is given by

p(x) = P[T(X0) ≥ T(x)],

where X0 is a random variable compatible with H0, that is, one with H0(X0) = 1. You want to choose H0 and T such that which X0 you use does not matter: H0 should be enough information to compute a p-value.
In English, the p-value of a sample x is the probability of an outcome at least as extreme as x, assuming that the null hypothesis holds.
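This definition is directly computable whenever you can sample from some X0 compatible with H0: simulate the null world many times and count how often it produces something at least as extreme as what you saw. A Monte Carlo sketch (the function names are mine):

```python
import random

def monte_carlo_p_value(x, T, sample_X0, trials=50_000):
    """Estimate p(x) = P[T(X0) >= T(x)] by simulating X0 under H0."""
    t_observed = T(x)
    hits = sum(T(sample_X0()) >= t_observed for _ in range(trials))
    return hits / trials

# Toy example: H0 says X0 is 100 fair coin flips, and T counts heads.
sample_X0 = lambda: [random.randint(0, 1) for _ in range(100)]
print(monte_carlo_p_value([1] * 60 + [0] * 40, sum, sample_X0))  # about 0.028
```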
For a given alternative hypothesis HA and significance level α, the power of a hypothesis test is the probability, assuming HA, that p < α. That is, let yα be the largest element of Y such that

P[T(X0) ≥ yα] ≥ α

under the null hypothesis. Then, the power is given by

P[T(XA) > yα],

where XA is a random variable compatible with HA.
You generally want your hypothesis test to be as powerful as possible given a particular value of α.
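For the coin-flip setting, both yα and the power can be computed exactly from binomial tails, which makes the definition above concrete. A sketch, assuming a one-sided test with T = number of heads (the function names are mine):

```python
from math import comb

def tail(n, q, y):
    """P[Binomial(n, q) >= y]."""
    return sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(y, n + 1))

def power(n, q_alt, alpha=0.05):
    """Power against H_A (coin bias q_alt) of the test with T = number of heads."""
    # y_alpha: the largest y with P[T(X0) >= y] >= alpha under H0 (q = 1/2).
    y_alpha = max(y for y in range(n + 1) if tail(n, 0.5, y) >= alpha)
    # The test rejects (p < alpha) exactly when T(x) > y_alpha.
    return tail(n, q_alt, y_alpha + 1)

print(power(100, 0.6))  # about 0.62: at alpha = 0.05, 100 flips detect a
                        # 60% coin only about 62% of the time
```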
Example
Suppose that a gamer, Dream, might be cheating in a Minecraft speedrun. Specifically, you have a livestream VOD, and you want to check if Dream’s blaze rod drop rate has been altered.
The null hypothesis is that the drop rate is normal, 1/2. More specifically, it is that the blaze rod drops are some Bernoulli process X0 where each trial has probability 1/2. (We, in fact, know that the null hypothesis is false. This is why we must adjust for putative Shifty Sams and sample biases and so forth.)
The sample x = (x1, x2, x3, …, xn) is a boolean sequence where xi = 1 if the ith killed blaze dropped a rod, and 0 if it did not.
The test-statistic T: 2^n → N is given by T(x) = ∑xi, and measures the number of dropped blaze rods.
The p-value is given by

p(x) = P[T(X0) ≥ T(x)] = ∑_{k = T(x)}^{n} C(n, k) / 2^n,

where C(n, k) is the binomial coefficient.
If p is sufficiently small, you should be suspicious of the null hypothesis.
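In this case the p-value is an exact binomial tail, so it’s a few lines of Python. The counts below are hypothetical, made up for illustration rather than taken from any actual VOD:

```python
from math import comb

def blaze_rod_p_value(x):
    """Exact p-value: under H0, T(X0) ~ Binomial(n, 1/2)."""
    n, t = len(x), sum(x)
    return sum(comb(n, k) for k in range(t, n + 1)) / 2**n

# Hypothetical sample: 42 rod drops out of 60 blaze kills.
x = [1] * 42 + [0] * 18
print(blaze_rod_p_value(x))  # about 0.001 -- suspicious
```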
You can use literally whatever the hell test-statistic you want (though you ought to choose your test-statistic before you observe x, or if that fails maybe make an effort to pick a Schelling hypothesis test; broadly, the point is to avoid giving yourself many degrees of freedom for p-hacking), so long as you know how to correctly compute a p-value given a null hypothesis. The specific hypothesis tests they tell you about in school are just a collection of convenient ones for commonly occurring types of random variables and null hypotheses.