hypothesis testing

because they do not actually explain what a p-value is in intro statistics classes

It is my considered opinion as a mathematician that I learned the fundamentals of statistics much better from my Privacy and Fairness class than I did from AP Statistics or Roofon did from her college statistics class. The latter two explained how to perform a bunch of specific hypothesis tests, but only in my Privacy and Fairness class did I learn what a hypothesis test actually is. I figured it was worth writing out this knowledge so I can solidify it a little better.

Apologies to any actual statisticians in my audience.


You can probably skip this part if “a random variable is a function from a sample space Ω to some nice space like a finite set or R or whatever” is good enough for you.

A probability space is a measure space (Ω, P) such that P(Ω) = 1. The set Ω is the sample space, and contains possible samples ω. An event E ⊆ Ω is any measurable subset of Ω. We will denote the set of measurable subsets of Ω with φ(Ω)¹. The measure P, a function P: φ(Ω) → [0, 1], is the probability measure, and determines how probable any given sample ω ∈ Ω or event E ⊆ Ω is. Sometimes we write P[] instead of P() if it looks nicer.

A measurable function is a function such that the preïmages of measurable sets are measurable.

Let 𝓧 be a measure space². An 𝓧-valued random variable is a measurable function X: Ω → 𝓧. Common choices of 𝓧 are finite sets (like {Heads, Tails} or {1, 2, 3, 4, 5, 6}), R, or intervals like [0, 1]. A random variable induces a measure on 𝓧, which we also write P, given by

$$ \begin{align}P(U) &:= P(X^{-1}(U))\\ &=P(\{\omega \in \Omega : X(\omega) \in U\}) \end{align} $$

If 𝓧 is, for example, R, we often write stuff like

$$ \begin{align} P(X \geq x_0) &:= P([x_0, \infty)),\\ P(X = x_0) &:= P(\{x_0\}), \end{align} $$

or in general, P(ψ(X)) := P({x ∈ 𝓧 : ψ(x)}) for any (measurable) predicate ψ: 𝓧 → {0, 1}. This notation can be seen as sort of like conflating a function with the output of that function—similar to how you might think of the formula “sin(x^2)” as referring to the function f such that f(x) = sin(x^2).
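For instance, if X is a fair six-sided die roll, so that 𝓧 = {1, 2, 3, 4, 5, 6} carries the uniform measure, then

$$ P(X \geq 5) = P(\{5, 6\}) = \tfrac{1}{3}, \qquad P(X^2 \leq 9) = P(\{1, 2, 3\}) = \tfrac{1}{2}. $$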


Suppose we have a random variable X: Ω → 𝓧, and we want to use hypothesis testing to learn about X. Often, the random variable is a sequence of individual random variables X = (X1, X2, X3, …, Xn), and so X takes on values x = (x1, x2, x3, …, xn) in 𝓧 = 𝓧1 × 𝓧2 × 𝓧3 × ⋯ × 𝓧n. In the common case that each Xi is real-valued, we have 𝓧 = R^n.

A hypothesis is a proposition about X. Propositions about X can be given by predicates on the set of 𝓧-valued random variables. That is, a hypothesis is a function

$$ H : \{f: \Omega \to \mathcal{X} \mid f \text{ measurable}\} \to \{0, 1\}. $$

The null hypothesis is some specific hypothesis H0.
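For example, “the Xi are n independent fair coin flips” is a hypothesis: it holds for exactly those f whose induced measure on {0, 1}^n is the uniform one. It will reappear as the null hypothesis in the example below.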

A test-statistic is a measurable function T: 𝓧 → Y³, where Y is a measurable ordered⁴ set such as R. We say that x is a more extreme outcome than x’ (according to T) when T(x) > T(x’).
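For instance, if 𝓧 = R^n, the sample mean T(x) = (x1 + ⋯ + xn)/n is a real-valued test-statistic, and so is the sample maximum T(x) = max xi.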

The p-value of a sample x under a test-statistic T is given by

$$ \begin{align} p &:= P[T(X_0) \geq T(x)] \\ &= P[\{ x_0 \in \mathcal{X} : T(x_0) \geq T(x)\}]\\ &= P[\{\omega \in \Omega : T(X_0(\omega)) \geq T(x)\}] \end{align} $$

where X0 is a random variable satisfying H0 (that is, H0(X0) = 1). You want to choose H0 and T such that which X0 you use does not matter—H0 should be enough information to compute a p-value.

In English, the p-value of a sample x is the probability of an outcome at least as extreme as x, assuming that the null hypothesis holds.
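If you can actually sample from some X0 satisfying H0, this description becomes an algorithm: simulate the null distribution of T and count how often it is at least the observed value. A minimal sketch in Python (the function names and numbers here are mine, purely for illustration):

```python
import random

def monte_carlo_p_value(t_obs, sample_null, statistic, trials=100_000):
    """Estimate p = P[T(X0) >= T(x)] by simulating the null distribution."""
    hits = sum(statistic(sample_null()) >= t_obs for _ in range(trials))
    return hits / trials

# Toy example: under H0, X0 is 100 fair coin flips; T counts heads.
n = 100
p_hat = monte_carlo_p_value(
    t_obs=60,  # we observed 60 heads
    sample_null=lambda: [random.random() < 0.5 for _ in range(n)],
    statistic=sum,  # a sum of booleans is a count of heads
)
print(p_hat)  # should land near the exact value, ~0.028
```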

For a given alternative hypothesis HA and significance level α, the power of a hypothesis test is the probability, assuming HA, that p ≤ α. That is, let yα be the smallest element of Y such that

$$ P[\{y \in Y : y \geq y_\alpha\}] \leq \alpha $$

under the null hypothesis, where P here denotes the measure that T(X0) induces on Y. Then, the power is given by

$$ P[\{\omega \in \Omega : T(X_A(\omega)) \geq y_\alpha\}], $$

where XA is a random variable satisfying HA.

You generally want your hypothesis test to be as powerful as possible given a particular value of α.
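To make that concrete with invented numbers: suppose H0 is “fair coin,” HA is “heads with probability 0.7,” T counts heads in n = 100 flips, and α = 0.05. A sketch of the power computation (all names and numbers are mine):

```python
from math import comb

def binom_tail(n, p, y):
    """P[T >= y] when T is Binomial(n, p)-distributed."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y, n + 1))

def power(n, p_alt, alpha=0.05):
    # y_alpha: the smallest threshold whose tail probability under H0 is <= alpha.
    y_alpha = next(y for y in range(n + 2) if binom_tail(n, 0.5, y) <= alpha)
    # Power: the probability of clearing that threshold under HA.
    return binom_tail(n, p_alt, y_alpha)

print(power(100, 0.7))  # ~0.99: such a biased coin is very likely detected
```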

Example

Suppose that a gamer, Dream, might be cheating in a Minecraft speedrun. Specifically, you have a livestream VOD, and you want to check if Dream’s blaze rod drop rate has been altered.

The null hypothesis is that the drop rate is normal, 1/2. More specifically, it is that the blaze rod drops are some Bernoulli process X0 where each trial has probability 1/2⁵.

The sample x = (x1, x2, x3, …, xn) is a boolean sequence where xi = 1 if the ith killed blaze dropped a rod, and 0 if it did not.

The test-statistic T: {0, 1}^n → N is given by T(x) = ∑xi, and measures the number of dropped blaze rods.

The p-value is given by

$$ \begin{align} p &= P[T(X_0) \geq T(x)]\\ &= \sum_{k=T(x)}^n P[T(X_0) = k]\\ &= \sum_{k=T(x)}^n \binom{n}{k}\, 2^{-n}. \end{align} $$

If p is sufficiently small, you should be suspicious of the null hypothesis.
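The last line of that computation is easy to check numerically. A quick sketch (the counts fed in are invented for illustration, not Dream’s actual numbers):

```python
from math import comb

def blaze_rod_p_value(n, rods):
    """Exact p-value: P[T(X0) >= rods] when T(X0) is Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(rods, n + 1)) / 2**n

print(blaze_rod_p_value(100, 60))  # ~0.028: a little suspicious
print(blaze_rod_p_value(100, 80))  # ~6e-10: extremely suspicious
```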

You can use literally whatever the hell test-statistic you want⁶, so long as you know how to correctly compute a p-value given a null hypothesis. The specific hypothesis tests they tell you about in school are just a collection of convenient ones for commonly occurring types of random variables and null hypotheses.

  1. Because φ is kind of like p, and φ(Ω) is kind of like the power set of Ω, and I don’t feel like assigning every event space a whole-ass calligraphic letter.

  2. Technically, it only needs to be a measurable space, because we don’t need to assume it has a measure yet—just that it has like, a nice topology or some other structure that lets us talk about which subsets of it are measurable.

  3. Usually you want a function T that’s defined regardless of the sample size n, so really its domain can be a superset of 𝓧.

  4. Probably a total preörder is good enough? I would find it amusing to try to do hypothesis testing with only a partial order.

    Slightly trollish definition that might work: a one-sided test-statistic is a measurable function T: 𝓧 → R, where R is given the partial order where 0 < x for all x, positive and negative numbers are incomparable to each other, and x < x’ if |x| < |x’| when x and x’ have the same sign. The corresponding two-sided test-statistic uses the order where x ≤ x’ whenever |x| ≤ |x’|, regardless of sign.

  5. We, in fact, know that the null hypothesis is false. This is why we must adjust for putative Shifty Sams and sample biases and so forth.

  6. Though you ought to choose your test-statistic before you observe x, or if that fails maybe make an effort to pick a Schelling hypothesis test—broadly, the point is to avoid giving yourself many degrees of freedom for p-hacking.