hypothesis testing

because they do not actually explain what a p-value is in intro statistics classes

It is my considered opinion as a mathematician that I learned the fundamentals of statistics much better from my Privacy and Fairness class than I did from AP Statistics or Roofon did from her college statistics class. The latter two explained how to perform a bunch of specific hypothesis tests, but only in my Privacy and Fairness class did I learn what a hypothesis test actually is. I figured it was worth writing out this knowledge so I can solidify it a little better.

Apologies to any actual statisticians in my audience.


You can probably skip this part if “a random variable is a function from a sample space Ω to some nice space like a finite set or R or whatever” is good enough for you.

A probability space is a measure space (Ω, P) such that P(Ω) = 1. The set Ω is the sample space, and contains possible samples ω. An event E ⊆ Ω is any measurable subset of Ω. We will denote the set of measurable subsets of Ω with φ(Ω)¹. The measure P, a function P: φ(Ω) → [0, 1], is the probability measure, and determines how probable any given sample ω ∈ Ω or event E ⊆ Ω is. Sometimes we write P[] instead of P() if it looks nicer.

A measurable function is a function such that the preïmages of measurable sets are measurable.

Let 𝓧 be a measure space². An 𝓧-valued random variable is a measurable function X: Ω → 𝓧. Common choices of 𝓧 are finite sets (like {Heads, Tails} or {1, 2, 3, 4, 5, 6}), R, or intervals like [0, 1]. A random variable induces a measure on 𝓧, which we also write P, given by

$$ \begin{align}P(U) &:= P(X^{-1}(U))\\ &=P(\{\omega \in \Omega : X(\omega) \in U\}) \end{align} $$

If 𝓧 is, for example, R, we often write stuff like

$$ \begin{align} P(X \geq x_0) &:= P([x_0, \infty)),\\ P(X = x_0) &:= P(\{x_0\}), \end{align} $$

or in general, P(ψ(X)) := P({x ∈ 𝓧 : ψ(x)}) for any (measurable) predicate ψ: 𝓧 → {0, 1}. This notation can be seen as sort of like conflating a function with the output of that function—similar to how you might think of the formula “sin(x^2)” as referring to the function f such that f(x) = sin(x^2).
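For instance, if X is a fair six-sided die roll, so that 𝓧 = {1, 2, 3, 4, 5, 6} carries the uniform measure, then

$$ P(X \geq 5) = P(\{5, 6\}) = \tfrac{1}{3}, \qquad P(X^2 \leq 9) = P(\{1, 2, 3\}) = \tfrac{1}{2}. $$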


Suppose we have a random variable X: Ω → 𝓧, and we want to use hypothesis testing to learn about X. Often, the random variable is a sequence of individual random variables X = (X1, X2, X3, …, Xn), and so X takes on values x = (x1, x2, x3, …, xn) in 𝓧 = 𝓧1 × 𝓧2 × 𝓧3 × ⋯ × 𝓧n. In the common case that each Xi is real-valued, we have 𝓧 = R^n.

A hypothesis is a proposition about X. Propositions about X can be given by predicates on the set of 𝓧-valued random variables. That is, a hypothesis is a function

$$ H : \{f: \Omega \to \mathcal{X} \mid f \text{ measurable}\} \to \{0, 1\}. $$

The null hypothesis is some specific hypothesis H0.
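For example, “the Xi are n independent fair coin flips” is a hypothesis: it holds for exactly those f whose induced measure on {0, 1}^n is the uniform one. It will reappear as the null hypothesis in the example below.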

A test-statistic is a measurable function T: 𝓧 → Y³, where Y is a measurable ordered⁴ set such as R. We say that x is a more extreme outcome than x’ (according to T) when T(x) > T(x’).
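For instance, if 𝓧 = R^n, the sample mean T(x) = (x1 + ⋯ + xn)/n is a real-valued test-statistic, and so is the sample maximum T(x) = max xi.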

The p-value of a sample x under a test-statistic T is given by

$$ \begin{align} p &:= P[T(X_0) \geq T(x)] \\ &= P[\{ x_0 \in \mathcal{X} : T(x_0) \geq T(x)\}]\\ &= P[\{\omega \in \Omega : T(X_0(\omega)) \geq T(x)\}] \end{align} $$

where X0 is a random variable satisfying H0 (that is, H0(X0) = 1). You want to choose H0 and T such that which X0 you use does not matter—H0 should be enough information to compute a p-value.

In English, the p-value of a sample x is the probability of an outcome at least as extreme as x, assuming that the null hypothesis holds.
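If you can actually sample from some X0 satisfying H0, this description becomes an algorithm: simulate the null distribution of T and count how often it is at least the observed value. A minimal sketch in Python (the function names and numbers here are mine, purely for illustration):

```python
import random

def monte_carlo_p_value(t_obs, sample_null, statistic, trials=100_000):
    """Estimate p = P[T(X0) >= T(x)] by simulating the null distribution."""
    hits = sum(statistic(sample_null()) >= t_obs for _ in range(trials))
    return hits / trials

# Toy example: under H0, X0 is 100 fair coin flips; T counts heads.
n = 100
p_hat = monte_carlo_p_value(
    t_obs=60,  # we observed 60 heads
    sample_null=lambda: [random.random() < 0.5 for _ in range(n)],
    statistic=sum,  # a sum of booleans is a count of heads
)
print(p_hat)  # should land near the exact value, ~0.028
```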

For a given alternative hypothesis HA and significance level α, the power of a hypothesis test is the probability, assuming HA, that p ≤ α. That is, let yα be the smallest element of Y such that

$$ P[\{y \in Y : y \geq y_\alpha\}] \leq \alpha $$

under the null hypothesis, where P here denotes the measure that T(X0) induces on Y. Then, the power is given by

$$ P[\{\omega \in \Omega : T(X_A(\omega)) \geq y_\alpha\}], $$

where XA is a random variable satisfying HA.

You generally want your hypothesis test to be as powerful as possible given a particular value of α.
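To make that concrete with invented numbers: suppose H0 is “fair coin,” HA is “heads with probability 0.7,” T counts heads in n = 100 flips, and α = 0.05. A sketch of the power computation (all names and numbers are mine):

```python
from math import comb

def binom_tail(n, p, y):
    """P[T >= y] when T is Binomial(n, p)-distributed."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y, n + 1))

def power(n, p_alt, alpha=0.05):
    # y_alpha: the smallest threshold whose tail probability under H0 is <= alpha.
    y_alpha = next(y for y in range(n + 2) if binom_tail(n, 0.5, y) <= alpha)
    # Power: the probability of clearing that threshold under HA.
    return binom_tail(n, p_alt, y_alpha)

print(power(100, 0.7))  # ~0.99: such a biased coin is very likely detected
```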

Example

Suppose that a gamer, Dream, might be cheating in a Minecraft speedrun. Specifically, you have a livestream VOD, and you want to check if Dream’s blaze rod drop rate has been altered.

The null hypothesis is that the drop rate is normal, 1/2. More specifically, it is that the blaze rod drops are some Bernoulli process X0 where each trial has probability 1/2⁵.

The sample x = (x1, x2, x3, …, xn) is a boolean sequence where xi = 1 if the ith killed blaze dropped a rod, and 0 if it did not.

The test-statistic T: {0, 1}^n → N is given by T(x) = ∑xi, and measures the number of dropped blaze rods.

The p-value is given by

$$ \begin{align} p &= P[T(X_0) \geq T(x)]\\ &= \sum_{k=T(x)}^n P[T(X_0) = k]\\ &= \sum_{k=T(x)}^n \binom{n}{k}\, 2^{-n}. \end{align} $$

If p is sufficiently small, you should be suspicious of the null hypothesis.
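The last line of that computation is easy to check numerically. A quick sketch (the counts fed in are invented for illustration, not Dream’s actual numbers):

```python
from math import comb

def blaze_rod_p_value(n, rods):
    """Exact p-value: P[T(X0) >= rods] when T(X0) is Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(rods, n + 1)) / 2**n

print(blaze_rod_p_value(100, 60))  # ~0.028: a little suspicious
print(blaze_rod_p_value(100, 80))  # ~6e-10: extremely suspicious
```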

You can use literally whatever the hell test-statistic you want⁶, so long as you know how to correctly compute a p-value given a null hypothesis. The specific hypothesis tests they tell you about in school are just a collection of convenient ones for commonly occurring types of random variables and null hypotheses.

  1. Because φ is kind of like p, and φ(Ω) is kind of like the power set of Ω, and I don’t feel like assigning every event space a whole-ass calligraphic letter.

  2. Technically, it only needs to be a measurable space, because we don’t need to assume it has a measure yet—just that it has like, a nice topology or some other structure that lets us talk about which subsets of it are measurable.

  3. Usually you want a function T that’s defined regardless of the sample size n, so really its domain can be a superset of 𝓧.

  4. Probably a total preörder is good enough? I would find it amusing to try to do hypothesis testing with only a partial order.

    Slightly trollish definition that might work: a one-sided test-statistic is a measurable function T: 𝓧 → R, where R is given the partial order where 0 < x for all x, positive and negative numbers are incomparable to each other, and x < x’ if |x| < |x’| when x and x’ have the same sign. The corresponding two-sided test-statistic uses the order where x ≤ x’ whenever |x| ≤ |x’|, regardless of sign.

  5. We, in fact, know that the null hypothesis is false. This is why we must adjust for putative Shifty Sams and sample biases and so forth.

  6. Though you ought to choose your test-statistic before you observe x, or if that fails maybe make an effort to pick a Schelling hypothesis test—broadly, the point is to avoid giving yourself many degrees of freedom for p-hacking.