Statistics: Hypothesis Testing

Hypothesis testing is a disciplined framework for adjudicating whether observed data are inconsistent with a given hypothesis.

Consider an unknown distribution from which we will observe $n$ samples $X_1, \ldots, X_n$.

We state a hypothesis $H_0$, called the null hypothesis, about the distribution.
We come up with a test statistic $T$, which is a function of the data $X_1, \ldots, X_n$, for which we can evaluate the distribution of $T$ assuming the null hypothesis.
We give an alternative hypothesis $H_a$ under which $T$ is expected to be significantly different from its value under $H_0$.
We give a significance level $\alpha$ (like 5% or 1%), and based on $H_a$ we determine a set of values for $T$, called the critical region, which $T$ would be in with probability at most $\alpha$ under the null hypothesis.
After setting $H_0$, $H_a$, $\alpha$, $T$, and the critical region, we run the experiment, evaluate $T$ on the samples we get, and record the result as $t_{\text{obs}}$.
If $t_{\text{obs}}$ falls in the critical region, we reject the null hypothesis. The corresponding p-value is defined to be the minimum $\alpha$-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.

Example
Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first.

We posit a null hypothesis that she isn't able to discern the pouring method, and an alternative hypothesis that she can tell the difference. How many cups does she have to identify correctly to reject the null hypothesis with 95% confidence?

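Solution. Under the null hypothesis, her choice of which four cups are milk-first is equally likely to be any of the $\binom{8}{4} = 70$ possibilities, so the number of milk-first cups identified correctly is 4 with probability $1/70 \approx 1.4\%$ and at least 3 with probability $17/70 \approx 24\%$. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The $p$-value in that case would be 1.4%.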
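The two probabilities in this solution can be checked by direct counting. The sketch below assumes that a taster with no ability picks which four cups are milk-first uniformly at random; the helper name p_exact is introduced here purely for illustration.

```julia
# Probability of correctly picking exactly k of the 4 milk-first cups
# when 4 of the 8 cups are chosen uniformly at random (pure guessing).
p_exact(k) = binomial(4, k) * binomial(4, 4 - k) / binomial(8, 4)

p_all = p_exact(4)                        # all four milk-first cups correct
p_at_least_3 = p_exact(3) + p_exact(4)    # three or four correct

println((p_all, p_at_least_3))
```

Here p_all is 1/70 and p_at_least_3 is 17/70, which is why three correct cups out of four would not be enough to reject at the 5% level.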

Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A $p$-value may be high either because the null hypothesis is true or because the test has low power.
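To make the notion of power concrete, here is a minimal sketch for a made-up test that is separate from the tea example: flip a coin 20 times, take "the coin is fair" as the null hypothesis, and reject if at least 15 flips come up heads. The design (20 flips, cutoff 15, an alternative with heads-probability 0.7) and the function names are assumptions chosen only for illustration.

```julia
# Probability of exactly k successes in n independent trials with success probability p
binom_pmf(n, k, p) = binomial(n, k) * p^k * (1 - p)^(n - k)

# Probability that the test rejects, i.e. that at least `cutoff` of the n flips are heads
reject_prob(n, cutoff, p) = sum(binom_pmf(n, k, p) for k in cutoff:n)

n_flips, cutoff = 20, 15                             # hypothetical design
false_rejection = reject_prob(n_flips, cutoff, 0.5)  # rejection probability under the null
power = reject_prob(n_flips, cutoff, 0.7)            # rejection probability under one alternative

println((false_rejection, power))
```

Under these assumptions the test rejects a fair coin about 2% of the time, but it rejects a coin with heads-probability 0.7 only about 42% of the time, so a failure to reject would be unsurprising even if the coin were biased to that degree.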

The Wald test and the t-test

Definition
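The Wald test is based on the normal approximation. Consider a null hypothesis $H_0: \theta = 0$ and the alternative hypothesis $H_a: \theta \neq 0$, and suppose that an estimator $\hat{\theta}$ of $\theta$ is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if $|\hat{\theta} / \operatorname{se}(\hat{\theta})| > 1.96$, where $\operatorname{se}(\hat{\theta})$ is the (estimated) standard error of $\hat{\theta}$.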
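As a small sketch of this decision rule: the test divides the estimate by its standard error and rejects when the absolute value of the ratio exceeds 1.96. The function names and the numeric inputs below are hypothetical and chosen only to exercise the rule.

```julia
# Wald statistic: the estimate divided by its (estimated) standard error
wald_statistic(estimate, se) = estimate / se

# Two-sided decision at the 5% level, using the normal critical value 1.96
wald_rejects(estimate, se; critical = 1.96) = abs(wald_statistic(estimate, se)) > critical

# Hypothetical inputs
println(wald_rejects(1.2, 0.5))   # statistic is 2.4
println(wald_rejects(0.6, 0.5))   # statistic is 1.2
```

The keyword argument critical is an illustrative way to allow other significance levels; 1.96 corresponds to the 5% level used in the definition.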

Example
Consider the alternative hypothesis that 8-cylinder engines have lower fuel economy than 6-cylinder engines (with null hypothesis that they are the same). Apply the Wald test, using the data below from the R dataset mtcars.

six_cyl_mpgs = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight_cyl_mpgs = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]

Solution. We frame the problem as a question about whether the difference in means between the distribution of 8-cylinder mpg values and the distribution of 6-cylinder mpg values is zero. We use the difference between the sample means $\overline{X}_6$ and $\overline{X}_8$ of the two populations (6-cylinder and 8-cylinder, respectively) as an estimator of the difference in means. If we think of the records in the data frame as independent, then $\overline{X}_6$ and $\overline{X}_8$ are independent. Since each is approximately normally distributed by the central limit theorem, their difference is therefore also approximately normal. So, let's calculate the sample mean and sample variance for the 8-cylinder cars and for the 6-cylinder cars.

using Statistics
six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)

library(tidyverse)

stats <- mtcars %>%
  group_by(cyl) %>%
  filter(cyl %in% c(6,8)) %>%
  summarise(m = mean(mpg), S2 = var(mpg), n = n(), se = sqrt(S2/n))

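Given that the distribution of 8-cylinder mpg values has variance $\sigma_8^2$, the variance of the sample mean $\overline{X}_8$ is $\sigma_8^2/n_8$, where $n_8$ is the number of 8-cylinder vehicles (and similarly for the 6-cylinder sample mean $\overline{X}_6$). Therefore, we estimate the variance of the difference in sample means as

$$\widehat{\operatorname{Var}}(\overline{X}_6 - \overline{X}_8) = \frac{S_6^2}{n_6} + \frac{S_8^2}{n_8},$$

where $S_6^2$ and $S_8^2$ are the sample variances of the two groups.

Under the null hypothesis, therefore, $\overline{X}_6 - \overline{X}_8$ has mean zero and estimated standard error $\sqrt{S_6^2/n_6 + S_8^2/n_8}$. We therefore reject the null hypothesis with 95% confidence if the value of $\overline{X}_6 - \overline{X}_8$ divided by its estimated standard error exceeds 1.96. We find that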

z <- (stats$m[1] - stats$m[2]) / sqrt(sum(stats$se^2))

z = (m₁ - m₂) / sqrt(s₁^2/n₁ + s₂^2/n₂)

returns approximately 5.29, so we do reject the null hypothesis at the 95% confidence level. The $p$-value of this test is 1-cdf(Normal(0,1),z), which is about $6 \times 10^{-8}$.

The Wald test can be overconfident because it doesn't account for the fact that the standard deviation values are estimated from the data:

Exercise
Experiment with the code block below to see how, even when the initial distribution is normal, standardizing the mean using the estimated standard deviation results in a non-normal distribution. How is this distribution different? Around what value of $n$ does the graph become visually indistinguishable from the normal distribution (in this visualization)?

using Plots, Distributions
n = 6
μ = 3
sample(n) = [μ + 0.5randn() for _ in 1:n]
standardize(X) = (mean(X) - μ)/(std(X)/√(length(X)))
histogram([standardize(sample(n)) for _ in 1:1_000_000],
           xlims = (-6,6), normed=true, label="standardized mean")
plot!(-6:0.05:6, x-> pdf(Normal(0,1),x), linewidth = 3,
      label = "standard normal density", opacity = 0.75)

Solution. The distribution of the standardized mean apparently has heavier tails than the normal distribution. Based on the graph, it appears that this effect is more noticeable for $n$ less than 30 than for $n$ greater than 100. (Both of these numbers are arbitrary; the main point is that it doesn't take huge values of $n$ for the distribution to start looking fairly normal.)

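If $X_1, \ldots, X_n$ is a sequence of independent normal random variables with mean $\mu$ and variance $\sigma^2$, let's define $\overline{X}$ to be the average of the $X_i$'s, and $S^2$ to be the sample variance, so $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$. Then the distribution of $\frac{\overline{X} - \mu}{S/\sqrt{n}}$ is called the t-distribution with $n-1$ degrees of freedom.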

Exercise
Use your knowledge of the t-distribution to test the hypothesis that the distribution used to generate the following list of numbers has mean greater than 4.

Note: you can create an object to represent the t-distribution with ν degrees of freedom using the expression TDist(ν). To evaluate its cumulative distribution function at x, use cdf(TDist(ν), x).

X = [4.1, 5.12, 3.39, 4.97, 3.07, 4.17, 4.46, 5.53, 3.28, 3.62]

Solution. We define the statistic $T = \frac{\overline{X} - 4}{S/\sqrt{n}}$, which under the null hypothesis (that the mean is 4) is $t$-distributed with 9 degrees of freedom. We compute

t = (mean(X) - 4) / (std(X)/sqrt(length(X)))

which is approximately 0.64, and then

1 - cdf(TDist(length(X)-1), t)

which is about 27%. So we are not able to reject the null hypothesis at the 5% significance level.

There are a variety of $t$-tests, including one appropriate to the mpg problem discussed above:

Exercise
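Redo the mpg problem above with Welch's t-test instead of the Wald test. This test says that the statistic

$$t = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

is, under the null hypothesis, approximately $t$-distributed with

$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

degrees of freedom, where $m_i$, $s_i$, and $n_i$ are the sample mean, sample standard deviation, and sample size of group $i$.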

using Statistics
six = [21.0, 21.0, 21.4, 18.1, 19.2, 17.8, 19.7]
eight = [18.7, 14.3, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 15.5, 15.2, 13.3, 19.2, 15.8, 15.0]
m₁, m₂ = mean(six), mean(eight)
s₁, s₂ = std(six), std(eight)
n₁, n₂ = length(six), length(eight)

Solution. We calculate

a = s₁^2/n₁
b = s₂^2/n₂
ν = (a + b)^2 / (a^2/(n₁-1) + b^2/(n₂-1))
t = (m₁ - m₂)/sqrt(a+b)
ccdf(TDist(ν), t)

(Note that ccdf is the same as 1-cdf.) The value returned is about $2.3 \times 10^{-5}$, so we still reject the null hypothesis, but the $p$-value is higher than what we got previously.

Random Permutation Test

The following test is more flexible than the Wald test, since it doesn't rely on the normal approximation. It's based on a simple idea: if there's no difference in labels, the data shouldn't look very different if we shuffle them around.

Definition
The random permutation test is applicable when the null hypothesis is that two distributions are the same.

We compute the difference between the sample means for the two groups.
We randomly re-assign the group labels and compute the resulting sample mean differences. Repeat many times.
We check where the original difference falls in the sorted list of re-sampled differences.

Example
Suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the height distributions for the two families are the same, with the alternative hypothesis that they are not. Determine whether a random permutation test applied to the absolute sample mean difference rejects the null hypothesis at significance level $\alpha = 5\%$.

Solution. We find that the absolute sample mean difference of about 2.4 inches is larger than only about 68% of the mean differences obtained by resampling many times.

set.seed(123)
romero <- c(72, 69, 68, 66)
larsen <- c(70, 65, 64)
actual.diff <- abs(mean(romero) - mean(larsen))

resample.diff <- function(n) {
  shuffled <- sample(c(romero,larsen))
  abs(mean(shuffled[1:4]) - mean(shuffled[5:7]))
}

# proportion of shuffled differences strictly below the observed one
mean(sapply(1:10000, resample.diff) < actual.diff)

Since 68% < 95%, we retain the null hypothesis.
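Because there are only $\binom{7}{3} = 35$ ways to choose which three of the seven heights receive the Larsen label, the random shuffling can also be replaced by a full enumeration. The following is a sketch of that exact variant, which is not the random procedure described in the definition; the variable names are illustrative.

```julia
using Statistics

heights = [72, 69, 68, 66, 70, 65, 64]   # Romero sons first, then Larsen sons
actual_diff = abs(mean(heights[1:4]) - mean(heights[5:7]))

n = length(heights)
diffs = Float64[]
# Enumerate every way to pick the three positions that get the "Larsen" label
for i in 1:n-2, j in i+1:n-1, k in j+1:n
    group_of_3 = heights[[i, j, k]]
    group_of_4 = heights[setdiff(1:n, [i, j, k])]
    push!(diffs, abs(mean(group_of_4) - mean(group_of_3)))
end

# Proportion of relabelings with a strictly smaller absolute difference
prop_below = mean(diffs .< actual_diff)
println((length(diffs), prop_below))
```

Under this enumeration, 24 of the 35 relabelings give a strictly smaller absolute difference than the observed one, in line with the roughly 68% found by random shuffling.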

Multiple testing

If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high. This is called the multiple testing problem.

The Bonferroni method is to reject the null hypothesis only for those tests whose $p$-values are less than $\alpha$ divided by the number of hypothesis tests being run. This ensures that the probability of having even one false rejection is less than $\alpha$, so it is very conservative.
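A minimal sketch of the Bonferroni rule follows. The p-values in it are made up for illustration and are not taken from any example in this section; the variable names are likewise assumptions.

```julia
# Hypothetical p-values, invented purely for illustration
pvals = [0.001, 0.012, 0.03, 0.2, 0.004]
α = 0.05

# Bonferroni threshold: the significance level divided by the number of tests
threshold = α / length(pvals)

# Indices of the tests whose p-values fall below the threshold
significant = findall(p -> p < threshold, pvals)
println((threshold, significant))
```

With five tests the threshold is 1%, so only the first and fifth of these hypothetical tests would be reported as significant, even though four of the five p-values are below 5%.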

Example
Suppose that 10 different genes are tested to determine whether they have an effect on heart disease. The 10 $p$-values resulting from these hypothesis tests are (rounded to the nearest hundredth of a percent):

Which results are reported as significant at the 5% level, according to the Bonferroni method?

Solution. At the 5% level, only values less than 5%/10 = 0.5% are reported as significant (since we ran ten hypothesis tests). Since none of the values are below 0.5%, none of the genes will be considered significant.

Hypothesis testing is often viewed by learners of statistics as potentially misleading. In fact, this thought is not uncommon among professional statisticians and other scientists as well. See, for example, this comment in Nature, which was part of a widespread discussion of $p$-values in the statistics community in early 2019.

Despite these concerns, it's useful to understand the basics of hypothesis testing, because it remains a widely used framework, and conveys a critical lesson about the hazards of extracting hypotheses from data rather than the other way around (using data to scrutinize hypotheses).
