Statistical Hypothesis Testing
Hypothesis testing is the theory of deciding between competing statements about a population from data, while controlling the chance of each kind of error.
Definition
A statistical hypothesis test is a rule that uses sample data to decide whether to reject a null hypothesis in favor of an alternative, designed so that the probability of wrongly rejecting a true null is bounded by a chosen significance level.
Scope
This area covers the formulation of null and alternative hypotheses, the two types of error and the size and power of a test, the Neyman-Pearson lemma for the most powerful test of simple hypotheses, monotone likelihood ratio and uniformly most powerful tests, unbiased and invariant tests, the likelihood-ratio test and its large-sample chi-squared distribution, p-values and their interpretation, and the problem of testing many hypotheses at once.
Sub-topics
Core questions
- How are the size and power of a test defined, and how are the two types of error traded off?
- What test is most powerful for deciding between two simple hypotheses?
- When does a uniformly most powerful test exist for a one-sided alternative?
- How should significance be controlled when many hypotheses are tested simultaneously?
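The last question can be made concrete with a minimal sketch of two standard corrections that control the family-wise error rate, Bonferroni and Holm. The function names and the p-values are made up for illustration; other procedures (for example, ones that control the false discovery rate) make a different trade-off.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) with alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject


# Hypothetical p-values from four separate tests.
p_values = [0.001, 0.012, 0.020, 0.300]
print(bonferroni_reject(p_values))  # [True, True, False, False]
print(holm_reject(p_values))        # [True, True, True, False]
```

Both procedures keep the probability of any false rejection at or below alpha; Holm never rejects fewer hypotheses than Bonferroni, as the third p-value shows here.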
Key theories
- Neyman-Pearson lemma
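- Among all tests of a given size for two simple hypotheses, the test that rejects when the ratio of the likelihood under the alternative to the likelihood under the null exceeds a threshold, with the threshold chosen to attain that size, is most powerful.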
- Uniformly most powerful and unbiased tests
- For families with a monotone likelihood ratio, a single test is most powerful against every alternative on one side of the null; when no such test exists, optimality is sought within the unbiased or invariant classes.
- Likelihood-ratio tests
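- The generalized likelihood-ratio statistic is the ratio of the likelihood maximized under the null to the likelihood maximized over the full model; under regularity conditions, minus twice its logarithm is asymptotically chi-squared under the null, with degrees of freedom equal to the number of parameters the null constrains, giving a general-purpose test.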
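Two of these ideas can be sketched with the Python standard library alone. The first function assumes independent normal observations with known standard deviation and two simple hypotheses about the mean; in that setting the likelihood ratio is increasing in the sample mean, so the Neyman-Pearson test reduces to a cutoff on the sample mean. The second function is a likelihood-ratio test of a Bernoulli success probability, using the large-sample chi-squared approximation with one degree of freedom. All names and numbers are illustrative assumptions.

```python
from math import erfc, log, sqrt
from statistics import NormalDist


def np_gaussian_test(mu0, mu1, sigma, n, alpha):
    """Most powerful size-alpha test of mean mu0 against mean mu1 > mu0.

    Returns the cutoff on the sample mean (reject when the mean exceeds it)
    and the power of the test at mu1.
    """
    if mu1 <= mu0:
        raise ValueError("this sketch assumes mu1 > mu0")
    se = sigma / sqrt(n)                          # std. error of the sample mean
    cutoff = NormalDist(mu0, se).inv_cdf(1 - alpha)
    power = 1 - NormalDist(mu1, se).cdf(cutoff)
    return cutoff, power


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def bernoulli_lrt(successes, n, p0):
    """Likelihood-ratio test of H0: p = p0 for n Bernoulli trials.

    Returns the statistic 2 * (loglik at the MLE - loglik at p0) and an
    approximate p-value from the chi-squared distribution with 1 d.o.f.
    """
    k, p_hat = successes, successes / n
    stat = 2 * (_xlogy(k, p_hat / p0) + _xlogy(n - k, (1 - p_hat) / (1 - p0)))
    stat = max(stat, 0.0)                         # guard against rounding error
    p_value = erfc(sqrt(stat / 2))                # P(chi2_1 > stat)
    return stat, p_value


cutoff, power = np_gaussian_test(mu0=0.0, mu1=0.5, sigma=1.0, n=25, alpha=0.05)
print(f"reject when the sample mean exceeds {cutoff:.3f}; power = {power:.3f}")

stat, p_value = bernoulli_lrt(successes=60, n=100, p0=0.5)
print(f"LRT statistic = {stat:.3f}, approximate p-value = {p_value:.4f}")
```

With these inputs the Gaussian test has power of roughly 0.80, and the Bernoulli statistic is close to 4.03, just past the 5% chi-squared critical value of about 3.84. The chi-squared p-value is only an approximation and can be poor for small samples or for a null value near 0 or 1.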
Clinical relevance
Hypothesis tests underpin the evaluation of clinical trials, A/B testing, quality control, and signal detection, where controlling false-positive rates and ensuring adequate power directly affect which interventions, products, or discoveries are accepted as real.
History
Fisher popularized significance testing and the use of p-values in the 1920s. Neyman and Pearson introduced the decision-theoretic framework of two hypotheses, two types of error, and power in 1933, and Lehmann's mid-century work, continued with Romano, organized the optimality theory of tests.
Debates
- Fisherian significance versus Neyman-Pearson decisions
- Fisher viewed the p-value as a continuous measure of evidence against the null, while Neyman and Pearson framed testing as a decision with fixed error rates; the two philosophies are often blended in practice and the difference remains contested.
Key figures
- Jerzy Neyman
- Egon Pearson
- Ronald A. Fisher
- Erich L. Lehmann
Related topics
Seminal works
- lehmannRomano2005
Frequently asked questions
- What is the difference between a Type I and a Type II error?
- A Type I error is rejecting a true null hypothesis (a false positive); a Type II error is failing to reject a false null (a false negative). The significance level bounds the probability of the first, and power equals one minus the probability of the second.
- Does a small p-value prove the alternative hypothesis?
- No. A small p-value indicates that data at least as extreme as those observed would be unlikely if the null were true; it is evidence against the null, not the probability that the null is false, and it does not by itself establish practical importance.
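The two error rates in these answers can be checked by simulation. The sketch below assumes a two-sided z-test with known standard deviation; the function names, sample size, and effect size are illustrative choices. It estimates how often the test rejects when the null is true (the Type I error rate) and when the true mean is shifted (the power).

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def two_sided_z_p_value(sample, mu0, sigma):
    """p-value of a two-sided z-test of H0: mean = mu0 with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=20, alpha=0.05,
                   reps=4000, seed=0):
    """Fraction of simulated samples for which the test rejects H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        if two_sided_z_p_value(sample, mu0, sigma) <= alpha:
            rejections += 1
    return rejections / reps


type_i_rate = rejection_rate(true_mu=0.0)   # null is true
power = rejection_rate(true_mu=0.5)         # null is false
print(f"estimated Type I error rate: {type_i_rate:.3f}")
print(f"estimated power at mean 0.5: {power:.3f}")
```

The first estimate should land near the chosen level of 0.05, and the second near the theoretical power of about 0.61 for this sample size and shift; one minus the second estimate approximates the Type II error rate. Neither number says anything about the probability that the null itself is true.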