About:

Hypothesis tests are used to reject, or fail to reject, a “null” hypothesis \( H_0 \) about the distribution of random variables, given realizations of these random variables. Since in general it is not possible to make statements with certainty, one is interested in the probability \( p \) of seeing random variates at least as extreme as the ones observed, assuming that \( H_0 \) is true. If this probability \( p \) is small, the test rejects \( H_0 \); more precisely, \( H_0 \) is rejected at every significance level \( \alpha \geq p \). Falsifying \( H_0 \) is the canonical goal when employing a hypothesis test. That is, hypothesis tests are typically used to substantiate that the alternative hypothesis \( H_1 \) holds instead.

Hypothesis tests may be divided into parametric and non-parametric tests. A parametric test assumes certain distributions and makes inferences about parameters of those distributions (e.g., the mean of a normal distribution). Formally, there is a given domain of possible parameters \( \Gamma \), and the null hypothesis \( H_0 \) is the event that the true parameter \( \gamma_0 \) lies in \( \Gamma_0 \), where \( \Gamma_0 \subsetneq \Gamma \). Non-parametric tests, on the other hand, do not assume any particular distribution of the sample (e.g., a non-parametric test may simply test whether two distributions are similar).

The first step of a hypothesis test is to compute a test statistic, which is a function of the random variates and therefore a random variate itself. A hypothesis test relies on the distribution of the test statistic being (approximately) known under \( H_0 \). The \( p \)-value is then the probability of seeing a test statistic at least as extreme as the one observed, assuming that \( H_0 \) is true. When the null hypothesis corresponds to a family of distributions (e.g., in a parametric test where \( \Gamma_0 \) is not a singleton set), the \( p \)-value is the supremum of these probabilities over all distributions allowed by the null hypothesis.
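
The definition above can be written out as a formula. As a sketch, let \( T \) denote the test statistic and \( t \) its observed value, and assume (for illustration only) a one-sided test in which larger values of \( T \) count as more extreme. The symbols \( T \), \( t \), and \( \Pr_{\gamma} \) are notation introduced here, not taken from the library.

\[
    p = \sup_{\gamma \in \Gamma_0} \Pr_{\gamma}(T \geq t)
\]

Here \( \Pr_{\gamma} \) is the probability computed under the assumption that the true parameter is \( \gamma \). If \( \Gamma_0 \) contains a single parameter, the supremum is simply that one probability.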

Input:

Input data is assumed to be normalized with all values stored row-wise. In general, the following inputs are expected.

One-sample tests expect the following form:

{TABLE|VIEW} source (
    ...
    value DOUBLE PRECISION
    ...
)

Two-sample tests expect the following form:

{TABLE|VIEW} source (
    ...
    first BOOLEAN,
    value DOUBLE PRECISION
    ...
)

Here, first indicates whether a value is from the first (if TRUE) or the second sample (if FALSE).
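
Data that is stored with one column per sample has to be reshaped before it fits this form. The following is a minimal sketch of one way to do that with a view. The table, view, and column names (measurements_wide, measurements_two_sample, subject_id, control, treatment) and all values are hypothetical and serve only as illustration.

```sql
-- Hypothetical wide table: one row per subject, one column per sample.
CREATE TABLE measurements_wide (
    subject_id INTEGER,
    control    DOUBLE PRECISION,
    treatment  DOUBLE PRECISION
);

-- Made-up values, for illustration only.
INSERT INTO measurements_wide VALUES
    (1, 4.1, 5.0),
    (2, 3.8, 4.6),
    (3, 4.4, 5.3);

-- Row-wise view in the two-sample form: one value per row, with
-- first = TRUE for the first sample and first = FALSE for the second.
CREATE VIEW measurements_two_sample AS
    SELECT TRUE  AS first, control   AS value FROM measurements_wide
    UNION ALL
    SELECT FALSE AS first, treatment AS value FROM measurements_wide;
```

Using a view rather than a copy keeps the original table as the single source of the data, at the cost of evaluating the UNION ALL on every query.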

Many-sample tests expect the following form:

{TABLE|VIEW} source (
    ...
    group INTEGER,
    value DOUBLE PRECISION
    ...
)
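
Note that GROUP is a reserved word in SQL, so a column literally named group has to be double-quoted or given a different name. The sketch below shows a table in the many-sample form together with a quick sanity check that uses only standard aggregates. The table name yields and all values are hypothetical.

```sql
-- Hypothetical many-sample table; "group" is quoted because GROUP is reserved.
CREATE TABLE yields (
    "group" INTEGER,
    value   DOUBLE PRECISION
);

-- Made-up values: three groups with three observations each.
INSERT INTO yields VALUES
    (1, 20.1), (1, 19.8), (1, 21.0),
    (2, 22.4), (2, 23.1), (2, 22.0),
    (3, 18.9), (3, 19.5), (3, 19.2);

-- Sanity check before running a test: sample size and mean per group.
SELECT "group", count(*) AS n, avg(value) AS mean
FROM yields
GROUP BY "group"
ORDER BY "group";
```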

Usage:

All tests are implemented as aggregate functions. The non-parametric (rank-based) tests are implemented as ordered aggregate functions and thus require an ORDER BY clause. In the following, the simplest forms of usage are given. Specific function signatures, as described in hypothesis_tests.sql_in, may require additional arguments or a different ORDER BY clause.

Run a parametric one-sample test:
```
SELECT test(value) FROM source
```
Run a parametric two-sample test:
```
SELECT test(first, value) FROM source
```

Run a non-parametric one-sample test:

SELECT test(value ORDER BY value) FROM source

Run a non-parametric two-sample test:

SELECT test(first, value ORDER BY value) FROM source
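
As a more concrete sketch of the patterns above, the queries below assume that the library is installed in a schema named madlib and that it provides a one-sample t-test aggregate named t_test_one and a Mann-Whitney aggregate named mw_test. These names, the schema, the composite return type, and the tables one_sample and two_sample (laid out as described under Input) are all assumptions made for illustration and should be checked against hypothesis_tests.sql_in.

```sql
-- Parametric one-sample test (assumed name: madlib.t_test_one).
-- Assuming the test is against a mean of 0, shifting the values by 7.5
-- turns it into a test of the hypothesis "mean = 7.5".
SELECT (madlib.t_test_one(value - 7.5)).*
FROM one_sample;

-- Non-parametric two-sample test (assumed name: madlib.mw_test).
-- This is an ordered aggregate, so the ORDER BY goes inside the call.
SELECT (madlib.mw_test(first, value ORDER BY value)).*
FROM two_sample;
```

The (...).* syntax expands a composite result into separate columns, which is convenient if the aggregate returns the statistic and the p-values together in one value.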

Examples:

See hypothesis_tests.sql_in for examples for each of the aggregate functions.

Literature:

[1] M. Hollander, D. Wolfe: Nonparametric Statistical Methods, 2nd edition, Wiley, 1999

[2] E. Lehmann, J. Romano: Testing Statistical Hypotheses, 3rd edition, Springer, 2005

See Also: File hypothesis_tests.sql_in documenting the SQL functions.