User Documentation
hypothesis_tests.sql_in
1 /* ----------------------------------------------------------------------- */
2 /**
3  *
4  * @file hypothesis_tests.sql_in
5  *
6  * @brief SQL functions for statistical hypothesis tests
7  *
8  * @sa For an overview of hypothesis-test functions, see the module
9  * description \ref grp_stats_tests.
10  *
11  */
12  /* ----------------------------------------------------------------------- */
13 
14 m4_include(`SQLCommon.m4')
15 m4_changequote(<!,!>)
16 
17 /**
18 @addtogroup grp_stats_tests
19 
20 @about
21 
22 Hypothesis tests are used to confirm or reject a <em>“null” hypothesis</em>
23 \f$ H_0 \f$ about the distribution of random variables, given realizations of
24 these random variables. Since in general it is not possible to make statements
25 with certainty, one is interested in the probability \f$ p \f$ of seeing random
26 variates at least as extreme as the ones observed, assuming that \f$ H_0 \f$ is
27 true. If this probability \f$ p \f$ is small, \f$ H_0 \f$ will be rejected by
28 the test with <em>significance level</em> \f$ p \f$. Falsifying \f$ H_0 \f$ is
29 the canonical goal when employing a hypothesis test. That is, hypothesis tests are
30 typically used to substantiate that the <em>alternative
31 hypothesis</em> \f$ H_1 \f$ is true instead.
32 
33 Hypothesis tests may be divided into parametric and non-parametric tests. A
34 parametric test assumes certain distributions and makes inferences about
35 parameters of the distributions (e.g., the mean of a normal distribution).
36 Formally, there is a given domain of possible parameters \f$ \Gamma \f$ and the
37 null hypothesis \f$ H_0 \f$ is the event that the true parameter
38 \f$ \gamma_0 \in \Gamma_0 \f$, where \f$ \Gamma_0 \subsetneq \Gamma \f$.
39 Non-parametric tests, on the other hand, do not assume any particular
40 distribution of the sample (e.g., a non-parametric test may simply test if two
41 distributions are similar).
42 
43 The first step of a hypothesis test is to compute a <em>test statistic</em>,
44 which is a function of the random variates, i.e., a random variate itself.
45 A hypothesis test relies on the distribution of the test statistic being
46 (approximately) known. The \f$ p \f$-value is then the probability of seeing a
47 test statistic at least as extreme as the one observed, assuming that
48 \f$ H_0 \f$ is true. In a case where the null hypothesis corresponds to a family
49 of distributions (e.g., in a parametric test where \f$ \Gamma_0 \f$ is not a
50 singleton set), the \f$ p \f$-value is the supremum, over all possible
51 distributions according to the null hypothesis, of these probabilities.
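
In symbols, as a sketch under notation assumed here for illustration
(\f$ T \f$ denotes the test statistic, \f$ t \f$ its observed value, and large
values of \f$ T \f$ count as extreme), the parametric case above can be written
as
\f[
    p = \sup_{\gamma \in \Gamma_0} \Pr_\gamma[T \geq t]
\f]
where \f$ \Pr_\gamma \f$ is the probability when the true parameter is
\f$ \gamma \f$.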
52 
53 @input
54 
55 Input data is assumed to be normalized with all values stored row-wise. In
56 general, the following inputs are expected.
57 
58 One-sample tests expect the following form:
59 <pre>{TABLE|VIEW} <em>source</em> (
60  ...
61  <em>value</em> DOUBLE PRECISION
62  ...
63 )</pre>
64 
65 Two-sample tests expect the following form:
66 <pre>{TABLE|VIEW} <em>source</em> (
67  ...
68  <em>first</em> BOOLEAN,
69  <em>value</em> DOUBLE PRECISION
70  ...
71 )</pre>
72 Here, \c first indicates whether a value is from the first (if \c TRUE) or the
73 second sample (if \c FALSE).
74 
75 Many-sample tests expect the following form:
76 <pre>{TABLE|VIEW} <em>source</em> (
77  ...
78  <em>group</em> INTEGER,
79  <em>value</em> DOUBLE PRECISION
80  ...
81 )</pre>
82 
83 @usage
84 
85 All tests are implemented as aggregate functions. The non-parametric
86 (rank-based) tests are implemented as ordered aggregate functions and thus
87 require an <tt>ORDER BY</tt> clause. In the following, the simplest
88 forms of usage are given. Specific function signatures, as described in
89 \ref hypothesis_tests.sql_in, may ask for more arguments or for a different
90 <tt>ORDER BY</tt> clause.
91 
92 - Run a parametric one-sample test:
93  <pre>SELECT <em>test</em>(<em>value</em>) FROM <em>source</em></pre>
94 - Run a parametric two-sample test:
95  <pre>SELECT <em>test</em>(<em>first</em>, <em>value</em>) FROM <em>source</em></pre>
96 - Run a non-parametric one-sample test:
97  <pre>SELECT <em>test</em>(<em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>
98 - Run a non-parametric two-sample test:
99  <pre>SELECT <em>test</em>(<em>first</em>, <em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>
100 
101 @examp
102 
103 See \ref hypothesis_tests.sql_in for examples for each of the aggregate
104 functions.
105 
106 @literature
107 
108 [1] M. Hollander, D. Wolfe: <em>Nonparametric Statistical Methods</em>,
109  2nd edition, Wiley, 1999
110 
111 [2] E. Lehmann, J. Romano: <em>Testing Statistical Hypotheses</em>, 3rd edition,
112  Springer, 2005
113 
114 @sa File hypothesis_tests.sql_in documenting the SQL functions.
115 */
116 
117 CREATE TYPE MADLIB_SCHEMA.t_test_result AS (
118  statistic DOUBLE PRECISION,
119  df DOUBLE PRECISION,
120  p_value_one_sided DOUBLE PRECISION,
121  p_value_two_sided DOUBLE PRECISION
122 );
123 
124 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_transition(
125  state DOUBLE PRECISION[],
126  value DOUBLE PRECISION
127 ) RETURNS DOUBLE PRECISION[]
128 AS 'MODULE_PATHNAME'
129 LANGUAGE C
130 IMMUTABLE
131 STRICT;
132 
133 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_merge_states(
134  state1 DOUBLE PRECISION[],
135  state2 DOUBLE PRECISION[])
136 RETURNS DOUBLE PRECISION[]
137 AS 'MODULE_PATHNAME'
138 LANGUAGE C
139 IMMUTABLE STRICT;
140 
141 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_final(
142  state DOUBLE PRECISION[])
143 RETURNS MADLIB_SCHEMA.t_test_result
144 AS 'MODULE_PATHNAME'
145 LANGUAGE C IMMUTABLE STRICT;
146 
147 CREATE TYPE MADLIB_SCHEMA.f_test_result AS (
148  statistic DOUBLE PRECISION,
149  df1 DOUBLE PRECISION,
150  df2 DOUBLE PRECISION,
151  p_value_one_sided DOUBLE PRECISION,
152  p_value_two_sided DOUBLE PRECISION
153 );
154 
155 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.f_test_final(
156  state DOUBLE PRECISION[])
157 RETURNS MADLIB_SCHEMA.f_test_result
158 AS 'MODULE_PATHNAME'
159 LANGUAGE C IMMUTABLE STRICT;
160 
161 
162 /**
163  * @brief Perform one-sample or dependent paired Student t-test
164  *
165  * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
166  * \f$ X_1, \dots, X_n \sim N(\mu, \sigma^2) \f$ with unknown parameters \f$ \mu \f$ and
167  * \f$ \sigma^2 \f$, test the null hypotheses \f$ H_0 : \mu \leq 0 \f$ and
168  * \f$ H_0 : \mu = 0 \f$.
169  *
170  * @param value Value of random variate \f$ x_i \f$
171  *
172  * @return A composite value as follows. We denote by \f$ \bar x \f$ the
173  * \ref sample_mean "sample mean" and by \f$ s^2 \f$ the
174  * \ref sample_variance "sample variance".
175  * - <tt>statistic FLOAT8</tt> - Statistic
176  * \f[
177  * t = \frac{\sqrt n \cdot \bar x}{s}
178  * \f]
179  * The corresponding random
180  * variable is Student-t distributed with
181  * \f$ (n - 1) \f$ degrees of freedom.
182  * - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n - 1) \f$
183  * - <tt>p_value_one_sided FLOAT8</tt> - Upper bound on one-sided p-value.
184  * In detail, the result is \f$ \Pr[\bar X \geq \bar x \mid \mu = 0] \f$,
185  * which is an upper bound on
186  * \f$ \Pr[\bar X \geq \bar x \mid \mu \leq 0] \f$. Computed as
187  * <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
188  * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
189  * \f$ \Pr[ |\bar X| \geq |\bar x| \mid \mu = 0] \f$. Computed as
190  * <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
191  *
192  * @usage
193  * - One-sample t-test: Test null hypothesis that the mean of a sample is at
194  * most (or equal to, respectively) \f$ \mu_0 \f$:
195  * <pre>SELECT (t_test_one(<em>value</em> - <em>mu_0</em>)).* FROM <em>source</em></pre>
196  * - Dependent paired t-test: Test null hypothesis that the mean difference
197  * between the first and second value in each pair is at most (or equal to,
198  * respectively) \f$ \mu_0 \f$:
199  * <pre>SELECT (t_test_one(<em>first</em> - <em>second</em> - <em>mu_0</em>)).*
200  * FROM <em>source</em></pre>
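 *
 * A minimal sketch of the one-sample usage. The table \c exam_scores, its
 * column \c score, and the reference mean 70 are hypothetical and serve only
 * as an illustration; the aggregate is assumed to be reachable via the schema
 * search path. The second query recomputes the statistic and the degrees of
 * freedom from the formulas above in plain SQL, as a cross-check:
 * <pre>-- Hypothetical sample data
 * CREATE TABLE exam_scores (score DOUBLE PRECISION);
 * INSERT INTO exam_scores VALUES (72), (68), (75), (71), (69), (74);
 *
 * -- One-sample t-test of H_0: mean score is at most (or equal to) 70
 * SELECT (t_test_one(score - 70)).* FROM exam_scores;
 *
 * -- Cross-check: t = sqrt(n) * mean / s with n - 1 degrees of freedom
 * SELECT sqrt(count(*)) * avg(score - 70) / stddev_samp(score - 70) AS statistic,
 *        count(*) - 1 AS df
 * FROM exam_scores;</pre>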
201  */
202 CREATE AGGREGATE MADLIB_SCHEMA.t_test_one(
203  /*+ value */ DOUBLE PRECISION) (
204 
205  SFUNC=MADLIB_SCHEMA.t_test_one_transition,
206  STYPE=DOUBLE PRECISION[],
207  FINALFUNC=MADLIB_SCHEMA.t_test_one_final,
208  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
209  INITCOND='{0,0,0,0,0,0,0}'
210 );
211 
212 
213 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_transition(
214  state DOUBLE PRECISION[],
215  "first" BOOLEAN,
216  "value" DOUBLE PRECISION)
217 RETURNS DOUBLE PRECISION[]
218 AS 'MODULE_PATHNAME'
219 LANGUAGE C
220 IMMUTABLE
221 STRICT;
222 
223 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_pooled_final(
224  state DOUBLE PRECISION[])
225 RETURNS MADLIB_SCHEMA.t_test_result
226 AS 'MODULE_PATHNAME'
227 LANGUAGE C IMMUTABLE STRICT;
228 
229 /**
230  * @brief Perform two-sample pooled (i.e., equal variances) Student t-test
231  *
232  * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
233  * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma^2) \f$ and
234  * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters
235  * \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses
236  * \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
237  *
238  * @param first Indicator whether \c value is from first sample
239  * \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
240  * \f$ y_1, \dots, y_m \f$ (if \c FALSE)
241  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
242  *
243  * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
244  * the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
245  * \ref sample_variance "sample variances".
246  * - <tt>statistic FLOAT8</tt> - Statistic
247  * \f[
248  * t = \frac{\bar x - \bar y}{s_p \sqrt{1/n + 1/m}}
249  * \f]
250  * where
251  * \f[
252  * s_p^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2
253  * + \sum_{i=1}^m (y_i - \bar y)^2}
254  * {n + m - 2}
255  * \f]
256  * is the <em>pooled variance</em>.
257  * The corresponding random
258  * variable is Student-t distributed with
259  * \f$ (n + m - 2) \f$ degrees of freedom.
260  * - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n + m - 2) \f$
261  * - <tt>p_value_one_sided FLOAT8</tt> - Upper bound on one-sided p-value.
262  * In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
263  * which is an upper bound on
264  * \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
265  * Computed as
266  * <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
267  * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
268  * \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
269  * Computed as
270  * <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
271  *
272  * @usage
273  * - Two-sample pooled t-test: Test null hypothesis that the mean of the first
274  * sample is at most (or equal to, respectively) the mean of the second
275  * sample:
276  * <pre>SELECT (t_test_two_pooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
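 *
 * A minimal sketch with hypothetical names (table \c sample_ab, columns
 * \c is_first and \c value), assuming the aggregate is reachable via the
 * schema search path. The second query recomputes the statistic and the
 * degrees of freedom from the pooled-variance formula above in plain SQL:
 * <pre>-- Hypothetical data: TRUE marks the first sample
 * CREATE TABLE sample_ab (is_first BOOLEAN, value DOUBLE PRECISION);
 * INSERT INTO sample_ab VALUES
 *     (TRUE, 5.1), (TRUE, 4.8), (TRUE, 5.6), (TRUE, 5.0),
 *     (FALSE, 4.2), (FALSE, 4.9), (FALSE, 4.4);
 *
 * SELECT (t_test_two_pooled(is_first, value)).* FROM sample_ab;
 *
 * -- Cross-check: (n - 1) * s_X^2 equals the sum of squared deviations of x
 * SELECT (avg_x - avg_y)
 *        / sqrt(((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)
 *               * (1.0 / n + 1.0 / m)) AS statistic,
 *        n + m - 2 AS df
 * FROM (
 *     SELECT avg(CASE WHEN is_first THEN value END) AS avg_x,
 *            avg(CASE WHEN NOT is_first THEN value END) AS avg_y,
 *            var_samp(CASE WHEN is_first THEN value END) AS var_x,
 *            var_samp(CASE WHEN NOT is_first THEN value END) AS var_y,
 *            count(CASE WHEN is_first THEN value END) AS n,
 *            count(CASE WHEN NOT is_first THEN value END) AS m
 *     FROM sample_ab
 * ) s;</pre>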
277  */
278 CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_pooled(
279  /*+ "first" */ BOOLEAN,
280  /*+ "value" */ DOUBLE PRECISION) (
281 
282  SFUNC=MADLIB_SCHEMA.t_test_two_transition,
283  STYPE=DOUBLE PRECISION[],
284  FINALFUNC=MADLIB_SCHEMA.t_test_two_pooled_final,
285  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
286  INITCOND='{0,0,0,0,0,0,0}'
287 );
288 
289 
290 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_unpooled_final(
291  state DOUBLE PRECISION[])
292 RETURNS MADLIB_SCHEMA.t_test_result
293 AS 'MODULE_PATHNAME'
294 LANGUAGE C IMMUTABLE STRICT;
295 
296 /**
297  * @brief Perform unpooled (i.e., unequal variances) t-test (also known as
298  * Welch's t-test)
299  *
300  * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
301  * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma_X^2) \f$ and
302  * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
303  * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null
304  * hypotheses \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
305  *
306  * @param first Indicator whether \c value is from first sample
307  * \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
308  * \f$ y_1, \dots, y_m \f$ (if \c FALSE)
309  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
310  *
311  * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
312  * the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
313  * \ref sample_variance "sample variances".
314  * - <tt>statistic FLOAT8</tt> - Statistic
315  * \f[
316  * t = \frac{\bar x - \bar y}{\sqrt{s_X^2/n + s_Y^2/m}}
317  * \f]
318  * The corresponding random variable is approximately Student-t distributed
319  * with
320  * \f[
321  * \frac{(s_X^2 / n + s_Y^2 / m)^2}{(s_X^2 / n)^2/(n-1) + (s_Y^2 / m)^2/(m-1)}
322  * \f]
323  * degrees of freedom (Welch–Satterthwaite formula).
324  * - <tt>df FLOAT8</tt> - Degrees of freedom (as above)
325  * - <tt>p_value_one_sided FLOAT8</tt> - Upper bound on one-sided p-value.
326  * In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
327  * which is an upper bound on
328  * \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
329  * Computed as
330  * <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
331  * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
332  * \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
333  * Computed as
334  * <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
335  *
336  * @usage
337  * - Two-sample unpooled t-test: Test null hypothesis that the mean of the
338  * first sample is at most (or equal to, respectively) the mean of the second
339  * sample:
340  * <pre>SELECT (t_test_two_unpooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
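 *
 * A minimal sketch with hypothetical names (table \c welch_ab, columns
 * \c is_first and \c value), assuming the aggregate is reachable via the
 * schema search path. The second query recomputes the statistic and the
 * Welch–Satterthwaite degrees of freedom from the formulas above in plain SQL:
 * <pre>-- Hypothetical data: TRUE marks the first sample
 * CREATE TABLE welch_ab (is_first BOOLEAN, value DOUBLE PRECISION);
 * INSERT INTO welch_ab VALUES
 *     (TRUE, 12.0), (TRUE, 15.5), (TRUE, 9.8), (TRUE, 14.1),
 *     (FALSE, 10.2), (FALSE, 10.9), (FALSE, 10.4);
 *
 * SELECT (t_test_two_unpooled(is_first, value)).* FROM welch_ab;
 *
 * -- Cross-check of the statistic and the degrees of freedom
 * SELECT (avg_x - avg_y) / sqrt(var_x / n + var_y / m) AS statistic,
 *        (var_x / n + var_y / m) ^ 2
 *        / ((var_x / n) ^ 2 / (n - 1) + (var_y / m) ^ 2 / (m - 1)) AS df
 * FROM (
 *     SELECT avg(CASE WHEN is_first THEN value END) AS avg_x,
 *            avg(CASE WHEN NOT is_first THEN value END) AS avg_y,
 *            var_samp(CASE WHEN is_first THEN value END) AS var_x,
 *            var_samp(CASE WHEN NOT is_first THEN value END) AS var_y,
 *            count(CASE WHEN is_first THEN value END) AS n,
 *            count(CASE WHEN NOT is_first THEN value END) AS m
 *     FROM welch_ab
 * ) s;</pre>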
341  */
342 CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_unpooled(
343  /*+ "first" */ BOOLEAN,
344  /*+ "value" */ DOUBLE PRECISION) (
345 
346  SFUNC=MADLIB_SCHEMA.t_test_two_transition,
347  STYPE=DOUBLE PRECISION[],
348  FINALFUNC=MADLIB_SCHEMA.t_test_two_unpooled_final,
349  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
350  INITCOND='{0,0,0,0,0,0,0}'
351 );
352 
353 /**
354  * @brief Perform Fisher F-test
355  *
356  * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
357  * i.i.d. random variables \f$ X_1, \dots, X_m \sim N(\mu_X, \sigma_X^2) \f$ and
358  * \f$ Y_1, \dots, Y_n \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
359  * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null hypotheses
360  * \f$ H_0 : \sigma_X \leq \sigma_Y \f$ and \f$ H_0 : \sigma_X = \sigma_Y \f$.
361  *
362  * @param first Indicator whether \c value is from first sample
363  * \f$ x_1, \dots, x_m \f$ (if \c TRUE) or from second sample
364  * \f$ y_1, \dots, y_n \f$ (if \c FALSE)
365  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
366  *
367  * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
368  * the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
369  * \ref sample_variance "sample variances".
370  * - <tt>statistic FLOAT8</tt> - Statistic
371  * \f[
372  * f = \frac{s_Y^2}{s_X^2}
373  * \f]
374  * The corresponding random
375  * variable is F-distributed with
376  * \f$ (n - 1) \f$ degrees of freedom in the numerator and
377  * \f$ (m - 1) \f$ degrees of freedom in the denominator.
378  * - <tt>df1 FLOAT8</tt> - Degrees of freedom in the numerator \f$ (n - 1) \f$
379  * - <tt>df2 FLOAT8</tt> - Degrees of freedom in the denominator \f$ (m - 1) \f$
380  * - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
381  * In detail, the result is \f$ \Pr[F \geq f \mid \sigma_X = \sigma_Y] \f$,
382  * which is a lower bound on
383  * \f$ \Pr[F \geq f \mid \sigma_X \leq \sigma_Y] \f$. Computed as
384  * <tt>(1.0 - \ref fisher_f_cdf "fisher_f_cdf"(statistic))</tt>.
385  * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
386  * \f$ 2 \cdot \min \{ p, 1 - p \} \f$ where
387  * \f$ p = \Pr[ F \geq f \mid \sigma_X = \sigma_Y] \f$. Computed as
388  * <tt>(2 * min(p_value_one_sided, 1. - p_value_one_sided))</tt>.
389  *
390  * @usage
391  * - Test null hypothesis that the variance of the first sample is at most (or
392  * equal to, respectively) the variance of the second sample:
393  * <pre>SELECT (f_test(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
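 *
 * A minimal sketch with hypothetical names (table \c var_ab, columns
 * \c is_first and \c value), assuming the aggregate is reachable via the
 * schema search path. The second query recomputes the statistic and both
 * degrees of freedom from the formulas above in plain SQL:
 * <pre>-- Hypothetical data: TRUE marks the first sample (x), FALSE the second (y)
 * CREATE TABLE var_ab (is_first BOOLEAN, value DOUBLE PRECISION);
 * INSERT INTO var_ab VALUES
 *     (TRUE, 3.1), (TRUE, 2.9), (TRUE, 3.4), (TRUE, 3.0),
 *     (FALSE, 2.2), (FALSE, 4.1), (FALSE, 3.6), (FALSE, 1.9);
 *
 * SELECT (f_test(is_first, value)).* FROM var_ab;
 *
 * -- Cross-check: f = s_Y^2 / s_X^2, where the first sample has size m
 * -- and the second sample has size n
 * SELECT var_y / var_x AS statistic, n - 1 AS df1, m - 1 AS df2
 * FROM (
 *     SELECT var_samp(CASE WHEN is_first THEN value END) AS var_x,
 *            var_samp(CASE WHEN NOT is_first THEN value END) AS var_y,
 *            count(CASE WHEN is_first THEN value END) AS m,
 *            count(CASE WHEN NOT is_first THEN value END) AS n
 *     FROM var_ab
 * ) s;</pre>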
394  *
395  * @internal We reuse the two-sample t-test transition and merge functions.
396  */
397 CREATE AGGREGATE MADLIB_SCHEMA.f_test(
398  /*+ "first" */ BOOLEAN,
399  /*+ "value" */ DOUBLE PRECISION) (
400 
401  SFUNC=MADLIB_SCHEMA.t_test_two_transition,
402  STYPE=DOUBLE PRECISION[],
403  FINALFUNC=MADLIB_SCHEMA.f_test_final,
404  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
405  INITCOND='{0,0,0,0,0,0,0}'
406 );
407 
408 
409 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
410  state DOUBLE PRECISION[],
411  observed BIGINT,
412  expected DOUBLE PRECISION,
413  df BIGINT
414 ) RETURNS DOUBLE PRECISION[]
415 AS 'MODULE_PATHNAME'
416 LANGUAGE C
417 IMMUTABLE
418 STRICT;
420 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
421  state DOUBLE PRECISION[],
422  observed BIGINT,
423  expected DOUBLE PRECISION
424 ) RETURNS DOUBLE PRECISION[]
425 AS 'MODULE_PATHNAME'
426 LANGUAGE C
427 IMMUTABLE
428 STRICT;
429 
430 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
431  state DOUBLE PRECISION[],
432  observed BIGINT
433 ) RETURNS DOUBLE PRECISION[]
434 AS 'MODULE_PATHNAME'
435 LANGUAGE C
436 IMMUTABLE
437 STRICT;
438 
439 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_merge_states(
440  state1 DOUBLE PRECISION[],
441  state2 DOUBLE PRECISION[])
442 RETURNS DOUBLE PRECISION[]
443 AS 'MODULE_PATHNAME'
444 LANGUAGE C
445 IMMUTABLE
446 STRICT;
447 
448 CREATE TYPE MADLIB_SCHEMA.chi2_test_result AS (
449  statistic DOUBLE PRECISION,
450  p_value DOUBLE PRECISION,
451  df BIGINT,
452  phi DOUBLE PRECISION,
453  contingency_coef DOUBLE PRECISION
454 );
455 
456 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_final(
457  state DOUBLE PRECISION[]
458 ) RETURNS MADLIB_SCHEMA.chi2_test_result
459 AS 'MODULE_PATHNAME'
460 LANGUAGE C
461 IMMUTABLE
462 STRICT;
463 
464 /**
465  * @brief Perform Pearson's chi-squared goodness-of-fit test
466  *
467  * Let \f$ n_1, \dots, n_k \f$ be a realization of a (vector) random variable
468  * \f$ N = (N_1, \dots, N_k) \f$ that follows the multinomial distribution with
469  * parameters \f$ k \f$ and \f$ p = (p_1, \dots, p_k) \f$. Test the null
470  * hypothesis \f$ H_0 : p = p^0 \f$.
471  *
472  * @param observed Number \f$ n_i \f$ of observations of the current event/row
473  * @param expected Expected number of observations of current event/row. This
474  * number is not required to be normalized. That is, \f$ p^0_i \f$ will be
475  * taken as \c expected divided by <tt>sum(expected)</tt>. Hence, if this
476  * parameter is not specified, chi2_gof_test() will by default use
477  * \f$ p^0 = (\frac 1k, \dots, \frac 1k) \f$, i.e., test that \f$ p \f$ is a
478  * discrete uniform distribution.
479  * @param df Degrees of freedom. This is the number of events reduced by the
480  * degree of freedom lost by using the observed numbers for defining the
481  * expected number of observations. If this parameter is 0, the degree
482  * of freedom is taken as \f$ (k - 1) \f$.
483  *
484  * @return A composite value as follows. Let \f$ n = \sum_{i=1}^k n_i \f$.
485  * - <tt>statistic FLOAT8</tt> - Statistic
486  * \f[
487  * \chi^2 = \sum_{i=1}^k \frac{(n_i - np_i)^2}{np_i}
488  * \f]
489  * The corresponding random
490  * variable is approximately chi-squared distributed with
491  * \c df degrees of freedom.
492  * - <tt>df BIGINT</tt> - Degrees of freedom
493  * - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e.,
494  * \f$ \Pr[X^2 \geq \chi^2 \mid p = p^0] \f$. Computed as
495  * <tt>(1.0 - \ref chi_squared_cdf "chi_squared_cdf"(statistic))</tt>.
496  * - <tt>phi FLOAT8</tt> - Phi coefficient, i.e.,
497  * \f$ \phi = \sqrt{\frac{\chi^2}{n}} \f$
498  * - <tt>contingency_coef FLOAT8</tt> - Contingency coefficient, i.e.,
499  * \f$ \sqrt{\frac{\chi^2}{n + \chi^2}} \f$
500  *
501  * @usage
502  * - Test null hypothesis that all possible outcomes of a categorical variable
503  * are equally likely:
504  * <pre>SELECT (chi2_gof_test(<em>observed</em>, 1, 0)).* FROM <em>source</em></pre>
505  * - Test null hypothesis that two categorical variables are independent.
506  * Such data is often shown in a <em>contingency table</em> (also known as
507  * \em crosstab). A crosstab is a matrix where possible values for the first
508  * variable correspond to rows and values for the second variable to
509  * columns. The matrix elements are the observation frequencies of the
510  * joint occurrence of the respective values.
511  * chi2_gof_test() assumes that the crosstab is stored in normalized form,
512  * i.e., there are three columns <tt><em>var1</em></tt>,
513  * <tt><em>var2</em></tt>, <tt><em>observed</em></tt>.
514  * <pre>SELECT (chi2_gof_test(<em>observed</em>, expected, deg_freedom)).*
515  *FROM (
516  * SELECT
517  * <em>observed</em>,
518  * sum(<em>observed</em>) OVER (PARTITION BY var1)::DOUBLE PRECISION
519  * * sum(<em>observed</em>) OVER (PARTITION BY var2) AS expected
520  * FROM <em>source</em>
521  *) p, (
522  * SELECT
523  * (count(DISTINCT <em>var1</em>) - 1) * (count(DISTINCT <em>var2</em>) - 1) AS deg_freedom
524  * FROM <em>source</em>
525  *) q;</pre>
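 *
 * A minimal sketch of the uniform null hypothesis (the first usage above),
 * with hypothetical names (table \c die_rolls, columns \c face and
 * \c observed) and assuming the aggregate is reachable via the schema search
 * path. It uses the single-argument form, which per the parameter description
 * tests against the discrete uniform distribution. The second query recomputes
 * the statistic with \f$ p^0_i = \frac 1k \f$ in plain SQL:
 * <pre>-- Hypothetical data: one row per face with its observed count
 * CREATE TABLE die_rolls (face INTEGER, observed BIGINT);
 * INSERT INTO die_rolls VALUES
 *     (1, 18), (2, 22), (3, 17), (4, 25), (5, 20), (6, 18);
 *
 * SELECT (chi2_gof_test(observed)).* FROM die_rolls;
 *
 * -- Cross-check: expected count per face is n / k, with k - 1 degrees of freedom
 * SELECT sum((observed - expected) ^ 2 / expected) AS statistic,
 *        count(*) - 1 AS df
 * FROM (
 *     SELECT observed,
 *            (sum(observed) OVER ())::DOUBLE PRECISION / count(*) OVER () AS expected
 *     FROM die_rolls
 * ) t;</pre>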
526  */
527 CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
528  /*+ observed */ BIGINT,
529  /*+ expected */ DOUBLE PRECISION /*+ DEFAULT 1 */,
530  /*+ df */ BIGINT /*+ DEFAULT 0 */
531 ) (
532  SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
533  STYPE=DOUBLE PRECISION[],
534  FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
535  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
536  INITCOND='{0,0,0,0,0,0}'
537 );
538 
539 CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
540  /*+ observed */ BIGINT,
541  /*+ expected */ DOUBLE PRECISION
542 ) (
543  SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
544  STYPE=DOUBLE PRECISION[],
545  FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
546  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
547  INITCOND='{0,0,0,0,0,0,0}'
548 );
550 CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
551  /*+ observed */ BIGINT
552 ) (
553  SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
554  STYPE=DOUBLE PRECISION[],
555  FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
556  m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
557  INITCOND='{0,0,0,0,0,0,0}'
558 );
559 
560 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_transition(
561  state DOUBLE PRECISION[],
562  "first" BOOLEAN,
563  "value" DOUBLE PRECISION,
564  "numFirst" BIGINT,
565  "numSecond" BIGINT
566 ) RETURNS DOUBLE PRECISION[]
567 AS 'MODULE_PATHNAME'
568 LANGUAGE C
569 IMMUTABLE
570 STRICT;
571 
572 CREATE TYPE MADLIB_SCHEMA.ks_test_result AS (
573  statistic DOUBLE PRECISION,
574  k_statistic DOUBLE PRECISION,
575  p_value DOUBLE PRECISION
576 );
577 
578 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_final(
579  state DOUBLE PRECISION[])
580 RETURNS MADLIB_SCHEMA.ks_test_result
581 AS 'MODULE_PATHNAME'
582 LANGUAGE C IMMUTABLE STRICT;
583 
584 /**
585  * @brief Perform Kolmogorov-Smirnov test
586  *
587  * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
588  * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
589  * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
590  * underlying distribution functions \f$ F_X, F_Y \f$ are identical, i.e.,
591  * \f$ H_0 : F_X = F_Y \f$.
592  *
593  * @param first Determines whether the value belongs to the first
594  * (if \c TRUE) or the second sample (if \c FALSE)
595  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
596  * @param m Size \f$ m \f$ of the first sample. See usage instructions below.
597  * @param n Size of the second sample. See usage instructions below.
598  *
599  * @return A composite value.
600  * - <tt>statistic FLOAT8</tt> - Kolmogorov–Smirnov statistic
601  * \f[
602  * d = \max_{t \in \mathbb R} |F_x(t) - F_y(t)|
603  * \f]
604  * where \f$ F_x(t) := \frac 1m |\{ i \mid x_i \leq t \}| \f$ and
605  * \f$ F_y \f$ (defined likewise) are the empirical distribution functions.
606  * - <tt>k_statistic FLOAT8</tt> - Kolmogorov statistic
607  * \f$
608  * k = \left( r + 0.12 + \frac{0.11}{r} \right) \cdot d
609  * \f$
610  * where
611  * \f$
612  * r = \sqrt{\frac{m n}{m+n}}.
613  * \f$
614  * Then \f$ k \f$ is approximately Kolmogorov distributed.
615  * - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e., an approximate value
616  * for \f$ \Pr[D \geq d \mid F_X = F_Y] \f$. Computed as
617  * <tt>(1.0 - \ref kolmogorov_cdf "kolmogorov_cdf"(k_statistic))</tt>.
618  *
619  * @usage
620  * - Test null hypothesis that two samples stem from the same distribution:
621  * <pre>SELECT (ks_test(<em>first</em>, <em>value</em>,
622  * (SELECT count(<em>value</em>) FROM <em>source</em> WHERE <em>first</em>),
623  * (SELECT count(<em>value</em>) FROM <em>source</em> WHERE NOT <em>first</em>)
624  * ORDER BY <em>value</em>
625  *)).* FROM <em>source</em></pre>
626  *
627  * @note
628  * This aggregate must be used as an ordered aggregate
629  * (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
630  * not ordered.
631  */
632 m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
633 CREATE
634 m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
635 AGGREGATE MADLIB_SCHEMA.ks_test(
636  /*+ "first" */ BOOLEAN,
637  /*+ "value" */ DOUBLE PRECISION,
638  /*+ m */ BIGINT,
639  /*+ n */ BIGINT
640 ) (
641  SFUNC=MADLIB_SCHEMA.ks_test_transition,
642  STYPE=DOUBLE PRECISION[],
643  FINALFUNC=MADLIB_SCHEMA.ks_test_final,
644  INITCOND='{0,0,0,0,0,0,0}'
645 );
646 !>)
647 
648 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_transition(
649  state DOUBLE PRECISION[],
650  "first" BOOLEAN,
651  "value" DOUBLE PRECISION
652 ) RETURNS DOUBLE PRECISION[]
653 AS 'MODULE_PATHNAME'
654 LANGUAGE C
655 IMMUTABLE
656 STRICT;
657 
658 CREATE TYPE MADLIB_SCHEMA.mw_test_result AS (
659  statistic DOUBLE PRECISION,
660  u_statistic DOUBLE PRECISION,
661  p_value_one_sided DOUBLE PRECISION,
662  p_value_two_sided DOUBLE PRECISION
663 );
664 
665 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_final(
666  state DOUBLE PRECISION[])
667 RETURNS MADLIB_SCHEMA.mw_test_result
668 AS 'MODULE_PATHNAME'
669 LANGUAGE C IMMUTABLE STRICT;
670 
671 /**
672  * @brief Perform Mann-Whitney test
673  *
674  * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
675  * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
676  * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
677  * underlying distributions are equal, i.e.,
678  * \f$ H_0 : \forall i,j: \Pr[X_i > Y_j] + \frac{\Pr[X_i = Y_j]}{2} = \frac 12 \f$.
679  *
680  * @param first Determines whether the value belongs to the first
681  * (if \c TRUE) or the second sample (if \c FALSE)
682  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
683  *
684  * @return A composite value.
685  * - <tt>statistic FLOAT8</tt> - Statistic
686  * \f[
687  * z = \frac{u - \frac{mn}{2}}{\sqrt{\frac{mn(m+n+1)}{12}}}
688  * \f]
689  * where \f$ u \f$ is the u-statistic computed as follows. The z-statistic
690  * is approximately standard normally distributed.
691  * - <tt>u_statistic FLOAT8</tt> - Statistic
692  * \f$ u = \min \{ u_x, u_y \} \f$ where
693  * \f[
694  * u_x = mn + \binom{m+1}{2} - \sum_{i=1}^m r_{x,i}
695  * \f]
696  * where
697  * \f[
698  * r_{x,i}
699  * = |\{ j \mid x_j < x_i \}| + |\{ j \mid y_j < x_i \}| +
700  * \frac{|\{ j \mid x_j = x_i \}| + |\{ j \mid y_j = x_i \}| + 1}{2}
701  * \f]
702  * is defined as the rank of \f$ x_i \f$ in the combined list of all
703  * \f$ m+n \f$ observations. For ties, the average rank of all equal values
704  * is used.
705  * - <tt>p_value_one_sided FLOAT8</tt> - Approximate one-sided p-value, i.e.,
706  * an approximate value for \f$ \Pr[Z \geq z \mid H_0] \f$. Computed as
707  * <tt>(1.0 - \ref normal_cdf "normal_cdf"(statistic))</tt>.
708  * - <tt>p_value_two_sided FLOAT8</tt> - Approximate two-sided p-value, i.e.,
709  * an approximate value for \f$ \Pr[|Z| \geq |z| \mid H_0] \f$. Computed as
710  * <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(statistic)))</tt>.
711  *
712  * @usage
713  * - Test null hypothesis that two samples stem from the same distribution:
714  * <pre>SELECT (mw_test(<em>first</em>, <em>value</em> ORDER BY <em>value</em>)).* FROM <em>source</em></pre>
715  *
716  * @note
717  * This aggregate must be used as an ordered aggregate
718  * (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
719  * not ordered.
720  */
721 m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
722 CREATE
723 m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
724 AGGREGATE MADLIB_SCHEMA.mw_test(
725  /*+ "first" */ BOOLEAN,
726  /*+ "value" */ DOUBLE PRECISION
727 ) (
728  SFUNC=MADLIB_SCHEMA.mw_test_transition,
729  STYPE=DOUBLE PRECISION[],
730  FINALFUNC=MADLIB_SCHEMA.mw_test_final,
731  INITCOND='{0,0,0,0,0,0,0}'
732 );
733 !>)
734 
735 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
736  state DOUBLE PRECISION[],
737  value DOUBLE PRECISION,
738  "precision" DOUBLE PRECISION
739 ) RETURNS DOUBLE PRECISION[]
740 AS 'MODULE_PATHNAME'
741 LANGUAGE C
742 IMMUTABLE
743 STRICT;
745 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
746  state DOUBLE PRECISION[],
747  value DOUBLE PRECISION
748 ) RETURNS DOUBLE PRECISION[]
749 AS 'MODULE_PATHNAME'
750 LANGUAGE C
751 IMMUTABLE
752 STRICT;
753 
754 
755 CREATE TYPE MADLIB_SCHEMA.wsr_test_result AS (
756  statistic DOUBLE PRECISION,
757  rank_sum_pos FLOAT8,
758  rank_sum_neg FLOAT8,
759  num BIGINT,
760  z_statistic DOUBLE PRECISION,
761  p_value_one_sided DOUBLE PRECISION,
762  p_value_two_sided DOUBLE PRECISION
763 );
764 
765 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_final(
766  state DOUBLE PRECISION[])
767 RETURNS MADLIB_SCHEMA.wsr_test_result
768 AS 'MODULE_PATHNAME'
769 LANGUAGE C IMMUTABLE STRICT;
770 
771 /**
772  * @brief Perform Wilcoxon signed-rank test
773  *
774  * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
775  * \f$ X_1, \dots, X_n \f$ with unknown mean \f$ \mu \f$, test the null
776  * hypotheses \f$ H_0 : \mu \leq 0 \f$ and \f$ H_0 : \mu = 0 \f$.
777  *
778  * @param value Value of random variate \f$ x_i \f$. Values of 0
779  * are ignored (i.e., they do not count towards \f$ n \f$).
780  * @param precision The precision \f$ \epsilon_i \f$ with which value is known.
781  * The precision determines the handling of ties. The current value
782  * \f$ v_i \f$ is regarded as a tie with the previous value \f$ v_{i-1} \f$ if
783  * \f$ v_i - \epsilon_i \leq \max_{j=1, \dots, i-1} v_j + \epsilon_j \f$.
784  * If \c precision is negative, then it will be treated as
785  * <tt>value * 2^(-52)</tt>. (Note that \f$ 2^{-52} \f$ is the machine
786  * epsilon for type <tt>DOUBLE PRECISION</tt>.)
787  *
 * @return A composite value:
 * - <tt>statistic FLOAT8</tt> - statistic computed as follows. Let
 *   \f$
 *       w^+ = \sum_{i \mid x_i > 0} r_i
 *   \f$
 *   and
 *   \f$
 *       w^- = \sum_{i \mid x_i < 0} r_i
 *   \f$
 *   be the <em>signed rank sums</em> where
 *   \f[
 *       r_i
 *       = \left| \{ j \mid |x_j| < |x_i| \} \right|
 *       + \frac{\left| \{ j \mid |x_j| = |x_i| \} \right| + 1}{2}.
 *   \f]
 *   The Wilcoxon signed-rank statistic is \f$ w = \min \{ w^+, w^- \} \f$.
 * - <tt>rank_sum_pos FLOAT8</tt> - rank sum of all positive values, i.e., \f$ w^+ \f$
 * - <tt>rank_sum_neg FLOAT8</tt> - rank sum of all negative values, i.e., \f$ w^- \f$
 * - <tt>num BIGINT</tt> - number \f$ n \f$ of non-zero values
 * - <tt>z_statistic FLOAT8</tt> - z-statistic
 *   \f[
 *       z = \frac{w^+ - \frac{n(n+1)}{4}}
 *                {\sqrt{\frac{n(n+1)(2n+1)}{24}
 *                 - \sum_{i=1}^n \frac{t_i^2 - 1}{48}}}
 *   \f]
 *   where \f$ t_i \f$ is the number of
 *   values with absolute value equal to \f$ |x_i| \f$. The corresponding
 *   random variable is approximately standard normally distributed.
 * - <tt>p_value_one_sided FLOAT8</tt> - One-sided p-value, i.e.,
 *   \f$ \Pr[Z \geq z \mid \mu \leq 0] \f$. Computed as
 *   <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>.
 * - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
 *   \f$ \Pr[ |Z| \geq |z| \mid \mu = 0] \f$. Computed as
 *   <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>.
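 *
 * @par Worked example
 * A small illustration of the formulas above, using hypothetical data chosen
 * only for this sketch: for the sample \f$ (1.5, -2, 3, -0.5) \f$ there are
 * no ties, so the absolute values \f$ 0.5 < 1.5 < 2 < 3 \f$ receive the ranks
 * \f$ 1, 2, 3, 4 \f$ and every \f$ t_i = 1 \f$. Hence
 * \f[
 *     w^+ = 2 + 4 = 6, \quad w^- = 1 + 3 = 4, \quad
 *     w = \min \{ 6, 4 \} = 4, \quad n = 4,
 * \f]
 * \f[
 *     z = \frac{6 - \frac{4 \cdot 5}{4}}
 *              {\sqrt{\frac{4 \cdot 5 \cdot 9}{24} - 0}}
 *       = \frac{1}{\sqrt{7.5}} \approx 0.365.
 * \f]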
 *
 * @usage
 * - One-sample test: Test the null hypothesis that the mean of a sample is at
 *   most (or equal to, respectively) \f$ \mu_0 \f$:
 *   <pre>SELECT (wsr_test(<em>value</em> - <em>mu_0</em> ORDER BY abs(<em>value</em> - <em>mu_0</em>))).* FROM <em>source</em></pre>
 * - Dependent paired test: Test the null hypothesis that the mean difference
 *   between the first and second value in a pair is at most (or equal to,
 *   respectively) \f$ \mu_0 \f$:
 *   <pre>SELECT (wsr_test(<em>first</em> - <em>second</em> - <em>mu_0</em> ORDER BY abs(<em>first</em> - <em>second</em> - <em>mu_0</em>))).* FROM <em>source</em></pre>
 *   If correctly determining ties is important (e.g., you may want to do so
 *   when comparing to software products that take \c first, \c second,
 *   and \c mu_0 as individual parameters), supply the precision parameter.
 *   This can be done as follows:
 *   <pre>SELECT (wsr_test(
    <em>first</em> - <em>second</em> - <em>mu_0</em>,
    3 * 2^(-52) * greatest(<em>first</em>, <em>second</em>, <em>mu_0</em>)
    ORDER BY abs(<em>first</em> - <em>second</em> - <em>mu_0</em>)
)).* FROM <em>source</em></pre>
 *   Here \f$ 2^{-52} \f$ is the machine epsilon, which we scale to the
 *   magnitude of the input data and multiply by 3 because we have a sum with
 *   three terms.
 *
 * @note
 *     This aggregate must be used as an ordered aggregate
 *     (<tt>ORDER BY abs(<em>value</em>)</tt>, where <em>value</em> is the
 *     first argument passed to the aggregate) and will raise an exception if
 *     the absolute values are not ordered.
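 *
 * @par Example
 * A minimal sketch, assuming a platform with ordered aggregates and that the
 * MADlib schema is on the search path. The inline values and the names
 * \c t and \c x are hypothetical and chosen only for illustration. The four
 * inputs have distinct absolute values, so no ties arise and the precision
 * argument is omitted:
 * <pre>SELECT (wsr_test(x::DOUBLE PRECISION ORDER BY abs(x))).*
FROM (VALUES (1.5), (-2.0), (3.0), (-0.5)) AS t(x);</pre>
 * The <tt>ORDER BY</tt> expression is the absolute value of exactly the
 * expression that is passed as the first argument.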
 */
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.wsr_test(
    /*+ "value" */ DOUBLE PRECISION,
    /*+ "precision" */ DOUBLE PRECISION /*+ DEFAULT -1 */
) (
    SFUNC=MADLIB_SCHEMA.wsr_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
    INITCOND='{0,0,0,0,0,0,0,0,0}'
);
!>)

m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.wsr_test(
    /*+ value */ DOUBLE PRECISION
) (
    SFUNC=MADLIB_SCHEMA.wsr_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
    INITCOND='{0,0,0,0,0,0,0,0,0}'
);
!>)

CREATE TYPE MADLIB_SCHEMA.one_way_anova_result AS (
    sum_squares_between DOUBLE PRECISION,
    sum_squares_within DOUBLE PRECISION,
    df_between BIGINT,
    df_within BIGINT,
    mean_squares_between DOUBLE PRECISION,
    mean_squares_within DOUBLE PRECISION,
    statistic DOUBLE PRECISION,
    p_value DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_transition(
    state DOUBLE PRECISION[],
    "group" INTEGER,
    value DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.one_way_anova_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform one-way analysis of variance
 *
 * Given realizations
 * \f$ x_{1,1}, \dots, x_{1, n_1}, x_{2,1}, \dots, x_{2,n_2}, \dots, x_{k,n_k} \f$
 * of independent random variables \f$ X_{i,j} \sim N(\mu_i, \sigma^2) \f$ with
 * unknown parameters \f$ \mu_1, \dots, \mu_k \f$ and \f$ \sigma^2 \f$, test the
 * null hypothesis \f$ H_0 : \mu_1 = \dots = \mu_k \f$.
 *
 * @param group Group that \c value belongs to. Note that \c group can assume
 *     arbitrary values and is not limited to a contiguous range of integers.
 * @param value Value of random variate \f$ x_{i,j} \f$
 *
 * @return A composite value as follows. Let \f$ n := \sum_{i=1}^k n_i \f$ be
 *     the total size of all samples. Denote by \f$ \bar x \f$ the grand
 *     \ref sample_mean "mean", by \f$ \overline{x_i} \f$ the group
 *     \ref sample_mean "sample means", and by \f$ s_i^2 \f$ the group
 *     \ref sample_variance "sample variances".
 * - <tt>sum_squares_between DOUBLE PRECISION</tt> - sum of squares between the
 *   group means, i.e.,
 *   \f$
 *       \mathit{SS}_b = \sum_{i=1}^k n_i (\overline{x_i} - \bar x)^2.
 *   \f$
 * - <tt>sum_squares_within DOUBLE PRECISION</tt> - sum of squares within the
 *   groups, i.e.,
 *   \f$
 *       \mathit{SS}_w = \sum_{i=1}^k (n_i - 1) s_i^2.
 *   \f$
 * - <tt>df_between BIGINT</tt> - degrees of freedom for between-group variation \f$ (k-1) \f$
 * - <tt>df_within BIGINT</tt> - degrees of freedom for within-group variation \f$ (n-k) \f$
 * - <tt>mean_squares_between DOUBLE PRECISION</tt> - mean square between
 *   groups, i.e.,
 *   \f$
 *       s_b^2 := \frac{\mathit{SS}_b}{k-1}
 *   \f$
 * - <tt>mean_squares_within DOUBLE PRECISION</tt> - mean square within
 *   groups, i.e.,
 *   \f$
 *       s_w^2 := \frac{\mathit{SS}_w}{n-k}
 *   \f$
 * - <tt>statistic DOUBLE PRECISION</tt> - Statistic computed as
 *   \f[
 *       f = \frac{s_b^2}{s_w^2}.
 *   \f]
 *   Under \f$ H_0 \f$, this statistic is Fisher F-distributed with
 *   \f$ (k-1) \f$ degrees of freedom in the numerator and \f$ (n-k) \f$
 *   degrees of freedom in the denominator.
 * - <tt>p_value DOUBLE PRECISION</tt> - p-value, i.e.,
 *   \f$ \Pr[ F \geq f \mid H_0] \f$.
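 *
 * @par Worked example
 * A small illustration of the formulas above, using hypothetical data chosen
 * only for this sketch: three groups \f$ (1, 2, 3) \f$, \f$ (2, 3, 4) \f$, and
 * \f$ (6, 7, 8) \f$, so \f$ k = 3 \f$, \f$ n_i = 3 \f$, and \f$ n = 9 \f$. The
 * group means are \f$ 2, 3, 7 \f$, the grand mean is \f$ \bar x = 4 \f$, and
 * every group sample variance is \f$ s_i^2 = 1 \f$. Hence
 * \f[
 *     \mathit{SS}_b = 3 \left( (2-4)^2 + (3-4)^2 + (7-4)^2 \right) = 42, \quad
 *     \mathit{SS}_w = 3 \cdot (3 - 1) \cdot 1 = 6,
 * \f]
 * \f[
 *     s_b^2 = \frac{42}{3 - 1} = 21, \quad
 *     s_w^2 = \frac{6}{9 - 3} = 1, \quad
 *     f = \frac{21}{1} = 21.
 * \f]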
 *
 * @usage
 * - Test the null hypothesis that the means of all samples are equal:
 *   <pre>SELECT (one_way_anova(<em>group</em>, <em>value</em>)).* FROM <em>source</em></pre>
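 *
 * @par Example
 * A minimal sketch, assuming that the MADlib schema is on the search path.
 * The inline values and the names \c t, \c grp, and \c val are hypothetical
 * and chosen only for illustration. The group identifiers 1, 2, and 7 are
 * deliberately non-contiguous, which the \c group parameter permits:
 * <pre>SELECT (one_way_anova(grp, val::DOUBLE PRECISION)).*
FROM (VALUES (1, 1.0), (1, 2.0), (1, 3.0),
             (2, 2.0), (2, 3.0), (2, 4.0),
             (7, 6.0), (7, 7.0), (7, 8.0)) AS t(grp, val);</pre>
 * The query returns a single row whose columns are the fields of
 * <tt>one_way_anova_result</tt>.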
 */
CREATE AGGREGATE MADLIB_SCHEMA.one_way_anova(
    /*+ group */ INTEGER,
    /*+ value */ DOUBLE PRECISION) (

    SFUNC=MADLIB_SCHEMA.one_way_anova_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.one_way_anova_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.one_way_anova_merge_states,!>)
    INITCOND='{0,0}'
);

m4_changequote(<!`!>,<!'!>)