hypothesis_tests.sql_in
00001 /* ----------------------------------------------------------------------- */
00002 /**
00003  *
00004  * @file hypothesis_tests.sql_in
00005  *
00006  * @brief SQL functions for statistical hypothesis tests
00007  *
00008  * @sa For an overview of hypothesis-test functions, see the module
00009  *     description \ref grp_stats_tests.
00010  *
00011  */
00012  /* ----------------------------------------------------------------------- */
00013 
00014 m4_include(`SQLCommon.m4')
00015 m4_changequote(<!,!>)
00016 
00017 /**
00018 @addtogroup grp_stats_tests
00019 
00020 @about
00021 
00022 Hypothesis tests are used to decide whether to reject a <em>“null” hypothesis</em>
00023 \f$ H_0 \f$ about the distribution of random variables, given realizations of
00024 these random variables. Since in general it is not possible to make statements
00025 with certainty, one is interested in the probability \f$ p \f$ of seeing random
00026 variates at least as extreme as the ones observed, assuming that \f$ H_0 \f$ is
00027 true. If this probability \f$ p \f$ is small, \f$ H_0 \f$ will be rejected by
00028 the test with <em>significance level</em> \f$ p \f$. Falsifying \f$ H_0 \f$ is
00029 the canonical goal when employing a hypothesis test. That is, hypothesis tests
00030 are typically used to substantiate that the <em>alternative
00031 hypothesis</em> \f$ H_1 \f$ is true instead.
00032 
00033 Hypothesis tests may be divided into parametric and non-parametric tests. A
00034 parametric test assumes certain distributions and makes inferences about
00035 parameters of the distributions (e.g., the mean of a normal distribution).
00036 Formally, there is a given domain of possible parameters \f$ \Gamma \f$ and the
00037 null hypothesis \f$ H_0 \f$ is the event that the true parameter
00038 \f$ \gamma_0 \in \Gamma_0 \f$, where \f$ \Gamma_0 \subsetneq \Gamma \f$.
00039 Non-parametric tests, on the other hand, do not assume any particular
00040 distribution of the sample (e.g., a non-parametric test may simply test if two
00041 distributions are similar).
00042 
00043 The first step of a hypothesis test is to compute a <em>test statistic</em>,
00044 which is a function of the random variates, i.e., a random variate itself.
00045 A hypothesis test relies on the distribution of the test statistic being
00046 (approximately) known. Now, the \f$ p \f$-value is the probability of seeing a
00047 test statistic at least as extreme as the one observed, assuming that
00048 \f$ H_0 \f$ is true. In a case where the null hypothesis corresponds to a family
00049 of distributions (e.g., in a parametric test where \f$ \Gamma_0 \f$ is not a
00050 singleton set), the \f$ p \f$-value is the supremum, over all possible
00051 distributions according to the null hypothesis, of these probabilities.
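
In symbols, for a test that rejects \f$ H_0 \f$ for large values of a test
statistic \f$ T \f$ with observed value \f$ t \f$ (a sketch; the names
\f$ T \f$, \f$ t \f$, and \f$ \Pr_\gamma \f$, the probability when the true
parameter is \f$ \gamma \f$, are notation introduced here for illustration), the
definition above reads
\f[
    p = \sup_{\gamma \in \Gamma_0} \Pr_\gamma[T \geq t]
\f]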
00052 
00053 @input
00054 
00055 Input data is assumed to be normalized with all values stored row-wise. In
00056 general, the following inputs are expected.
00057 
00058 One-sample tests expect the following form:
00059 <pre>{TABLE|VIEW} <em>source</em> (
00060     ...
00061     <em>value</em> DOUBLE PRECISION
00062     ...
00063 )</pre>
00064 
00065 Two-sample tests expect the following form:
00066 <pre>{TABLE|VIEW} <em>source</em> (
00067     ...
00068     <em>first</em> BOOLEAN,
00069     <em>value</em> DOUBLE PRECISION
00070     ...
00071 )</pre>
00072 Here, \c first indicates whether a value is from the first (if \c TRUE) or the
00073 second sample (if \c FALSE).
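
As a minimal sketch, a source for a two-sample test could be set up as follows.
The table name \c two_sample_example and the values are hypothetical and serve
only as an illustration:
<pre>CREATE TABLE two_sample_example (
    "first" BOOLEAN,
    "value" DOUBLE PRECISION
);
INSERT INTO two_sample_example VALUES
    (TRUE, 5.1), (TRUE, 4.8), (TRUE, 5.6),
    (FALSE, 4.2), (FALSE, 4.9), (FALSE, 4.4);</pre>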
00074 
00075 Many-sample tests expect the following form:
00076 <pre>{TABLE|VIEW} <em>source</em> (
00077     ...
00078     <em>group</em> INTEGER,
00079     <em>value</em> DOUBLE PRECISION
00080     ...
00081 )</pre>
00082 
00083 @usage
00084 
00085 All tests are implemented as aggregate functions. The non-parametric
00086 (rank-based) tests are implemented as ordered aggregate functions and thus
00087 necessitate an <tt>ORDER BY</tt> clause. In the following, the simplest
00088 forms of usage are given. Specific function signatures, as described in
00089 \ref hypothesis_tests.sql_in, may ask for more arguments or for a different
00090 <tt>ORDER BY</tt> clause.
00091 
00092 - Run a parametric one-sample test:
00093   <pre>SELECT <em>test</em>(<em>value</em>) FROM <em>source</em></pre>
00094 - Run a parametric two-sample test:
00095   <pre>SELECT <em>test</em>(<em>first</em>, <em>value</em>) FROM <em>source</em></pre>
00096 - Run a non-parametric one-sample test:
00097   <pre>SELECT <em>test</em>(<em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>
00098 - Run a non-parametric two-sample test:
00099   <pre>SELECT <em>test</em>(<em>first</em>, <em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>
00100 
00101 @examp
00102 
00103 See \ref hypothesis_tests.sql_in for examples for each of the aggregate
00104 functions.
00105 
00106 @literature
00107 
00108 [1] M. Hollander, D. Wolfe: <em>Nonparametric Statistical Methods</em>,
00109     2nd edition, Wiley, 1999
00110 
00111 [2] E. Lehmann, J. Romano: <em>Testing Statistical Hypotheses</em>, 3rd edition,
00112     Springer, 2005
00113 
00114 @sa File hypothesis_tests.sql_in documenting the SQL functions.
00115 */
00116 
00117 CREATE TYPE MADLIB_SCHEMA.t_test_result AS (
00118     statistic DOUBLE PRECISION,
00119     df DOUBLE PRECISION,
00120     p_value_one_sided DOUBLE PRECISION,
00121     p_value_two_sided DOUBLE PRECISION
00122 );
00123 
00124 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_transition(
00125     state DOUBLE PRECISION[],
00126     value DOUBLE PRECISION
00127 ) RETURNS DOUBLE PRECISION[]
00128 AS 'MODULE_PATHNAME'
00129 LANGUAGE C
00130 IMMUTABLE
00131 STRICT;
00132 
00133 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_merge_states(
00134     state1 DOUBLE PRECISION[],
00135     state2 DOUBLE PRECISION[])
00136 RETURNS DOUBLE PRECISION[]
00137 AS 'MODULE_PATHNAME'
00138 LANGUAGE C
00139 IMMUTABLE STRICT;
00140 
00141 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_final(
00142     state DOUBLE PRECISION[])
00143 RETURNS MADLIB_SCHEMA.t_test_result
00144 AS 'MODULE_PATHNAME'
00145 LANGUAGE C IMMUTABLE STRICT;
00146 
00147 CREATE TYPE MADLIB_SCHEMA.f_test_result AS (
00148     statistic DOUBLE PRECISION,
00149     df1 DOUBLE PRECISION,
00150     df2 DOUBLE PRECISION,
00151     p_value_one_sided DOUBLE PRECISION,
00152     p_value_two_sided DOUBLE PRECISION
00153 );
00154 
00155 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.f_test_final(
00156     state DOUBLE PRECISION[])
00157 RETURNS MADLIB_SCHEMA.f_test_result
00158 AS 'MODULE_PATHNAME'
00159 LANGUAGE C IMMUTABLE STRICT;
00160 
00161 
00162 /**
00163  * @brief Perform one-sample or dependent paired Student t-test
00164  *
00165  * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
00166  * \f$ X_1, \dots, X_n \sim N(\mu, \sigma^2) \f$ with unknown parameters \f$ \mu \f$ and
00167  * \f$ \sigma^2 \f$, test the null hypotheses \f$ H_0 : \mu \leq 0 \f$ and
00168  * \f$ H_0 : \mu = 0 \f$.
00169  *
00170  * @param value Value of random variate \f$ x_i \f$
00171  *
00172  * @return A composite value as follows. We denote by \f$ \bar x \f$ the
00173  *     \ref sample_mean "sample mean" and by \f$ s^2 \f$ the
00174  *     \ref sample_variance "sample variance".
00175  *  - <tt>statistic FLOAT8</tt> - Statistic
00176  *    \f[
00177  *        t = \frac{\sqrt n \cdot \bar x}{s}
00178  *    \f]
00179  *    The corresponding random
00180  *    variable is Student-t distributed with
00181  *    \f$ (n - 1) \f$ degrees of freedom.
00182  *  - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n - 1) \f$
00183  *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
00184  *    In detail, the result is \f$ \Pr[\bar X \geq \bar x \mid \mu = 0] \f$,
00185  *    which is a lower bound on
00186  *    \f$ \Pr[\bar X \geq \bar x \mid \mu \leq 0] \f$. Computed as
00187  *    <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
00188  *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
00189  *    \f$ \Pr[ |\bar X| \geq |\bar x| \mid \mu = 0] \f$. Computed as
00190  *    <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
00191  *
00192  * @usage
00193  *  - One-sample t-test: Test null hypothesis that the mean of a sample is at
00194  *    most (or equal to, respectively) \f$ \mu_0 \f$:
00195  *    <pre>SELECT (t_test_one(<em>value</em> - <em>mu_0</em>)).* FROM <em>source</em></pre>
00196  *  - Dependent paired t-test: Test null hypothesis that the mean difference
00197  *    between the first and second value in each pair is at most (or equal to,
00198  *    respectively) \f$ \mu_0 \f$:
00199  *    <pre>SELECT (t_test_one(<em>first</em> - <em>second</em> - <em>mu_0</em>)).*
00200  *               FROM <em>source</em></pre>
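 *
 * As a sketch of the one-sample usage above, the following assumes that the
 * library is installed in a schema named \c madlib and tests against the
 * hypothetical value \f$ \mu_0 = 5 \f$ on inline data; the schema name, the
 * data, and \f$ \mu_0 \f$ are illustrative assumptions. The second query
 * recomputes \c statistic and \c df from the formulas above using only
 * built-in aggregates and may serve as a cross-check:
 * <pre>SELECT (madlib.t_test_one(value - 5.0)).*
 *FROM (VALUES (5.1::DOUBLE PRECISION), (4.8), (5.6), (5.3), (4.9)) AS source(value);
 *
 *-- t = sqrt(n) * mean / s, computed on the shifted values (value - mu_0)
 *SELECT sqrt(count(value)::DOUBLE PRECISION)
 *           * avg(value - 5.0) / stddev_samp(value - 5.0) AS statistic,
 *       count(value) - 1 AS df
 *FROM (VALUES (5.1::DOUBLE PRECISION), (4.8), (5.6), (5.3), (4.9)) AS source(value);</pre>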
00201  */
00202 CREATE AGGREGATE MADLIB_SCHEMA.t_test_one(
00203     /*+ value */ DOUBLE PRECISION) (
00204 
00205     SFUNC=MADLIB_SCHEMA.t_test_one_transition,
00206     STYPE=DOUBLE PRECISION[],
00207     FINALFUNC=MADLIB_SCHEMA.t_test_one_final,
00208     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
00209     INITCOND='{0,0,0,0,0,0,0}'
00210 );
00211 
00212 
00213 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_transition(
00214     state DOUBLE PRECISION[],
00215     "first" BOOLEAN,
00216     "value" DOUBLE PRECISION)
00217 RETURNS DOUBLE PRECISION[]
00218 AS 'MODULE_PATHNAME'
00219 LANGUAGE C
00220 IMMUTABLE
00221 STRICT;
00222 
00223 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_pooled_final(
00224     state DOUBLE PRECISION[])
00225 RETURNS MADLIB_SCHEMA.t_test_result
00226 AS 'MODULE_PATHNAME'
00227 LANGUAGE C IMMUTABLE STRICT;
00228 
00229 /**
00230  * @brief Perform two-sample pooled (i.e., equal variances) Student t-test
00231  *
00232  * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
00233  * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma^2) \f$ and
00234  * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters
00235  * \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses
00236  * \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
00237  *
00238  * @param first Indicator whether \c value is from first sample
00239  *     \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
00240  *     \f$ y_1, \dots, y_m \f$ (if \c FALSE)
00241  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
00242  *
00243  * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
00244  *     the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
00245  *     \ref sample_variance "sample variances".
00246  *  - <tt>statistic FLOAT8</tt> - Statistic
00247  *    \f[
00248  *        t = \frac{\bar x - \bar y}{s_p \sqrt{1/n + 1/m}}
00249  *    \f]
00250  *    where
00251  *    \f[
00252  *        s_p^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2
00253  *                         + \sum_{i=1}^m (y_i - \bar y)^2}
00254  *                     {n + m - 2}
00255  *    \f]
00256  *    is the <em>pooled variance</em>.
00257  *    The corresponding random
00258  *    variable is Student-t distributed with
00259  *    \f$ (n + m - 2) \f$ degrees of freedom.
00260  *  - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n + m - 2) \f$
00261  *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
00262  *    In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
00263  *    which is a lower bound on
00264  *    \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
00265  *    Computed as
00266  *    <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
00267  *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
00268  *    \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
00269  *    Computed as
00270  *    <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
00271  *
00272  * @usage
00273  *  - Two-sample pooled t-test: Test null hypothesis that the mean of the first
00274  *    sample is at most (or equal to, respectively) the mean of the second
00275  *    sample:
00276  *    <pre>SELECT (t_test_two_pooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
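 *
 * The pooled statistic can be cross-checked with built-in aggregates. The
 * following is a sketch on hypothetical inline data, not the library's
 * implementation; the first query additionally assumes that the library is
 * installed in a schema named \c madlib:
 * <pre>SELECT (madlib.t_test_two_pooled("first", value)).*
 *FROM (VALUES (TRUE, 5.1::DOUBLE PRECISION), (TRUE, 4.8), (TRUE, 5.6),
 *             (FALSE, 4.2), (FALSE, 4.9), (FALSE, 4.4)) AS source("first", value);
 *
 *-- s_p^2 = ((n - 1) s_X^2 + (m - 1) s_Y^2) / (n + m - 2)
 *SELECT (mean_x - mean_y)
 *           / sqrt(((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)
 *                  * (1 / n + 1 / m)) AS statistic,
 *       n + m - 2 AS df
 *FROM (
 *    SELECT count(CASE WHEN "first" THEN value END)::DOUBLE PRECISION AS n,
 *           count(CASE WHEN NOT "first" THEN value END)::DOUBLE PRECISION AS m,
 *           avg(CASE WHEN "first" THEN value END) AS mean_x,
 *           avg(CASE WHEN NOT "first" THEN value END) AS mean_y,
 *           var_samp(CASE WHEN "first" THEN value END) AS var_x,
 *           var_samp(CASE WHEN NOT "first" THEN value END) AS var_y
 *    FROM (VALUES (TRUE, 5.1::DOUBLE PRECISION), (TRUE, 4.8), (TRUE, 5.6),
 *                 (FALSE, 4.2), (FALSE, 4.9), (FALSE, 4.4)) AS source("first", value)
 *) s;</pre>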
00277  */
00278 CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_pooled(
00279     /*+ "first" */ BOOLEAN,
00280     /*+ "value" */ DOUBLE PRECISION) (
00281 
00282     SFUNC=MADLIB_SCHEMA.t_test_two_transition,
00283     STYPE=DOUBLE PRECISION[],
00284     FINALFUNC=MADLIB_SCHEMA.t_test_two_pooled_final,
00285     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
00286     INITCOND='{0,0,0,0,0,0,0}'
00287 );
00288 
00289 
00290 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_unpooled_final(
00291     state DOUBLE PRECISION[])
00292 RETURNS MADLIB_SCHEMA.t_test_result
00293 AS 'MODULE_PATHNAME'
00294 LANGUAGE C IMMUTABLE STRICT;
00295 
00296 /**
00297  * @brief Perform unpooled (i.e., unequal variances) t-test (also known as
00298  *     Welch's t-test)
00299  *
00300  * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
00301  * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma_X^2) \f$ and
00302  * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
00303  * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null
00304  * hypotheses \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
00305  *
00306  * @param first Indicator whether \c value is from first sample
00307  *     \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
00308  *     \f$ y_1, \dots, y_m \f$ (if \c FALSE)
00309  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
00310  *
00311  * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
00312  *     the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
00313  *     \ref sample_variance "sample variances".
00314  *  - <tt>statistic FLOAT8</tt> - Statistic
00315  *    \f[
00316  *        t = \frac{\bar x - \bar y}{\sqrt{s_X^2/n + s_Y^2/m}}
00317  *    \f]
00318  *    The corresponding random variable is approximately Student-t distributed
00319  *    with
00320  *    \f[
00321  *        \frac{(s_X^2 / n + s_Y^2 / m)^2}{(s_X^2 / n)^2/(n-1) + (s_Y^2 / m)^2/(m-1)}
00322  *    \f]
00323  *    degrees of freedom (Welch–Satterthwaite formula).
00324  *  - <tt>df FLOAT8</tt> - Degrees of freedom (as above)
00325  *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
00326  *    In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
00327  *    which is a lower bound on
00328  *    \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
00329  *    Computed as
00330  *    <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
00331  *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
00332  *    \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
00333  *    Computed as
00334  *    <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
00335  *
00336  * @usage
00337  *  - Two-sample unpooled t-test: Test null hypothesis that the mean of the
00338  *    first sample is at most (or equal to, respectively) the mean of the second
00339  *    sample:
00340  *    <pre>SELECT (t_test_two_unpooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
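 *
 * The unpooled statistic and the Welch–Satterthwaite degrees of freedom can be
 * cross-checked with built-in aggregates. The following is a sketch on
 * hypothetical inline data, not the library's implementation:
 * <pre>SELECT (mean_x - mean_y) / sqrt(var_x / n + var_y / m) AS statistic,
 *       (var_x / n + var_y / m) ^ 2
 *           / ((var_x / n) ^ 2 / (n - 1) + (var_y / m) ^ 2 / (m - 1)) AS df
 *FROM (
 *    SELECT count(CASE WHEN "first" THEN value END)::DOUBLE PRECISION AS n,
 *           count(CASE WHEN NOT "first" THEN value END)::DOUBLE PRECISION AS m,
 *           avg(CASE WHEN "first" THEN value END) AS mean_x,
 *           avg(CASE WHEN NOT "first" THEN value END) AS mean_y,
 *           var_samp(CASE WHEN "first" THEN value END) AS var_x,
 *           var_samp(CASE WHEN NOT "first" THEN value END) AS var_y
 *    FROM (VALUES (TRUE, 5.1::DOUBLE PRECISION), (TRUE, 4.8), (TRUE, 5.6),
 *                 (FALSE, 4.2), (FALSE, 4.9), (FALSE, 4.4)) AS source("first", value)
 *) s;</pre>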
00341  */
00342 CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_unpooled(
00343     /*+ "first" */ BOOLEAN,
00344     /*+ "value" */ DOUBLE PRECISION) (
00345 
00346     SFUNC=MADLIB_SCHEMA.t_test_two_transition,
00347     STYPE=DOUBLE PRECISION[],
00348     FINALFUNC=MADLIB_SCHEMA.t_test_two_unpooled_final,
00349     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
00350     INITCOND='{0,0,0,0,0,0,0}'
00351 );
00352 
00353 /**
00354  * @brief Perform Fisher F-test
00355  *
00356  * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
00357  * i.i.d. random variables \f$ X_1, \dots, X_m \sim N(\mu_X, \sigma_X^2) \f$ and
00358  * \f$ Y_1, \dots, Y_n \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
00359  * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null
00360  * hypotheses \f$ H_0 : \sigma_X \leq \sigma_Y \f$ and \f$ H_0 : \sigma_X = \sigma_Y \f$.
00361  *
00362  * @param first Indicator whether \c value is from first sample
00363  *     \f$ x_1, \dots, x_m \f$ (if \c TRUE) or from second sample
00364  *     \f$ y_1, \dots, y_n \f$ (if \c FALSE)
00365  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
00366  *
00367  * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
00368  *     the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
00369  *     \ref sample_variance "sample variances".
00370  *  - <tt>statistic FLOAT8</tt> - Statistic
00371  *    \f[
00372  *        f = \frac{s_Y^2}{s_X^2}
00373  *    \f]
00374  *    The corresponding random
00375  *    variable is F-distributed with
00376  *    \f$ (n - 1) \f$ degrees of freedom in the numerator and
00377  *    \f$ (m - 1) \f$ degrees of freedom in the denominator.
00378  *  - <tt>df1 FLOAT8</tt> - Degrees of freedom in the numerator \f$ (n - 1) \f$
00379  *  - <tt>df2 FLOAT8</tt> - Degrees of freedom in the denominator \f$ (m - 1) \f$
00380  *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
00381  *    In detail, the result is \f$ \Pr[F \geq f \mid \sigma_X = \sigma_Y] \f$,
00382  *    which is a lower bound on
00383  *    \f$ \Pr[F \geq f \mid \sigma_X \leq \sigma_Y] \f$. Computed as
00384  *    <tt>(1.0 - \ref fisher_f_cdf "fisher_f_cdf"(statistic))</tt>.
00385  *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
00386  *    \f$ 2 \cdot \min \{ p, 1 - p \} \f$ where
00387  *    \f$ p = \Pr[ F \geq f \mid \sigma_X = \sigma_Y] \f$. Computed as
00388  *    <tt>(2 * min(p_value_one_sided, 1. - p_value_one_sided))</tt>.
00389  *
00390  * @usage
00391  *  - Test null hypothesis that the variance of the first sample is at most (or
00392  *    equal to, respectively) the variance of the second sample:
00393  *    <pre>SELECT (f_test(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
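 *
 * The statistic and the degrees of freedom as defined above can be
 * cross-checked with built-in aggregates. The following is a sketch on
 * hypothetical inline data, not the library's implementation; as above, the
 * first sample has \f$ m \f$ and the second sample \f$ n \f$ elements:
 * <pre>SELECT var_y / var_x AS statistic, n - 1 AS df1, m - 1 AS df2
 *FROM (
 *    SELECT count(CASE WHEN "first" THEN value END) AS m,
 *           count(CASE WHEN NOT "first" THEN value END) AS n,
 *           var_samp(CASE WHEN "first" THEN value END) AS var_x,
 *           var_samp(CASE WHEN NOT "first" THEN value END) AS var_y
 *    FROM (VALUES (TRUE, 5.1::DOUBLE PRECISION), (TRUE, 4.8), (TRUE, 5.6),
 *                 (FALSE, 4.2), (FALSE, 4.9), (FALSE, 4.4)) AS source("first", value)
 *) s;</pre>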
00394  *
00395  * @internal We reuse the two-sample t-test transition and merge functions.
00396  */
00397 CREATE AGGREGATE MADLIB_SCHEMA.f_test(
00398     /*+ "first" */ BOOLEAN,
00399     /*+ "value" */ DOUBLE PRECISION) (
00400 
00401     SFUNC=MADLIB_SCHEMA.t_test_two_transition,
00402     STYPE=DOUBLE PRECISION[],
00403     FINALFUNC=MADLIB_SCHEMA.f_test_final,
00404     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
00405     INITCOND='{0,0,0,0,0,0,0}'
00406 );
00407 
00408 
00409 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
00410     state DOUBLE PRECISION[],
00411     observed BIGINT,
00412     expected DOUBLE PRECISION,
00413     df BIGINT
00414 ) RETURNS DOUBLE PRECISION[]
00415 AS 'MODULE_PATHNAME'
00416 LANGUAGE C
00417 IMMUTABLE
00418 STRICT;
00419 
00420 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
00421     state DOUBLE PRECISION[],
00422     observed BIGINT,
00423     expected DOUBLE PRECISION
00424 ) RETURNS DOUBLE PRECISION[]
00425 AS 'MODULE_PATHNAME'
00426 LANGUAGE C
00427 IMMUTABLE
00428 STRICT;
00429 
00430 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
00431     state DOUBLE PRECISION[],
00432     observed BIGINT
00433 ) RETURNS DOUBLE PRECISION[]
00434 AS 'MODULE_PATHNAME'
00435 LANGUAGE C
00436 IMMUTABLE
00437 STRICT;
00438 
00439 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_merge_states(
00440     state1 DOUBLE PRECISION[],
00441     state2 DOUBLE PRECISION[])
00442 RETURNS DOUBLE PRECISION[]
00443 AS 'MODULE_PATHNAME'
00444 LANGUAGE C
00445 IMMUTABLE
00446 STRICT;
00447 
00448 CREATE TYPE MADLIB_SCHEMA.chi2_test_result AS (
00449     statistic DOUBLE PRECISION,
00450     p_value DOUBLE PRECISION,
00451     df BIGINT,
00452     phi DOUBLE PRECISION,
00453     contingency_coef DOUBLE PRECISION
00454 );
00455 
00456 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_final(
00457     state DOUBLE PRECISION[]
00458 ) RETURNS MADLIB_SCHEMA.chi2_test_result
00459 AS 'MODULE_PATHNAME'
00460 LANGUAGE C
00461 IMMUTABLE
00462 STRICT;
00463 
00464 /**
00465  * @brief Perform Pearson's chi-squared goodness-of-fit test
00466  *
00467  * Let \f$ n_1, \dots, n_k \f$ be a realization of a (vector) random variable
00468  * \f$ N = (N_1, \dots, N_k) \f$ that follows the multinomial distribution with
00469  * parameters \f$ k \f$ and \f$ p = (p_1, \dots, p_k) \f$. Test the null
00470  * hypothesis \f$ H_0 : p = p^0 \f$.
00471  *
00472  * @param observed Number \f$ n_i \f$ of observations of the current event/row
00473  * @param expected Expected number of observations of current event/row. This
00474  *     number is not required to be normalized. That is, \f$ p^0_i \f$ will be
00475  *     taken as \c expected divided by <tt>sum(expected)</tt>. Hence, if this
00476  *     parameter is not specified, chi2_gof_test() will by default use
00477  *     \f$ p^0 = (\frac 1k, \dots, \frac 1k) \f$, i.e., test that \f$ p \f$ is a
00478  *     discrete uniform distribution.
00479  * @param df Degrees of freedom. This is the number of events reduced by the
00480  *     degree of freedom lost by using the observed numbers for defining the
00481  *     expected number of observations. If this parameter is 0, the degree
00482  *     of freedom is taken as \f$ (k - 1) \f$.
00483  *
00484  * @return A composite value as follows. Let \f$ n = \sum_{i=1}^k n_i \f$.
00485  *  - <tt>statistic FLOAT8</tt> - Statistic
00486  *    \f[
00487  *        \chi^2 = \sum_{i=1}^k \frac{(n_i - np^0_i)^2}{np^0_i}
00488  *    \f]
00489  *    The corresponding random
00490  *    variable is approximately chi-squared distributed with
00491  *    \c df degrees of freedom.
00492  *  - <tt>df BIGINT</tt> - Degrees of freedom
00493  *  - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e.,
00494  *    \f$ \Pr[X^2 \geq \chi^2 \mid p = p^0] \f$. Computed as
00495  *    <tt>(1.0 - \ref chi_squared_cdf "chi_squared_cdf"(statistic))</tt>.
00496  *  - <tt>phi FLOAT8</tt> - Phi coefficient, i.e.,
00497  *    \f$ \phi = \sqrt{\frac{\chi^2}{n}} \f$
00498  *  - <tt>contingency_coef FLOAT8</tt> - Contingency coefficient, i.e.,
00499  *    \f$ \sqrt{\frac{\chi^2}{n + \chi^2}} \f$
00500  *
00501  * @usage
00502  *  - Test null hypothesis that all possible outcomes of a categorical variable
00503  *    are equally likely:
00504  *    <pre>SELECT (chi2_gof_test(<em>observed</em>)).* FROM <em>source</em></pre>
00505  *  - Test null hypothesis that two categorical variables are independent.
00506  *    Such data is often shown in a <em>contingency table</em> (also known as
00507  *    \em crosstab). A crosstab is a matrix where possible values for the first
00508  *    variable correspond to rows and values for the second variable to
00509  *    columns. The matrix elements are the observation frequencies of the
00510  *    joint occurrence of the respective values.
00511  *    chi2_gof_test() assumes that the crosstab is stored in normalized form,
00512  *    i.e., there are three columns <tt><em>var1</em></tt>,
00513  *    <tt><em>var2</em></tt>, <tt><em>observed</em></tt>.
00514  *    <pre>SELECT (chi2_gof_test(<em>observed</em>, expected, deg_freedom)).*
00515  *FROM (
00516  *    SELECT
00517  *        <em>observed</em>,
00518  *        sum(<em>observed</em>) OVER (PARTITION BY var1)::DOUBLE PRECISION
00519  *            * sum(<em>observed</em>) OVER (PARTITION BY var2) AS expected
00520  *    FROM <em>source</em>
00521  *) p, (
00522  *   SELECT
00523  *        (count(DISTINCT <em>var1</em>) - 1) * (count(DISTINCT <em>var2</em>) - 1) AS deg_freedom
00524  *    FROM <em>source</em>
00525  *) q;</pre>
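 *
 * For the first usage above (all outcomes equally likely), the statistic and
 * the default degrees of freedom \f$ (k - 1) \f$ can be cross-checked with
 * built-in aggregates and window functions. The following is a sketch on
 * hypothetical inline counts, not the library's implementation:
 * <pre>-- Under the uniform null hypothesis, the expected count per event is n / k.
 *SELECT sum((observed - expected) * (observed - expected) / expected) AS statistic,
 *       count(*) - 1 AS df
 *FROM (
 *    SELECT observed,
 *           sum(observed) OVER ()::DOUBLE PRECISION / count(*) OVER () AS expected
 *    FROM (VALUES (18::BIGINT), (22), (25), (15)) AS source(observed)
 *) s;</pre>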
00526  */
00527 CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
00528     /*+ observed */ BIGINT,
00529     /*+ expected */ DOUBLE PRECISION /*+ DEFAULT 1 */,
00530     /*+ df */ BIGINT /*+ DEFAULT 0 */
00531 ) (
00532     SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
00533     STYPE=DOUBLE PRECISION[],
00534     FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
00535     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
00536     INITCOND='{0,0,0,0,0,0}'
00537 );
00538 
00539 CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
00540     /*+ observed */ BIGINT,
00541     /*+ expected */ DOUBLE PRECISION
00542 ) (
00543     SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
00544     STYPE=DOUBLE PRECISION[],
00545     FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
00546     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
00547     INITCOND='{0,0,0,0,0,0,0}'
00548 );
00549 
00550 CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
00551     /*+ observed */ BIGINT
00552 ) (
00553     SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
00554     STYPE=DOUBLE PRECISION[],
00555     FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
00556     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
00557     INITCOND='{0,0,0,0,0,0,0}'
00558 );
00559 
00560 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_transition(
00561     state DOUBLE PRECISION[],
00562     "first" BOOLEAN,
00563     "value" DOUBLE PRECISION,
00564     "numFirst" BIGINT,
00565     "numSecond" BIGINT
00566 ) RETURNS DOUBLE PRECISION[]
00567 AS 'MODULE_PATHNAME'
00568 LANGUAGE C
00569 IMMUTABLE
00570 STRICT;
00571 
00572 CREATE TYPE MADLIB_SCHEMA.ks_test_result AS (
00573     statistic DOUBLE PRECISION,
00574     k_statistic DOUBLE PRECISION,
00575     p_value DOUBLE PRECISION
00576 );
00577 
00578 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_final(
00579     state DOUBLE PRECISION[])
00580 RETURNS MADLIB_SCHEMA.ks_test_result
00581 AS 'MODULE_PATHNAME'
00582 LANGUAGE C IMMUTABLE STRICT;
00583 
00584 /**
00585  * @brief Perform Kolmogorov-Smirnov test
00586  *
00587  * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
00588  * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
00589  * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
00590  * underlying distribution functions \f$ F_X, F_Y \f$ are identical, i.e.,
00591  * \f$ H_0 : F_X = F_Y \f$.
00592  *
00593  * @param first Determines whether the value belongs to the first
00594  *     (if \c TRUE) or the second sample (if \c FALSE)
00595  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
00596  * @param m Size \f$ m \f$ of the first sample. See usage instructions below.
00597  * @param n Size of the second sample. See usage instructions below.
00598  *
00599  * @return A composite value.
00600  *  - <tt>statistic FLOAT8</tt> - Kolmogorov–Smirnov statistic
00601  *    \f[
00602  *        d = \max_{t \in \mathbb R} |F_x(t) - F_y(t)|
00603  *    \f]
00604  *    where \f$ F_x(t) := \frac 1m |\{ i \mid x_i \leq t \}| \f$ and
00605  *    \f$ F_y \f$ (defined likewise) are the empirical distribution functions.
00606  *  - <tt>k_statistic FLOAT8</tt> - Kolmogorov statistic
00607  *    \f$
00608  *        k = \left( r + 0.12 + \frac{0.11}{r} \right) \cdot d
00609  *    \f$
00610  *    where
00611  *    \f$
00612  *        r = \sqrt{\frac{m n}{m+n}}.
00613  *    \f$
00614  *    Then \f$ k \f$ is approximately Kolmogorov distributed.
00615  *  - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e., an approximate value
00616  *    for \f$ \Pr[D \geq d \mid F_X = F_Y] \f$. Computed as
00617  *    <tt>(1.0 - \ref kolmogorov_cdf "kolmogorov_cdf"(k_statistic))</tt>.
00618  *
00619  * @usage
00620  *  - Test null hypothesis that two samples stem from the same distribution:
00621  *    <pre>SELECT (ks_test(<em>first</em>, <em>value</em>,
00622  *    (SELECT count(<em>value</em>) FROM <em>source</em> WHERE <em>first</em>),
00623  *    (SELECT count(<em>value</em>) FROM <em>source</em> WHERE NOT <em>first</em>)
00624  *    ORDER BY <em>value</em>
00625  *)).* FROM <em>source</em></pre>
00626  *
00627  * @note
00628  *     This aggregate must be used as an ordered aggregate
00629  *     (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
00630  *     not ordered.
00631  */
00632 m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
00633 CREATE
00634 m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
00635 AGGREGATE MADLIB_SCHEMA.ks_test(
00636     /*+ "first" */ BOOLEAN,
00637     /*+ "value" */ DOUBLE PRECISION,
00638     /*+ m */ BIGINT,
00639     /*+ n */ BIGINT
00640 ) (
00641     SFUNC=MADLIB_SCHEMA.ks_test_transition,
00642     STYPE=DOUBLE PRECISION[],
00643     FINALFUNC=MADLIB_SCHEMA.ks_test_final,
00644     INITCOND='{0,0,0,0,0,0,0}'
00645 );
00646 !>)
00647 
00648 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_transition(
00649     state DOUBLE PRECISION[],
00650     "first" BOOLEAN,
00651     "value" DOUBLE PRECISION
00652 ) RETURNS DOUBLE PRECISION[]
00653 AS 'MODULE_PATHNAME'
00654 LANGUAGE C
00655 IMMUTABLE
00656 STRICT;
00657 
00658 CREATE TYPE MADLIB_SCHEMA.mw_test_result AS (
00659     statistic DOUBLE PRECISION,
00660     u_statistic DOUBLE PRECISION,
00661     p_value_one_sided DOUBLE PRECISION,
00662     p_value_two_sided DOUBLE PRECISION
00663 );
00664 
00665 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_final(
00666     state DOUBLE PRECISION[])
00667 RETURNS MADLIB_SCHEMA.mw_test_result
00668 AS 'MODULE_PATHNAME'
00669 LANGUAGE C IMMUTABLE STRICT;
00670 
00671 /**
00672  * @brief Perform Mann-Whitney test
00673  *
00674  * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
00675  * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
00676  * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
00677  * underlying distributions are equal, i.e.,
00678  * \f$ H_0 : \forall i,j: \Pr[X_i > Y_j] + \frac{\Pr[X_i = Y_j]}{2} = \frac 12 \f$.
00679  *
00680  * @param first Determines whether the value belongs to the first
00681  *     (if \c TRUE) or the second sample (if \c FALSE)
00682  * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
00683  *
00684  * @return A composite value.
00685  *  - <tt>statistic FLOAT8</tt> - Statistic
00686  *    \f[
00687  *        z = \frac{u - \frac{mn}{2}}{\sqrt{\frac{mn(m+n+1)}{12}}}
00688  *    \f]
00689  *    where \f$ u \f$ is the u-statistic computed as follows. The z-statistic
00690  *    is approximately standard normally distributed.
00691  *  - <tt>u_statistic FLOAT8</tt> - Statistic
00692  *    \f$ u = \min \{ u_x, u_y \} \f$ where
00693  *    \f[
00694  *        u_x = mn + \binom{m+1}{2} - \sum_{i=1}^m r_{x,i}
00695  *    \f]
00696  *    where
00697  *    \f[
00698  *        r_{x,i}
00699  *        =   |\{ j \mid x_j < x_i \}| + |\{ j \mid y_j < x_i \}| +
00700  *            \frac{|\{ j \mid x_j = x_i \}| + |\{ j \mid y_j = x_i \}| + 1}{2}
00701  *    \f]
00702  *    is defined as the rank of \f$ x_i \f$ in the combined list of all
00703  *    \f$ m+n \f$ observations. For ties, the average rank of all equal values
00704  *    is used.
00705  *  - <tt>p_value_one_sided FLOAT8</tt> - Approximate one-sided p-value, i.e.,
00706  *    an approximate value for \f$ \Pr[Z \geq z \mid H_0] \f$. Computed as
00707  *    <tt>(1.0 - \ref normal_cdf "normal_cdf"(statistic))</tt>.
00708  *  - <tt>p_value_two_sided FLOAT8</tt> - Approximate two-sided p-value, i.e.,
00709  *    an approximate value for \f$ \Pr[|Z| \geq |z| \mid H_0] \f$. Computed as
00710  *    <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(statistic)))</tt>.
00711  *
00712  * @usage
00713  *  - Test null hypothesis that two samples stem from the same distribution:
00714  *    <pre>SELECT (mw_test(<em>first</em>, <em>value</em> ORDER BY <em>value</em>)).* FROM <em>source</em></pre>
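 *
 * The quantity \f$ u_x \f$ defined above can be cross-checked with window
 * functions, using the average rank for ties. The following is a sketch on
 * hypothetical inline data (including one tie), not the library's
 * implementation; \f$ u_y \f$ follows analogously with the roles of the two
 * samples exchanged:
 * <pre>SELECT m * n + m * (m + 1) / 2 - rank_sum_x AS u_x
 *FROM (
 *    SELECT sum(CASE WHEN "first" THEN 1 ELSE 0 END)::DOUBLE PRECISION AS m,
 *           sum(CASE WHEN "first" THEN 0 ELSE 1 END)::DOUBLE PRECISION AS n,
 *           sum(CASE WHEN "first" THEN avg_rank ELSE 0 END) AS rank_sum_x
 *    FROM (
 *        -- rank() is the smallest rank among equal values; adding half the
 *        -- number of other equal values yields the average rank
 *        SELECT "first",
 *               rank() OVER (ORDER BY value)
 *                   + (count(*) OVER (PARTITION BY value) - 1) / 2.0 AS avg_rank
 *        FROM (VALUES (TRUE, 5.1::DOUBLE PRECISION), (TRUE, 4.8), (TRUE, 5.6),
 *                     (FALSE, 4.2), (FALSE, 4.8), (FALSE, 4.4)) AS source("first", value)
 *    ) r
 *) s;</pre>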
00715  *
00716  * @note
00717  *     This aggregate must be used as an ordered aggregate
00718  *     (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
00719  *     not ordered.
00720  */
00721 m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
00722 CREATE
00723 m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
00724 AGGREGATE MADLIB_SCHEMA.mw_test(
00725     /*+ "first" */ BOOLEAN,
00726     /*+ "value" */ DOUBLE PRECISION
00727 ) (
00728     SFUNC=MADLIB_SCHEMA.mw_test_transition,
00729     STYPE=DOUBLE PRECISION[],
00730     FINALFUNC=MADLIB_SCHEMA.mw_test_final,
00731     INITCOND='{0,0,0,0,0,0,0}'
00732 );
00733 !>)
00734 
00735 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
00736     state DOUBLE PRECISION[],
00737     value DOUBLE PRECISION,
00738     "precision" DOUBLE PRECISION
00739 ) RETURNS DOUBLE PRECISION[]
00740 AS 'MODULE_PATHNAME'
00741 LANGUAGE C
00742 IMMUTABLE
00743 STRICT;
00744 
00745 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
00746     state DOUBLE PRECISION[],
00747     value DOUBLE PRECISION
00748 ) RETURNS DOUBLE PRECISION[]
00749 AS 'MODULE_PATHNAME'
00750 LANGUAGE C
00751 IMMUTABLE
00752 STRICT;
00753 
00754 
00755 CREATE TYPE MADLIB_SCHEMA.wsr_test_result AS (
00756     statistic DOUBLE PRECISION,
00757     rank_sum_pos FLOAT8,
00758     rank_sum_neg FLOAT8,
00759     num BIGINT,
00760     z_statistic DOUBLE PRECISION,
00761     p_value_one_sided DOUBLE PRECISION,
00762     p_value_two_sided DOUBLE PRECISION
00763 );
00764 
00765 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_final(
00766     state DOUBLE PRECISION[])
00767 RETURNS MADLIB_SCHEMA.wsr_test_result
00768 AS 'MODULE_PATHNAME'
00769 LANGUAGE C IMMUTABLE STRICT;
00770 
00771 /**
00772  * @brief Perform Wilcoxon signed-rank test
00773  *
00774  * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
00775  * \f$ X_1, \dots, X_n \f$ with unknown mean \f$ \mu \f$, test the null
00776  * hypotheses \f$ H_0 : \mu \leq 0 \f$ and \f$ H_0 : \mu = 0 \f$.
00777  *
00778  * @param value Value of random variate \f$ x_i \f$. Values of 0
00779  *     are ignored (i.e., they do not count towards \f$ n \f$).
00780  * @param precision The precision \f$ \epsilon_i \f$ with which \c value is known.
00781  *     The precision determines the handling of ties. The current value
00782  *     \f$ v_i \f$ is regarded as a tie with the previous value \f$ v_{i-1} \f$ if
00783  *     \f$ v_i - \epsilon_i \leq \max_{j=1, \dots, i-1} v_j + \epsilon_j \f$.
00784  *     If \c precision is negative, then it will be treated as
00785  *     <tt>value * 2^(-52)</tt>. (Note that \f$ 2^{-52} \f$ is the machine
00786  *     epsilon for type <tt>DOUBLE PRECISION</tt>.)
00787  *
00788  * @return A composite value:
00789  *  - <tt>statistic FLOAT8</tt> - statistic computed as follows. Let
00790  *    \f$
00791  *        w^+ = \sum_{i \mid x_i > 0} r_i
00792  *    \f$
00793  *    and
00794  *    \f$
00795  *        w^- = \sum_{i \mid x_i < 0} r_i
00796  *    \f$
00797  *    be the <em>signed rank sums</em> where
00798  *    \f[
00799  *        r_i
00800  *        =   |\{ j \mid |x_j| < |x_i| \}|
00801  *        +   \frac{|\{ j \mid |x_j| = |x_i| \}| + 1}{2}.
00802  *    \f]
00803  *    The Wilcoxon signed-rank statistic is \f$ w = \min \{ w^+, w^- \} \f$.
00804  *  - <tt>rank_sum_pos FLOAT8</tt> - rank sum of all positive values, i.e., \f$ w^+ \f$
00805  *  - <tt>rank_sum_neg FLOAT8</tt> - rank sum of all negative values, i.e., \f$ w^- \f$
00806  *  - <tt>num BIGINT</tt> - number \f$ n \f$ of non-zero values
00807  *  - <tt>z_statistic FLOAT8</tt> - z-statistic
00808  *    \f[
00809  *       z = \frac{w^+ - \frac{n(n+1)}{4}}
00810  *               {\sqrt{\frac{n(n+1)(2n+1)}{24}
00811  *                - \sum_{i=1}^n \frac{t_i^2 - 1}{48}}}
00812  *    \f]
00813  *    where \f$ t_i \f$ is the number of
00814  *    values with absolute value equal to \f$ |x_i| \f$. The corresponding
00815  *    random variable is approximately standard normally distributed.
00816  *  - <tt>p_value_one_sided FLOAT8</tt> - One-sided p-value, i.e.,
00817  *    \f$ \Pr[Z \geq z \mid \mu \leq 0] \f$. Computed as
00818  *    <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>.
00819  *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
00820  *    \f$ \Pr[ |Z| \geq |z| \mid \mu = 0] \f$. Computed as
00821  *    <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>.
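 *
 * As a worked illustration of the formulas above (with made-up data), take
 * \f$ x = (1, -2, 2, 3) \f$, so that \f$ n = 4 \f$. The absolute values of
 * \f$ -2 \f$ and \f$ 2 \f$ are tied and share the average rank
 * \f$ 1 + \frac{2 + 1}{2} = 2.5 \f$, which gives the ranks
 * \f$ (1, 2.5, 2.5, 4) \f$ and
 * \f[
 *     w^+ = 1 + 2.5 + 4 = 7.5, \qquad w^- = 2.5, \qquad
 *     w = \min \{ 7.5, 2.5 \} = 2.5.
 * \f]
 * With \f$ (t_1, \dots, t_4) = (1, 2, 2, 1) \f$, the tie correction is
 * \f$ 2 \cdot \frac{2^2 - 1}{48} = 0.125 \f$, and
 * \f$ \frac{n(n+1)}{4} = 5 \f$, \f$ \frac{n(n+1)(2n+1)}{24} = 7.5 \f$, hence
 * \f[
 *     z = \frac{7.5 - 5}{\sqrt{7.5 - 0.125}} \approx 0.92.
 * \f]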
00822  *
00823  * @usage
00824  *  - One-sample test: Test null hypothesis that the mean of a sample is at
00825  *    most (or equal to, respectively) \f$ \mu_0 \f$:
00826  *    <pre>SELECT (wsr_test(<em>value</em> - <em>mu_0</em> ORDER BY abs(<em>value</em>))).* FROM <em>source</em></pre>
00827  *  - Dependent paired test: Test null hypothesis that the mean difference
00828  *    between the first and second value in a pair is at most (or equal to,
00829  *    respectively) \f$ \mu_0 \f$:
00830  *    <pre>SELECT (wsr_test(<em>first</em> - <em>second</em> - <em>mu_0</em> ORDER BY abs(<em>first</em> - <em>second</em>))).* FROM <em>source</em></pre>
00831  *    If correctly determining ties is important (e.g., you may want to do so
00832  *    when comparing to software products that take \c first, \c second,
00833  *    and \c mu_0 as individual parameters), supply the precision parameter.
00834  *    This can be done as follows:
00835  *    <pre>SELECT (wsr_test(
00836     <em>first</em> - <em>second</em> - <em>mu_0</em>,
00837     3 * 2^(-52) * greatest(first, second, mu_0)
00838     ORDER BY abs(<em>first</em> - <em>second</em>)
00839 )).* FROM <em>source</em></pre>
00840  *    Here \f$ 2^{-52} \f$ is the machine epsilon, which we scale to the
00841  *    magnitude of the input data and multiply by 3 because the sum has
00842  *    three terms.
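 *  - Illustrative sketch of a complete one-sample call with
 *    \f$ \mu_0 = 0 \f$. The table \c wsr_example and its column \c val are
 *    hypothetical names chosen for this example, and \c wsr_test is assumed
 *    to be reachable through the <tt>search_path</tt>:
 *    <pre>-- Hypothetical sample data; -2 and 2 are tied in absolute value
CREATE TEMP TABLE wsr_example (val DOUBLE PRECISION);
INSERT INTO wsr_example VALUES (1), (-2), (2), (3);
-- Single-argument form; rows must arrive ordered by absolute value
SELECT (wsr_test(val ORDER BY abs(val))).* FROM wsr_example;</pre>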
00843  *
00844  * @note
00845  *     This aggregate must be used as an ordered aggregate
00846  *     (<tt>ORDER BY abs(\em value)</tt>) and will raise an exception if the
00847  *     absolute values are not ordered.
00848  */
00849 m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
00850 CREATE
00851 m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
00852 AGGREGATE MADLIB_SCHEMA.wsr_test(
00853     /*+ "value" */ DOUBLE PRECISION,
00854     /*+ "precision" */ DOUBLE PRECISION /*+ DEFAULT -1 */
00855 ) (
00856     SFUNC=MADLIB_SCHEMA.wsr_test_transition,
00857     STYPE=DOUBLE PRECISION[],
00858     FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
00859     INITCOND='{0,0,0,0,0,0,0,0,0}'
00860 );
00861 !>)
00862 
00863 m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
00864 CREATE
00865 m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
00866 AGGREGATE MADLIB_SCHEMA.wsr_test(
00867     /*+ value */ DOUBLE PRECISION
00868 ) (
00869     SFUNC=MADLIB_SCHEMA.wsr_test_transition,
00870     STYPE=DOUBLE PRECISION[],
00871     FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
00872     INITCOND='{0,0,0,0,0,0,0,0,0}'
00873 );
00874 !>)
00875 
00876 CREATE TYPE MADLIB_SCHEMA.one_way_anova_result AS (
00877     sum_squares_between DOUBLE PRECISION,
00878     sum_squares_within DOUBLE PRECISION,
00879     df_between BIGINT,
00880     df_within BIGINT,
00881     mean_squares_between DOUBLE PRECISION,
00882     mean_squares_within DOUBLE PRECISION,
00883     statistic DOUBLE PRECISION,
00884     p_value DOUBLE PRECISION
00885 );
00886 
00887 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_transition(
00888     state DOUBLE PRECISION[],
00889     "group" INTEGER,
00890     value DOUBLE PRECISION)
00891 RETURNS DOUBLE PRECISION[]
00892 AS 'MODULE_PATHNAME'
00893 LANGUAGE C
00894 IMMUTABLE
00895 STRICT;
00896 
00897 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_merge_states(
00898     state1 DOUBLE PRECISION[],
00899     state2 DOUBLE PRECISION[])
00900 RETURNS DOUBLE PRECISION[]
00901 AS 'MODULE_PATHNAME'
00902 LANGUAGE C
00903 IMMUTABLE STRICT;
00904 
00905 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_final(
00906     state DOUBLE PRECISION[])
00907 RETURNS MADLIB_SCHEMA.one_way_anova_result
00908 AS 'MODULE_PATHNAME'
00909 LANGUAGE C IMMUTABLE STRICT;
00910 
00911 /**
00912  * @brief Perform one-way analysis of variance
00913  *
00914  * Given realizations
00915  * \f$ x_{1,1}, \dots, x_{1, n_1}, x_{2,1}, \dots, x_{2,n_2}, \dots, x_{k,n_k} \f$
00916  * of i.i.d. random variables \f$ X_{i,j} \sim N(\mu_i, \sigma^2) \f$ with
00917  * unknown parameters \f$ \mu_1, \dots, \mu_k \f$ and \f$ \sigma^2 \f$, test the
00918  * null hypothesis \f$ H_0 : \mu_1 = \dots = \mu_k \f$.
00919  *
00920  * @param group Group which \c value is from. Note that \c group can assume
00921  *     arbitrary values, not limited to a contiguous range of integers.
00922  * @param value Value of random variate \f$ x_{i,j} \f$
00923  *
00924  * @return A composite value as follows. Let \f$ n := \sum_{i=1}^k n_i \f$ be
00925  *     the total size of all samples. Denote by \f$ \bar x \f$ the grand
00926  *     \ref sample_mean "mean", by \f$ \overline{x_i} \f$ the group
00927  *     \ref sample_mean "sample means", and by \f$ s_i^2 \f$ the group
00928  *     \ref sample_variance "sample variances".
00929  *  - <tt>sum_squares_between DOUBLE PRECISION</tt> - sum of squares between the
00930  *    group means, i.e.,
00931  *    \f$
00932  *        \mathit{SS}_b = \sum_{i=1}^k n_i (\overline{x_i} - \bar x)^2.
00933  *    \f$
00934  *  - <tt>sum_squares_within DOUBLE PRECISION</tt> - sum of squares within the
00935  *    groups, i.e.,
00936  *    \f$
00937  *        \mathit{SS}_w = \sum_{i=1}^k (n_i - 1) s_i^2.
00938  *    \f$
00939  *  - <tt>df_between BIGINT</tt> - degrees of freedom for between-group variation \f$ (k-1) \f$
00940  *  - <tt>df_within BIGINT</tt> - degrees of freedom for within-group variation \f$ (n-k) \f$
00941  *  - <tt>mean_squares_between DOUBLE PRECISION</tt> - mean square between
00942  *    groups, i.e.,
00943  *    \f$
00944  *        s_b^2 := \frac{\mathit{SS}_b}{k-1}
00945  *    \f$
00946  *  - <tt>mean_squares_within DOUBLE PRECISION</tt> - mean square within
00947  *    groups, i.e.,
00948  *    \f$
00949  *        s_w^2 := \frac{\mathit{SS}_w}{n-k}
00950  *    \f$
00951  *  - <tt>statistic DOUBLE PRECISION</tt> - Statistic computed as
00952  *    \f[
00953  *        f = \frac{s_b^2}{s_w^2}.
00954  *    \f]
00955  *    This statistic is Fisher F-distributed with \f$ (k-1) \f$ degrees of
00956  *    freedom in the numerator and \f$ (n-k) \f$ degrees of freedom in the
00957  *    denominator.
00958  *  - <tt>p_value DOUBLE PRECISION</tt> - p-value, i.e.,
00959  *    \f$ \Pr[ F \geq f \mid H_0] \f$.
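 *
 * As a worked illustration of the formulas above (with made-up data),
 * consider \f$ k = 2 \f$ groups with values \f$ (1, 2, 3) \f$ and
 * \f$ (2, 3, 4) \f$. Then \f$ n = 6 \f$, \f$ \bar x = 2.5 \f$,
 * \f$ \overline{x_1} = 2 \f$, \f$ \overline{x_2} = 3 \f$, and
 * \f$ s_1^2 = s_2^2 = 1 \f$, so
 * \f[
 *     \mathit{SS}_b = 3 \cdot 0.5^2 + 3 \cdot 0.5^2 = 1.5, \qquad
 *     \mathit{SS}_w = 2 \cdot 1 + 2 \cdot 1 = 4, \qquad
 *     f = \frac{1.5 / 1}{4 / 4} = 1.5,
 * \f]
 * with \f$ k - 1 = 1 \f$ and \f$ n - k = 4 \f$ degrees of freedom.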
00960  *
00961  * @usage
00962  *  - Test null hypothesis that the means of all samples are equal:
00963  *    <pre>SELECT (one_way_anova(<em>group</em>, <em>value</em>)).* FROM <em>source</em></pre>
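 *  - Illustrative sketch of a complete call. The table \c anova_example and
 *    its columns \c grp and \c val are hypothetical names chosen for this
 *    example, and \c one_way_anova is assumed to be reachable through the
 *    <tt>search_path</tt>:
 *    <pre>-- Hypothetical sample data: two groups with three values each
CREATE TEMP TABLE anova_example (grp INTEGER, val DOUBLE PRECISION);
INSERT INTO anova_example VALUES
    (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4);
-- No ORDER BY is needed for this aggregate
SELECT (one_way_anova(grp, val)).* FROM anova_example;</pre>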
00964  */
00965 CREATE AGGREGATE MADLIB_SCHEMA.one_way_anova(
00966     /*+ group */ INTEGER,
00967     /*+ value */ DOUBLE PRECISION) (
00968 
00969     SFUNC=MADLIB_SCHEMA.one_way_anova_transition,
00970     STYPE=DOUBLE PRECISION[],
00971     FINALFUNC=MADLIB_SCHEMA.one_way_anova_final,
00972     m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.one_way_anova_merge_states,!>)
00973     INITCOND='{0,0}'
00974 );
00975 
00976 m4_changequote(<!`!>,<!'!>)