MADlib 0.7 User Documentation
/* ----------------------------------------------------------------------- */
/**
 *
 * @file hypothesis_tests.sql_in
 *
 * @brief SQL functions for statistical hypothesis tests
 *
 * @sa For an overview of hypothesis-test functions, see the module
 *     description \ref grp_stats_tests.
 *
 */
/* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4')
m4_changequote(<!,!>)

/**
@addtogroup grp_stats_tests

@about

Hypothesis tests are used to confirm or reject a <em>“null” hypothesis</em>
\f$ H_0 \f$ about the distribution of random variables, given realizations of
these random variables. Since in general it is not possible to make statements
with certainty, one is interested in the probability \f$ p \f$ of seeing random
variates at least as extreme as the ones observed, assuming that \f$ H_0 \f$ is
true. If this probability \f$ p \f$ is small, \f$ H_0 \f$ will be rejected by
the test with <em>significance level</em> \f$ p \f$. Falsifying \f$ H_0 \f$ is
the canonical goal when employing a hypothesis test. That is, hypothesis tests
are typically used in order to substantiate that instead the <em>alternative
hypothesis</em> \f$ H_1 \f$ is true.

Hypothesis tests may be divided into parametric and non-parametric tests. A
parametric test assumes certain distributions and makes inferences about
parameters of the distributions (e.g., the mean of a normal distribution).
Formally, there is a given domain of possible parameters \f$ \Gamma \f$ and the
null hypothesis \f$ H_0 \f$ is the event that the true parameter
\f$ \gamma_0 \in \Gamma_0 \f$, where \f$ \Gamma_0 \subsetneq \Gamma \f$.
Non-parametric tests, on the other hand, do not assume any particular
distribution of the sample (e.g., a non-parametric test may simply test if two
distributions are similar).

The first step of a hypothesis test is to compute a <em>test statistic</em>,
which is a function of the random variates, i.e., a random variate itself.
A hypothesis test relies on the distribution of the test statistic being
(approximately) known. Now, the \f$ p \f$-value is the probability of seeing a
test statistic at least as extreme as the one observed, assuming that
\f$ H_0 \f$ is true. In a case where the null hypothesis corresponds to a family
of distributions (e.g., in a parametric test where \f$ \Gamma_0 \f$ is not a
singleton set), the \f$ p \f$-value is the supremum, over all possible
distributions according to the null hypothesis, of these probabilities.

@input

Input data is assumed to be normalized with all values stored row-wise. In
general, the following inputs are expected.

One-sample tests expect the following form:
<pre>{TABLE|VIEW} <em>source</em> (
    ...
    <em>value</em> DOUBLE PRECISION
    ...
)</pre>

Two-sample tests expect the following form:
<pre>{TABLE|VIEW} <em>source</em> (
    ...
    <em>first</em> BOOLEAN,
    <em>value</em> DOUBLE PRECISION
    ...
)</pre>
Here, \c first indicates whether a value is from the first (if \c TRUE) or the
second sample (if \c FALSE).

Many-sample tests expect the following form:
<pre>{TABLE|VIEW} <em>source</em> (
    ...
    <em>group</em> INTEGER,
    <em>value</em> DOUBLE PRECISION
    ...
)</pre>

@usage

All tests are implemented as aggregate functions.
The non-parametric (rank-based) tests are implemented as ordered aggregate
functions and thus necessitate an <tt>ORDER BY</tt> clause. In the following,
the simplest forms of usage are given. Specific function signatures, as
described in \ref hypothesis_tests.sql_in, may ask for more arguments or for a
different <tt>ORDER BY</tt> clause.

- Run a parametric one-sample test:
  <pre>SELECT <em>test</em>(<em>value</em>) FROM <em>source</em></pre>
- Run a parametric two-sample test:
  <pre>SELECT <em>test</em>(<em>first</em>, <em>value</em>) FROM <em>source</em></pre>
- Run a non-parametric one-sample test:
  <pre>SELECT <em>test</em>(<em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>
- Run a non-parametric two-sample test:
  <pre>SELECT <em>test</em>(<em>first</em>, <em>value</em> ORDER BY <em>value</em>) FROM <em>source</em></pre>

@examp

See \ref hypothesis_tests.sql_in for examples for each of the aggregate
functions.

@literature

[1] M. Hollander, D. Wolfe: <em>Nonparametric Statistical Methods</em>,
    2nd edition, Wiley, 1999

[2] E. Lehmann, J. Romano: <em>Testing Statistical Hypotheses</em>, 3rd edition,
    Springer, 2005

@sa File hypothesis_tests.sql_in documenting the SQL functions.
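
The usage patterns listed under "Usage" can be tried on a small two-sample
table. The following is only a sketch under assumptions: the table
\c t_test_demo and its rows are made up for illustration, the calls assume that
the MADlib functions are reachable through the \c search_path (otherwise they
need to be schema-qualified), and the second query assumes a platform with
ordered aggregates.
<pre>-- Hypothetical input in the two-sample form described under "Input".
CREATE TABLE t_test_demo ("first" BOOLEAN, "value" DOUBLE PRECISION);
INSERT INTO t_test_demo VALUES
    (TRUE, 5.1), (TRUE, 4.9), (TRUE, 5.6), (TRUE, 5.8),
    (FALSE, 4.2), (FALSE, 4.4), (FALSE, 5.0), (FALSE, 4.6);

-- Parametric two-sample test: a plain aggregate, no ORDER BY needed.
SELECT (t_test_two_pooled("first", "value")).* FROM t_test_demo;

-- Rank-based two-sample test: an ordered aggregate, so ORDER BY is required.
SELECT (mw_test("first", "value" ORDER BY "value")).* FROM t_test_demo;</pre>
Expanding the composite result with <tt>(...).*</tt> returns one column per
field of the result type, e.g., \c statistic, \c df, \c p_value_one_sided, and
\c p_value_two_sided for the t-tests.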
*/

CREATE TYPE MADLIB_SCHEMA.t_test_result AS (
    statistic DOUBLE PRECISION,
    df DOUBLE PRECISION,
    p_value_one_sided DOUBLE PRECISION,
    p_value_two_sided DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_transition(
    state DOUBLE PRECISION[],
    value DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_one_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.t_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE MADLIB_SCHEMA.f_test_result AS (
    statistic DOUBLE PRECISION,
    df1 DOUBLE PRECISION,
    df2 DOUBLE PRECISION,
    p_value_one_sided DOUBLE PRECISION,
    p_value_two_sided DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.f_test_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.f_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;


/**
 * @brief Perform one-sample or dependent paired Student t-test
 *
 * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d. random variables
 * \f$ X_1, \dots, X_n \sim N(\mu, \sigma^2) \f$ with unknown parameters \f$ \mu \f$ and
 * \f$ \sigma^2 \f$, test the null hypotheses \f$ H_0 : \mu \leq 0 \f$ and
 * \f$ H_0 : \mu = 0 \f$.
 *
 * @param value Value of random variate \f$ x_i \f$
 *
 * @return A composite value as follows. We denote by \f$ \bar x \f$ the
 *     \ref sample_mean "sample mean" and by \f$ s^2 \f$ the
 *     \ref sample_variance "sample variance".
 *  - <tt>statistic FLOAT8</tt> - Statistic
 *    \f[
 *        t = \frac{\sqrt n \cdot \bar x}{s}
 *    \f]
 *    The corresponding random
 *    variable is Student-t distributed with
 *    \f$ (n - 1) \f$ degrees of freedom.
 *  - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n - 1) \f$
 *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
 *    In detail, the result is \f$ \Pr[\bar X \geq \bar x \mid \mu = 0] \f$,
 *    which is a lower bound on
 *    \f$ \Pr[\bar X \geq \bar x \mid \mu \leq 0] \f$. Computed as
 *    <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
 *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
 *    \f$ \Pr[ |\bar X| \geq |\bar x| \mid \mu = 0] \f$. Computed as
 *    <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
 *
 * @usage
 *  - One-sample t-test: Test null hypothesis that the mean of a sample is at
 *    most (or equal to, respectively) \f$ \mu_0 \f$:
 *    <pre>SELECT (t_test_one(<em>value</em> - <em>mu_0</em>)).* FROM <em>source</em></pre>
 *  - Dependent paired t-test: Test null hypothesis that the mean difference
 *    between the first and second value in each pair is at most (or equal to,
 *    respectively) \f$ \mu_0 \f$:
 *    <pre>SELECT (t_test_one(<em>first</em> - <em>second</em> - <em>mu_0</em>)).*
 *    FROM <em>source</em></pre>
 */
CREATE AGGREGATE MADLIB_SCHEMA.t_test_one(
    /*+ value */ DOUBLE PRECISION) (

    SFUNC=MADLIB_SCHEMA.t_test_one_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.t_test_one_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0,0}'
);


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_transition(
    state DOUBLE PRECISION[],
    "first" BOOLEAN,
    "value" DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_pooled_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.t_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform two-sample pooled (i.e., equal variances) Student t-test
 *
 * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
 * i.i.d. random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma^2) \f$ and
 * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma^2) \f$ with unknown parameters
 * \f$ \mu_X, \mu_Y, \f$ and \f$ \sigma^2 \f$, test the null hypotheses
 * \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
 *
 * @param first Indicator whether \c value is from first sample
 *     \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
 *     \f$ y_1, \dots, y_m \f$ (if \c FALSE)
 * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
 *
 * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
 *     the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
 *     \ref sample_variance "sample variances".
 *  - <tt>statistic FLOAT8</tt> - Statistic
 *    \f[
 *        t = \frac{\bar x - \bar y}{s_p \sqrt{1/n + 1/m}}
 *    \f]
 *    where
 *    \f[
 *        s_p^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2
 *                      + \sum_{i=1}^m (y_i - \bar y)^2}
 *                     {n + m - 2}
 *    \f]
 *    is the <em>pooled variance</em>.
 *    The corresponding random
 *    variable is Student-t distributed with
 *    \f$ (n + m - 2) \f$ degrees of freedom.
 *  - <tt>df FLOAT8</tt> - Degrees of freedom \f$ (n + m - 2) \f$
 *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
 *    In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
 *    which is a lower bound on
 *    \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
 *    Computed as
 *    <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
 *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
 *    \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
 *    Computed as
 *    <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
 *
 * @usage
 *  - Two-sample pooled t-test: Test null hypothesis that the mean of the first
 *    sample is at most (or equal to, respectively) the mean of the second
 *    sample:
 *    <pre>SELECT (t_test_two_pooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
 */
CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_pooled(
    /*+ "first" */ BOOLEAN,
    /*+ "value" */ DOUBLE PRECISION) (

    SFUNC=MADLIB_SCHEMA.t_test_two_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.t_test_two_pooled_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0,0}'
);


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.t_test_two_unpooled_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.t_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform unpooled (i.e., unequal variances) t-test (also known as
 *     Welch's t-test)
 *
 * Given realizations \f$ x_1, \dots, x_n \f$ and \f$ y_1, \dots, y_m \f$ of
 * i.i.d.
 * random variables \f$ X_1, \dots, X_n \sim N(\mu_X, \sigma_X^2) \f$ and
 * \f$ Y_1, \dots, Y_m \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
 * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null
 * hypotheses \f$ H_0 : \mu_X \leq \mu_Y \f$ and \f$ H_0 : \mu_X = \mu_Y \f$.
 *
 * @param first Indicator whether \c value is from first sample
 *     \f$ x_1, \dots, x_n \f$ (if \c TRUE) or from second sample
 *     \f$ y_1, \dots, y_m \f$ (if \c FALSE)
 * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
 *
 * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
 *     the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
 *     \ref sample_variance "sample variances".
 *  - <tt>statistic FLOAT8</tt> - Statistic
 *    \f[
 *        t = \frac{\bar x - \bar y}{\sqrt{s_X^2/n + s_Y^2/m}}
 *    \f]
 *    The corresponding random variable is approximately Student-t distributed
 *    with
 *    \f[
 *        \frac{(s_X^2 / n + s_Y^2 / m)^2}{(s_X^2 / n)^2/(n-1) + (s_Y^2 / m)^2/(m-1)}
 *    \f]
 *    degrees of freedom (Welch–Satterthwaite formula).
 *  - <tt>df FLOAT8</tt> - Degrees of freedom (as above)
 *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
 *    In detail, the result is \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X = \mu_Y] \f$,
 *    which is a lower bound on
 *    \f$ \Pr[\bar X - \bar Y \geq \bar x - \bar y \mid \mu_X \leq \mu_Y] \f$.
 *    Computed as
 *    <tt>(1.0 - \ref students_t_cdf "students_t_cdf"(statistic))</tt>.
 *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
 *    \f$ \Pr[ |\bar X - \bar Y| \geq |\bar x - \bar y| \mid \mu_X = \mu_Y] \f$.
 *    Computed as
 *    <tt>(2 * \ref students_t_cdf "students_t_cdf"(-abs(statistic)))</tt>.
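 *
 * As a rough worked illustration of the statistic and of the
 * Welch–Satterthwaite degrees of freedom (the numbers are made up and not
 * taken from any data set): assume \f$ n = 4 \f$, \f$ m = 9 \f$,
 * \f$ s_X^2 = 4 \f$, \f$ s_Y^2 = 9 \f$, and \f$ \bar x - \bar y = 2 \f$, so
 * that \f$ s_X^2/n = s_Y^2/m = 1 \f$. Then
 *    \f[
 *        t = \frac{2}{\sqrt{1 + 1}} \approx 1.414,
 *        \qquad
 *        \frac{(1 + 1)^2}{1^2/3 + 1^2/8} = \frac{96}{11} \approx 8.73
 *    \f]
 * degrees of freedom. The degrees of freedom are in general not an integer,
 * which fits \c df being reported as <tt>FLOAT8</tt>.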
 *
 * @usage
 *  - Two-sample unpooled t-test: Test null hypothesis that the mean of the
 *    first sample is at most (or equal to, respectively) the mean of the second
 *    sample:
 *    <pre>SELECT (t_test_two_unpooled(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
 */
CREATE AGGREGATE MADLIB_SCHEMA.t_test_two_unpooled(
    /*+ "first" */ BOOLEAN,
    /*+ "value" */ DOUBLE PRECISION) (

    SFUNC=MADLIB_SCHEMA.t_test_two_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.t_test_two_unpooled_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0,0}'
);

/**
 * @brief Perform Fisher F-test
 *
 * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
 * i.i.d. random variables \f$ X_1, \dots, X_m \sim N(\mu_X, \sigma_X^2) \f$ and
 * \f$ Y_1, \dots, Y_n \sim N(\mu_Y, \sigma_Y^2) \f$ with unknown parameters
 * \f$ \mu_X, \mu_Y, \sigma_X^2, \f$ and \f$ \sigma_Y^2 \f$, test the null
 * hypotheses \f$ H_0 : \sigma_X \leq \sigma_Y \f$ and
 * \f$ H_0 : \sigma_X = \sigma_Y \f$.
 *
 * @param first Indicator whether \c value is from first sample
 *     \f$ x_1, \dots, x_m \f$ (if \c TRUE) or from second sample
 *     \f$ y_1, \dots, y_n \f$ (if \c FALSE)
 * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
 *
 * @return A composite value as follows. We denote by \f$ \bar x, \bar y \f$
 *     the \ref sample_mean "sample means" and by \f$ s_X^2, s_Y^2 \f$ the
 *     \ref sample_variance "sample variances".
 *  - <tt>statistic FLOAT8</tt> - Statistic
 *    \f[
 *        f = \frac{s_Y^2}{s_X^2}
 *    \f]
 *    The corresponding random
 *    variable is F-distributed with
 *    \f$ (n - 1) \f$ degrees of freedom in the numerator and
 *    \f$ (m - 1) \f$ degrees of freedom in the denominator.
 *  - <tt>df1 FLOAT8</tt> - Degrees of freedom in the numerator \f$ (n - 1) \f$
 *  - <tt>df2 FLOAT8</tt> - Degrees of freedom in the denominator \f$ (m - 1) \f$
 *  - <tt>p_value_one_sided FLOAT8</tt> - Lower bound on one-sided p-value.
 *    In detail, the result is \f$ \Pr[F \geq f \mid \sigma_X = \sigma_Y] \f$,
 *    which is a lower bound on
 *    \f$ \Pr[F \geq f \mid \sigma_X \leq \sigma_Y] \f$. Computed as
 *    <tt>(1.0 - \ref fisher_f_cdf "fisher_f_cdf"(statistic))</tt>.
 *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
 *    \f$ 2 \cdot \min \{ p, 1 - p \} \f$ where
 *    \f$ p = \Pr[ F \geq f \mid \sigma_X = \sigma_Y] \f$. Computed as
 *    <tt>(2 * min(p_value_one_sided, 1. - p_value_one_sided))</tt>.
 *
 * @usage
 *  - Test null hypothesis that the variance of the first sample is at most (or
 *    equal to, respectively) the variance of the second sample:
 *    <pre>SELECT (f_test(<em>first</em>, <em>value</em>)).* FROM <em>source</em></pre>
 *
 * @internal We reuse the two-sample t-test transition and merge functions.
 */
CREATE AGGREGATE MADLIB_SCHEMA.f_test(
    /*+ "first" */ BOOLEAN,
    /*+ "value" */ DOUBLE PRECISION) (

    SFUNC=MADLIB_SCHEMA.t_test_two_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.f_test_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.t_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0,0}'
);


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
    state DOUBLE PRECISION[],
    observed BIGINT,
    expected DOUBLE PRECISION,
    df BIGINT
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
    state DOUBLE PRECISION[],
    observed BIGINT,
    expected DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_transition(
    state DOUBLE PRECISION[],
    observed BIGINT
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE TYPE MADLIB_SCHEMA.chi2_test_result AS (
    statistic DOUBLE PRECISION,
    p_value DOUBLE PRECISION,
    df BIGINT,
    phi DOUBLE PRECISION,
    contingency_coef DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.chi2_gof_test_final(
    state DOUBLE PRECISION[]
) RETURNS MADLIB_SCHEMA.chi2_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

/**
 * @brief Perform Pearson's chi-squared goodness-of-fit test
 *
 * Let \f$
 * n_1, \dots, n_k \f$ be a realization of a (vector) random variable
 * \f$ N = (N_1, \dots, N_k) \f$ that follows the multinomial distribution with
 * parameters \f$ k \f$ and \f$ p = (p_1, \dots, p_k) \f$. Test the null
 * hypothesis \f$ H_0 : p = p^0 \f$.
 *
 * @param observed Number \f$ n_i \f$ of observations of the current event/row
 * @param expected Expected number of observations of current event/row. This
 *     number is not required to be normalized. That is, \f$ p^0_i \f$ will be
 *     taken as \c expected divided by <tt>sum(expected)</tt>. Hence, if this
 *     parameter is not specified, chi2_gof_test() will by default use
 *     \f$ p^0 = (\frac 1k, \dots, \frac 1k) \f$, i.e., test that \f$ p \f$ is a
 *     discrete uniform distribution.
 * @param df Degrees of freedom. This is the number of events reduced by the
 *     degree of freedom lost by using the observed numbers for defining the
 *     expected number of observations. If this parameter is 0, the degree
 *     of freedom is taken as \f$ (k - 1) \f$.
 *
 * @return A composite value as follows. Let \f$ n = \sum_{i=1}^k n_i \f$.
 *  - <tt>statistic FLOAT8</tt> - Statistic
 *    \f[
 *        \chi^2 = \sum_{i=1}^k \frac{(n_i - np_i)^2}{np_i}
 *    \f]
 *    The corresponding random
 *    variable is approximately chi-squared distributed with
 *    \c df degrees of freedom.
 *  - <tt>df BIGINT</tt> - Degrees of freedom
 *  - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e.,
 *    \f$ \Pr[X^2 \geq \chi^2 \mid p = p^0] \f$. Computed as
 *    <tt>(1.0 - \ref chi_squared_cdf "chi_squared_cdf"(statistic))</tt>.
 *  - <tt>phi FLOAT8</tt> - Phi coefficient, i.e.,
 *    \f$ \phi = \sqrt{\frac{\chi^2}{n}} \f$
 *  - <tt>contingency_coef FLOAT8</tt> - Contingency coefficient, i.e.,
 *    \f$ \sqrt{\frac{\chi^2}{n + \chi^2}} \f$
 *
 * @usage
 *  - Test null hypothesis that all possible outcomes of a categorical variable
 *    are equally likely:
 *    <pre>SELECT (chi2_gof_test(<em>observed</em>, 1, NULL)).* FROM <em>source</em></pre>
 *  - Test null hypothesis that two categorical variables are independent.
 *    Such data is often shown in a <em>contingency table</em> (also known as
 *    \em crosstab). A crosstab is a matrix where possible values for the first
 *    variable correspond to rows and values for the second variable to
 *    columns. The matrix elements are the observation frequencies of the
 *    joint occurrence of the respective values.
 *    chi2_gof_test() assumes that the crosstab is stored in normalized form,
 *    i.e., there are three columns <tt><em>var1</em></tt>,
 *    <tt><em>var2</em></tt>, <tt><em>observed</em></tt>.
 *    <pre>SELECT (chi2_gof_test(<em>observed</em>, expected, deg_freedom)).*
 *FROM (
 *    SELECT
 *        <em>observed</em>,
 *        sum(<em>observed</em>) OVER (PARTITION BY var1)::DOUBLE PRECISION
 *            * sum(<em>observed</em>) OVER (PARTITION BY var2) AS expected
 *    FROM <em>source</em>
 *) p, (
 *    SELECT
 *        (count(DISTINCT <em>var1</em>) - 1) * (count(DISTINCT <em>var2</em>) - 1) AS deg_freedom
 *    FROM <em>source</em>
 *) q;</pre>
 */
CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
    /*+ observed */ BIGINT,
    /*+ expected */ DOUBLE PRECISION /*+ DEFAULT 1 */,
    /*+ df */ BIGINT /*+ DEFAULT 0 */
) (
    SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0}'
);

CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
    /*+ observed */ BIGINT,
    /*+ expected */ DOUBLE PRECISION
) (
    SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0,0}'
);

CREATE AGGREGATE MADLIB_SCHEMA.chi2_gof_test(
    /*+ observed */ BIGINT
) (
    SFUNC=MADLIB_SCHEMA.chi2_gof_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.chi2_gof_test_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.chi2_gof_test_merge_states,!>)
    INITCOND='{0,0,0,0,0,0,0}'
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_transition(
    state DOUBLE PRECISION[],
    "first" BOOLEAN,
    "value" DOUBLE PRECISION,
    "numFirst" BIGINT,
    "numSecond" BIGINT
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE TYPE MADLIB_SCHEMA.ks_test_result AS (
    statistic DOUBLE PRECISION,
    k_statistic DOUBLE PRECISION,
    p_value DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.ks_test_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.ks_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform Kolmogorov-Smirnov test
 *
 * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
 * i.i.d. random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
 * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
 * underlying distribution functions \f$ F_X, F_Y \f$ are identical, i.e.,
 * \f$ H_0 : F_X = F_Y \f$.
 *
 * @param first Determines whether the value belongs to the first
 *     (if \c TRUE) or the second sample (if \c FALSE)
 * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
 * @param m Size \f$ m \f$ of the first sample. See usage instructions below.
 * @param n Size \f$ n \f$ of the second sample. See usage instructions below.
 *
 * @return A composite value.
 *  - <tt>statistic FLOAT8</tt> - Kolmogorov–Smirnov statistic
 *    \f[
 *        d = \max_{t \in \mathbb R} |F_x(t) - F_y(t)|
 *    \f]
 *    where \f$ F_x(t) := \frac 1m |\{ i \mid x_i \leq t \}| \f$ and
 *    \f$ F_y \f$ (defined likewise) are the empirical distribution functions.
 *  - <tt>k_statistic FLOAT8</tt> - Kolmogorov statistic
 *    \f$
 *        k = \left( r + 0.12 + \frac{0.11}{r} \right) \cdot d
 *    \f$
 *    where
 *    \f$
 *        r = \sqrt{\frac{m n}{m+n}}.
 *    \f$
 *    Then \f$ k \f$ is approximately Kolmogorov distributed.
 *  - <tt>p_value FLOAT8</tt> - Approximate p-value, i.e., an approximate value
 *    for \f$ \Pr[D \geq d \mid F_X = F_Y] \f$.
 *    Computed as
 *    <tt>(1.0 - \ref kolmogorov_cdf "kolmogorov_cdf"(k_statistic))</tt>.
 *
 * @usage
 *  - Test null hypothesis that two samples stem from the same distribution:
 *    <pre>SELECT (ks_test(<em>first</em>, <em>value</em>,
 *    (SELECT count(<em>value</em>) FROM <em>source</em> WHERE <em>first</em>),
 *    (SELECT count(<em>value</em>) FROM <em>source</em> WHERE NOT <em>first</em>)
 *    ORDER BY <em>value</em>
 *)).* FROM <em>source</em></pre>
 *
 * @note
 *     This aggregate must be used as an ordered aggregate
 *     (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
 *     not ordered.
 */
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.ks_test(
    /*+ "first" */ BOOLEAN,
    /*+ "value" */ DOUBLE PRECISION,
    /*+ m */ BIGINT,
    /*+ n */ BIGINT
) (
    SFUNC=MADLIB_SCHEMA.ks_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.ks_test_final,
    INITCOND='{0,0,0,0,0,0,0}'
);
!>)

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_transition(
    state DOUBLE PRECISION[],
    "first" BOOLEAN,
    "value" DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE TYPE MADLIB_SCHEMA.mw_test_result AS (
    statistic DOUBLE PRECISION,
    u_statistic DOUBLE PRECISION,
    p_value_one_sided DOUBLE PRECISION,
    p_value_two_sided DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mw_test_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.mw_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform Mann-Whitney test
 *
 * Given realizations \f$ x_1, \dots, x_m \f$ and \f$ y_1, \dots, y_n \f$ of
 * i.i.d.
 * random variables \f$ X_1, \dots, X_m \f$ and i.i.d.
 * \f$ Y_1, \dots, Y_n \f$, respectively, test the null hypothesis that the
 * underlying distributions are equal, i.e.,
 * \f$ H_0 : \forall i,j: \Pr[X_i > Y_j] + \frac{\Pr[X_i = Y_j]}{2} = \frac 12 \f$.
 *
 * @param first Determines whether the value belongs to the first
 *     (if \c TRUE) or the second sample (if \c FALSE)
 * @param value Value of random variate \f$ x_i \f$ or \f$ y_i \f$
 *
 * @return A composite value.
 *  - <tt>statistic FLOAT8</tt> - Statistic
 *    \f[
 *        z = \frac{u - \frac{mn}{2}}{\sqrt{\frac{mn(m+n+1)}{12}}}
 *    \f]
 *    where \f$ u \f$ is the u-statistic computed as follows. The z-statistic
 *    is approximately standard normally distributed.
 *  - <tt>u_statistic FLOAT8</tt> - Statistic
 *    \f$ u = \min \{ u_x, u_y \} \f$ where
 *    \f[
 *        u_x = mn + \binom{m+1}{2} - \sum_{i=1}^m r_{x,i}
 *    \f]
 *    where
 *    \f[
 *        r_{x,i}
 *        =   |\{ j \mid x_j < x_i \}| + |\{ j \mid y_j < x_i \}| +
 *            \frac{|\{ j \mid x_j = x_i \}| + |\{ j \mid y_j = x_i \}| + 1}{2}
 *    \f]
 *    is defined as the rank of \f$ x_i \f$ in the combined list of all
 *    \f$ m+n \f$ observations. For ties, the average rank of all equal values
 *    is used.
 *  - <tt>p_value_one_sided FLOAT8</tt> - Approximate one-sided p-value, i.e.,
 *    an approximate value for \f$ \Pr[Z \geq z \mid H_0] \f$. Computed as
 *    <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>.
 *  - <tt>p_value_two_sided FLOAT8</tt> - Approximate two-sided p-value, i.e.,
 *    an approximate value for \f$ \Pr[|Z| \geq |z| \mid H_0] \f$. Computed as
 *    <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>.
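 *
 * A small worked illustration of the rank computation (made-up values
 * without ties): let \f$ x = (1, 4) \f$ and \f$ y = (2, 3, 5) \f$, so
 * \f$ m = 2 \f$ and \f$ n = 3 \f$. In the combined ordering, the x-values
 * have ranks 1 and 4 and the y-values have ranks 2, 3, and 5. Assuming that
 * \f$ u_y \f$ is defined symmetrically to \f$ u_x \f$ (with \f$ n \f$ in
 * place of \f$ m \f$ and the ranks of the y-values), this gives
 *    \f[
 *        u_x = 2 \cdot 3 + \binom{3}{2} - (1 + 4) = 4, \qquad
 *        u_y = 2 \cdot 3 + \binom{4}{2} - (2 + 3 + 5) = 2,
 *    \f]
 * hence \f$ u = \min \{ 4, 2 \} = 2 \f$. As a sanity check,
 * \f$ u_x + u_y = mn = 6 \f$.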
 *
 * @usage
 *  - Test null hypothesis that two samples stem from the same distribution:
 *    <pre>SELECT (mw_test(<em>first</em>, <em>value</em> ORDER BY <em>value</em>)).* FROM <em>source</em></pre>
 *
 * @note
 *     This aggregate must be used as an ordered aggregate
 *     (<tt>ORDER BY \em value</tt>) and will raise an exception if values are
 *     not ordered.
 */
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.mw_test(
    /*+ "first" */ BOOLEAN,
    /*+ "value" */ DOUBLE PRECISION
) (
    SFUNC=MADLIB_SCHEMA.mw_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.mw_test_final,
    INITCOND='{0,0,0,0,0,0,0}'
);
!>)

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
    state DOUBLE PRECISION[],
    value DOUBLE PRECISION,
    "precision" DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_transition(
    state DOUBLE PRECISION[],
    value DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;


CREATE TYPE MADLIB_SCHEMA.wsr_test_result AS (
    statistic DOUBLE PRECISION,
    rank_sum_pos FLOAT8,
    rank_sum_neg FLOAT8,
    num BIGINT,
    z_statistic DOUBLE PRECISION,
    p_value_one_sided DOUBLE PRECISION,
    p_value_two_sided DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.wsr_test_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.wsr_test_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform Wilcoxon-Signed-Rank test
 *
 * Given realizations \f$ x_1, \dots, x_n \f$ of i.i.d.
random variables
 * \f$ X_1, \dots, X_n \f$ with unknown mean \f$ \mu \f$, test the null
 * hypotheses \f$ H_0 : \mu \leq 0 \f$ and \f$ H_0 : \mu = 0 \f$.
 *
 * @param value Value of random variate \f$ x_i \f$. Values of 0
 *     are ignored (i.e., they do not count towards \f$ n \f$).
 * @param precision The precision \f$ \epsilon_i \f$ with which value is known.
 *     The precision determines the handling of ties. The current value
 *     \f$ v_i \f$ is regarded as a tie with the previous value \f$ v_{i-1} \f$ if
 *     \f$ v_i - \epsilon_i \leq \max_{j=1, \dots, i-1} (v_j + \epsilon_j) \f$.
 *     If \c precision is negative, then it will be treated as
 *     <tt>value * 2^(-52)</tt>. (Note that \f$ 2^{-52} \f$ is the machine
 *     epsilon for type <tt>DOUBLE PRECISION</tt>.)
 *
 * @return A composite value:
 *  - <tt>statistic FLOAT8</tt> - statistic computed as follows. Let
 *    \f$
 *        w^+ = \sum_{i \mid x_i > 0} r_i
 *    \f$
 *    and
 *    \f$
 *        w^- = \sum_{i \mid x_i < 0} r_i
 *    \f$
 *    be the <em>signed rank sums</em> where
 *    \f[
 *        r_i
 *        =   |\{ j \mid |x_j| < |x_i| \}|
 *        +   \frac{|\{ j \mid |x_j| = |x_i| \}| + 1}{2}.
 *    \f]
 *    The Wilcoxon signed-rank statistic is \f$ w = \min \{ w^+, w^- \} \f$.
 *  - <tt>rank_sum_pos FLOAT8</tt> - rank sum of all positive values, i.e., \f$ w^+ \f$
 *  - <tt>rank_sum_neg FLOAT8</tt> - rank sum of all negative values, i.e., \f$ w^- \f$
 *  - <tt>num BIGINT</tt> - number \f$ n \f$ of non-zero values
 *  - <tt>z_statistic FLOAT8</tt> - z-statistic
 *    \f[
 *        z = \frac{w^+ - \frac{n(n+1)}{4}}
 *                 {\sqrt{\frac{n(n+1)(2n+1)}{24}
 *                  - \sum_{i=1}^n \frac{t_i^2 - 1}{48}}}
 *    \f]
 *    where \f$ t_i \f$ is the number of
 *    values with absolute value equal to \f$ |x_i| \f$. The corresponding
 *    random variable is approximately standard normally distributed.
 *  - <tt>p_value_one_sided FLOAT8</tt> - One-sided p-value, i.e.,
 *    \f$ \Pr[Z \geq z \mid \mu \leq 0] \f$. Computed as
 *    <tt>(1.0 - \ref normal_cdf "normal_cdf"(z_statistic))</tt>.
 *  - <tt>p_value_two_sided FLOAT8</tt> - Two-sided p-value, i.e.,
 *    \f$ \Pr[ |Z| \geq |z| \mid \mu = 0] \f$. Computed as
 *    <tt>(2 * \ref normal_cdf "normal_cdf"(-abs(z_statistic)))</tt>.
 *
 * @usage
 *  - One-sample test: Test null hypothesis that the mean of a sample is at
 *    most (or equal to, respectively) \f$ \mu_0 \f$:
 *    <pre>SELECT (wsr_test(<em>value</em> - <em>mu_0</em> ORDER BY abs(<em>value</em> - <em>mu_0</em>))).* FROM <em>source</em></pre>
 *  - Dependent paired test: Test null hypothesis that the mean difference
 *    between the first and second value in a pair is at most (or equal to,
 *    respectively) \f$ \mu_0 \f$:
 *    <pre>SELECT (wsr_test(<em>first</em> - <em>second</em> - <em>mu_0</em> ORDER BY abs(<em>first</em> - <em>second</em> - <em>mu_0</em>))).* FROM <em>source</em></pre>
 *    If correctly determining ties is important (e.g., you may want to do so
 *    when comparing to software products that take \c first, \c second,
 *    and \c mu_0 as individual parameters), supply the precision parameter.
 *    This can be done as follows:
 *    <pre>SELECT (wsr_test(
    <em>first</em> - <em>second</em> - <em>mu_0</em>,
    3 * 2^(-52) * greatest(first, second, mu_0)
    ORDER BY abs(<em>first</em> - <em>second</em> - <em>mu_0</em>)
)).* FROM <em>source</em></pre>
 *    Here \f$ 2^{-52} \f$ is the machine epsilon, which we scale to the
 *    magnitude of the input data and multiply by 3 because we have a sum with
 *    three terms.
 *
 * @note
 *     This aggregate must be used as an ordered aggregate
 *     (<tt>ORDER BY abs(\em value)</tt>) and will raise an exception if the
 *     absolute values of its first argument are not ordered.
 */
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.wsr_test(
    /*+ "value" */ DOUBLE PRECISION,
    /*+ "precision" */ DOUBLE PRECISION /*+ DEFAULT -1 */
) (
    SFUNC=MADLIB_SCHEMA.wsr_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
    INITCOND='{0,0,0,0,0,0,0,0,0}'
);
!>)

m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.wsr_test(
    /*+ value */ DOUBLE PRECISION
) (
    SFUNC=MADLIB_SCHEMA.wsr_test_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.wsr_test_final,
    INITCOND='{0,0,0,0,0,0,0,0,0}'
);
!>)

CREATE TYPE MADLIB_SCHEMA.one_way_anova_result AS (
    sum_squares_between DOUBLE PRECISION,
    sum_squares_within DOUBLE PRECISION,
    df_between BIGINT,
    df_within BIGINT,
    mean_squares_between DOUBLE PRECISION,
    mean_squares_within DOUBLE PRECISION,
    statistic DOUBLE PRECISION,
    p_value DOUBLE PRECISION
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_transition(
    state DOUBLE PRECISION[],
    "group" INTEGER,
    value DOUBLE PRECISION)
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE
STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C
IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.one_way_anova_final(
    state DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.one_way_anova_result
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief Perform one-way analysis of variance
 *
 * Given realizations
 * \f$ x_{1,1}, \dots,
x_{1, n_1}, x_{2,1}, \dots, x_{2,n_2}, \dots, x_{k,n_k} \f$
 * of i.i.d. random variables \f$ X_{i,j} \sim N(\mu_i, \sigma^2) \f$ with
 * unknown parameters \f$ \mu_1, \dots, \mu_k \f$ and \f$ \sigma^2 \f$, test the
 * null hypothesis \f$ H_0 : \mu_1 = \dots = \mu_k \f$.
 *
 * @param group Group which \c value is from. Note that \c group can assume
 *     arbitrary values; it is not limited to a contiguous range of integers.
 * @param value Value of random variate \f$ x_{i,j} \f$
 *
 * @return A composite value as follows. Let \f$ n := \sum_{i=1}^k n_i \f$ be
 *     the total size of all samples. Denote by \f$ \bar x \f$ the grand
 *     \ref sample_mean "mean", by \f$ \overline{x_i} \f$ the group
 *     \ref sample_mean "sample means", and by \f$ s_i^2 \f$ the group
 *     \ref sample_variance "sample variances".
 *  - <tt>sum_squares_between DOUBLE PRECISION</tt> - sum of squares between the
 *    group means, i.e.,
 *    \f$
 *        \mathit{SS}_b = \sum_{i=1}^k n_i (\overline{x_i} - \bar x)^2.
 *    \f$
 *  - <tt>sum_squares_within DOUBLE PRECISION</tt> - sum of squares within the
 *    groups, i.e.,
 *    \f$
 *        \mathit{SS}_w = \sum_{i=1}^k (n_i - 1) s_i^2.
 *    \f$
 *  - <tt>df_between BIGINT</tt> - degrees of freedom for between-group variation \f$ (k-1) \f$
 *  - <tt>df_within BIGINT</tt> - degrees of freedom for within-group variation \f$ (n-k) \f$
 *  - <tt>mean_squares_between DOUBLE PRECISION</tt> - mean square between
 *    groups, i.e.,
 *    \f$
 *        s_b^2 := \frac{\mathit{SS}_b}{k-1}
 *    \f$
 *  - <tt>mean_squares_within DOUBLE PRECISION</tt> - mean square within
 *    groups, i.e.,
 *    \f$
 *        s_w^2 := \frac{\mathit{SS}_w}{n-k}
 *    \f$
 *  - <tt>statistic DOUBLE PRECISION</tt> - Statistic computed as
 *    \f[
 *        f = \frac{s_b^2}{s_w^2}.
 *    \f]
 *    Under \f$ H_0 \f$, this statistic is Fisher F-distributed with
 *    \f$ (k-1) \f$ degrees of freedom in the numerator and \f$ (n-k) \f$
 *    degrees of freedom in the denominator.
 *  - <tt>p_value DOUBLE PRECISION</tt> - p-value, i.e.,
 *    \f$ \Pr[ F \geq f \mid H_0] \f$.
 *
 * @usage
 *  - Test null hypothesis that the means of all samples are equal:
 *    <pre>SELECT (one_way_anova(<em>group</em>, <em>value</em>)).* FROM <em>source</em></pre>
 */
CREATE AGGREGATE MADLIB_SCHEMA.one_way_anova(
    /*+ group */ INTEGER,
    /*+ value */ DOUBLE PRECISION) (

    SFUNC=MADLIB_SCHEMA.one_way_anova_transition,
    STYPE=DOUBLE PRECISION[],
    FINALFUNC=MADLIB_SCHEMA.one_way_anova_final,
    m4_ifdef(<!__GREENPLUM__!>,<!PREFUNC=MADLIB_SCHEMA.one_way_anova_merge_states,!>)
    INITCOND='{0,0}'
);

m4_changequote(<!`!>,<!'!>)
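/*
The quantities documented for one_way_anova (sums of squares, degrees of
freedom, mean squares and the f statistic) can be cross-checked with
standard SQL aggregates. The following is a minimal sketch under stated
assumptions, not the library's implementation: the inline data and the
names source, grp, groups, total and ss are hypothetical, and the
p-value is left out because it requires the F-distribution CDF.

```sql
-- Hypothetical sample data; group ids need not be contiguous (1, 2, 7).
WITH source(grp, value) AS (
    VALUES (1, 4.0), (1, 5.0), (1, 6.0),
           (2, 7.0), (2, 9.0),
           (7, 1.0), (7, 2.0), (7, 3.0)
),
groups AS (
    -- Per-group size n_i, sample mean and sample variance s_i^2
    SELECT count(*) AS n_i,
           avg(value) AS mean_i,
           var_samp(value) AS var_i
    FROM source
    GROUP BY grp
),
total AS (
    -- Total size n and grand mean
    SELECT count(*) AS n, avg(value) AS grand_mean
    FROM source
),
ss AS (
    SELECT sum(n_i * (mean_i - grand_mean) * (mean_i - grand_mean))
               AS ss_between,                      -- SS_b
           sum((n_i - 1) * var_i) AS ss_within,    -- SS_w
           count(*) - 1 AS df_between,             -- k - 1
           max(n) - count(*) AS df_within          -- n - k
    FROM groups CROSS JOIN total
)
SELECT ss_between, ss_within, df_between, df_within,
       ss_between / df_between AS mean_squares_between,
       ss_within / df_within AS mean_squares_within,
       (ss_between / df_between) / (ss_within / df_within) AS f_statistic
FROM ss;
```

Groups with a single observation have an undefined sample variance
(var_samp returns NULL); sum() skips those rows, so they contribute
nothing to ss_within in this sketch.
*/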