logistic.sql_in
00001 /* ----------------------------------------------------------------------- *//**
00002  *
00003  * @file logistic.sql_in
00004  *
00005  * @brief SQL functions for logistic regression
00006  * @date January 2011
00007  *
00008  * @sa For a brief introduction to logistic regression, see the
00009  *     module description \ref grp_logreg.
00010  *
00011  *//* ----------------------------------------------------------------------- */
00012 
00013 m4_include(`SQLCommon.m4') --'
00014 
00015 /**
00016 @addtogroup grp_logreg
00017 
00018 @about
00019 
00020 (Binomial) Logistic regression refers to a stochastic model in which the
00021 conditional mean of the dependent dichotomous variable (usually denoted
00022 \f$ Y \in \{ 0,1 \} \f$) is the logistic function of an affine function of the
00023 vector of independent variables (usually denoted \f$ \boldsymbol x \f$). That
00024 is,
00025 \f[
00026     E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x)
00027 \f]
00028 for some unknown vector of coefficients \f$ \boldsymbol c \f$ and where
00029 \f$ \sigma(x) = \frac{1}{1 + \exp(-x)} \f$ is the logistic function. Logistic
00030 regression finds the vector of coefficients \f$ \boldsymbol c \f$ that maximizes
00031 the likelihood of the observations.
00032 
00033 Let
00034 - \f$ \boldsymbol y \in \{ 0,1 \}^n \f$ denote the vector of observed values
00035   of the dependent variable, with one entry for each of the \f$ n \f$
00036   observations,
00037 - \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
00038   columns and \f$ n \f$ rows, containing all observed vectors of independent
00039   variables \f$ \boldsymbol x_i \f$ as rows.
00040 
00041 By definition,
00042 \f[
00043     P[Y = y_i | \boldsymbol x_i]
00044     =   \sigma((-1)^{1 - y_i} \cdot \boldsymbol c^T \boldsymbol x_i)
00045     \,.
00046 \f]
00047 Maximizing the likelihood
00048 \f$ \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \f$
00049 is equivalent to maximizing the log-likelihood
00050 \f$ \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \f$, which simplifies to
00051 \f[
00052     l(\boldsymbol c) =
00053         -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i}
00054             \cdot \boldsymbol c^T \boldsymbol x_i))
00055     \,.
00056 \f]
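As an illustrative sketch only (hypothetical names, numpy assumed; not MADlib's C implementation), the log-likelihood above can be evaluated directly:

```python
import numpy as np

def log_likelihood(c, X, y):
    """l(c) = -sum_i log(1 + exp((-1)^{y_i} * c^T x_i)) for y_i in {0, 1}."""
    z = X @ c                                  # c^T x_i for every row i
    sign = np.where(y == 1, -1.0, 1.0)         # (-1)^{y_i}
    return -np.sum(np.log1p(np.exp(sign * z)))
```

Using `log1p` keeps the evaluation accurate when the exponential term is tiny.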
00057 The Hessian of this objective is \f$ H = -X^T A X \f$ where
00058 \f$ A = \text{diag}(a_1, \dots, a_n) \f$ is the diagonal matrix with
00059 \f$
00060     a_i = \sigma(\boldsymbol c^T \boldsymbol x_i)
00061           \cdot
00062           \sigma(-\boldsymbol c^T \boldsymbol x_i)
00063     \,.
00064 \f$
00065 Since \f$ H \f$ is negative semi-definite, \f$ l(\boldsymbol c) \f$ is concave,
00066 and maximizing it is a convex optimization problem, for which many techniques
00067 exist. Currently, logistic regression in MADlib can use one of three algorithms:
00068 - Iteratively reweighted least squares (IRLS)
00069 - A conjugate-gradient approach, known in the literature as the Fletcher-Reeves
00070   method, where we use the Hestenes-Stiefel rule for calculating the step
00071   size.
00072 - Incremental gradient descent, also known in the literature as incremental
00073   gradient methods or stochastic gradient descent.
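For intuition only (a numpy sketch with hypothetical names, not the shipped C implementation), one IRLS/Newton step follows directly from the gradient \f$ X^T (\boldsymbol y - \sigma(X \boldsymbol c)) \f$ and the Hessian \f$ -X^T A X \f$:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_step(c, X, y):
    """One IRLS (Newton) step: c + (X^T A X)^{-1} X^T (y - sigma(X c))."""
    p = sigma(X @ c)
    A = np.diag(p * (1.0 - p))                 # a_i = sigma(z_i) * sigma(-z_i)
    grad = X.T @ (y - p)                       # gradient of l(c) for y in {0, 1}
    return c + np.linalg.solve(X.T @ A @ X, grad)
```

Each step solves a weighted least-squares system, which is where the method's name comes from.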
00074 
00075 We estimate the standard error for coefficient \f$ i \f$ as
00076 \f[
00077     \mathit{se}(c_i) = \sqrt{\left( (X^T A X)^{-1} \right)_{ii}}
00078     \,.
00079 \f]
00080 The Wald z-statistic is
00081 \f[
00082     z_i = \frac{c_i}{\mathit{se}(c_i)}
00083     \,.
00084 \f]
00085 
00086 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
00087 the assumptions inherent in the Wald test) of seeing a value at least as extreme
00088 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
00089 true. Letting \f$ F \f$ denote the cumulative distribution function of a standard
00090 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is
00091 therefore
00092 \f[
00093     p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
00094 \f]
00095 where \f$ Z \f$ is a standard normally distributed random variable.
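The standard errors, z-statistics, and p-values above can be sketched together in a few lines of Python (hypothetical names, numpy assumed; shown only to make the formulas concrete):

```python
import math
import numpy as np

def wald_stats(c, X):
    """Standard errors, Wald z-statistics, and two-sided p-values at c."""
    p = 1.0 / (1.0 + np.exp(-(X @ c)))
    A = np.diag(p * (1.0 - p))                 # a_i = sigma(z_i) * sigma(-z_i)
    cov = np.linalg.inv(X.T @ A @ X)           # asymptotic covariance of c
    se = np.sqrt(np.diag(cov))                 # se(c_i) = sqrt(((X^T A X)^{-1})_{ii})
    z = c / se
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))  # standard-normal CDF
    pvals = np.array([2.0 * (1.0 - Phi(abs(t))) for t in z])    # 2 (1 - F(|z_i|))
    return se, z, pvals
```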
00096 
00097 The odds ratio for coefficient \f$ i \f$ is estimated as \f$ \exp(c_i) \f$.
00098 
00099 The condition number is computed as \f$ \kappa(X^T A X) \f$ during the iteration
00100 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
00101 the coefficients of the previous iteration). A large condition number (say, more
00102 than 1000) indicates the presence of significant multicollinearity.
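To illustrate the diagnostic (a hypothetical numpy sketch, not the internal computation), \f$ \kappa(X^T A X) \f$ blows up when two independent variables are nearly collinear:

```python
import numpy as np

def condition_number(c, X):
    """kappa(X^T A X); values above roughly 1000 hint at multicollinearity."""
    p = 1.0 / (1.0 + np.exp(-(X @ c)))
    a = p * (1.0 - p)
    M = X.T @ (X * a[:, None])                 # X^T A X without forming diag(A)
    return np.linalg.cond(M)
```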
00103 
00104 
00105 @input
00106 
00107 The training data is expected to be of the following form:\n
00108 <pre>{TABLE|VIEW} <em>sourceName</em> (
00109     ...
00110     <em>dependentVariable</em> BOOLEAN,
00111     <em>independentVariables</em> FLOAT8[],
00112     ...
00113 )</pre>
00114 
00115 @usage
00116 - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
00117   statistics:\n
00118   <pre>SELECT \ref logregr_train(
00119     '<em>sourceName</em>', '<em>outName</em>', '<em>dependentVariable</em>',
00120     '<em>independentVariables</em>'[, '<em>grouping_columns</em>'
00121     [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>tolerance</em>
00122     [, <em>verbose</em> ] ] ] ] ]
00123 );</pre>
00124   Output table:
00125   <pre>coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
00126 -----+----------------+---------+---------+----------+-------------+--------------+---------------
00127                                                ...
00128 </pre>
00129 - Get vector of coefficients \f$ \boldsymbol c \f$:\n
00130   <pre>SELECT coef from outName; </pre>
00131 - Get a subset of the output columns, e.g., only the array of coefficients
00132   \f$ \boldsymbol c \f$, the log-likelihood
00133   \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
00134   <pre>SELECT coef, log_likelihood, p_values FROM outName; </pre>
00135 - By default, the option <em>verbose</em> is False. If it is set to True,
00136   warning messages are output to the SQL client for groups that failed.
00137 
00138 @examp
00139 
00140 -# Create the sample data set:
00141 @verbatim
00142 sql> SELECT * FROM data;
00143                   r1                      | val
00144 ---------------------------------------------+-----
00145  {1,3.01789340097457,0.454183579888195}   | t
00146  {1,-2.59380532894284,0.602678326424211}  | f
00147  {1,-1.30643094424158,0.151587064377964}  | t
00148  {1,3.60722299199551,0.963550757616758}   | t
00149  {1,-1.52197745628655,0.0782248834148049} | t
00150  {1,-4.8746574902907,0.345104880165309}   | f
00151 ...
00152 @endverbatim
00153 -# Run the logistic regression function:
00154 @verbatim
00155 sql> \x on
00156 Expanded display is on.
00157 sql> SELECT logregr_train('data', 'out_tbl', 'val', 'r1', Null, 100, 'irls', 0.001);
00158 sql> SELECT * from out_tbl;
00159 coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
00160 log_likelihood | -467.214718489873
00161 std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
00162 z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
00163 p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
00164 odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
00165 condition_no   | 179.186118573205
00166 num_iterations | 9
00167 
00168 @endverbatim
00169 
00170 @literature
00171 
00172 A somewhat random selection of nice write-ups, with valuable pointers into
00173 further literature:
00174 
00175 [1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November
00176     2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf
00177 
00178 [2] Thomas P. Minka: A comparison of numerical optimizers for logistic
00179     regression, 2003 (revised Mar 26, 2007),
00180     http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf
00181 
00182 [3] Paul Komarek, Andrew W. Moore: Making Logistic Regression A Core Data Mining
00183     Tool With TR-IRLS, IEEE International Conference on Data Mining 2005,
00184     pp. 685-688, http://komarix.org/ac/papers/tr-irls.short.pdf
00185 
00186 [4] D. P. Bertsekas: Incremental gradient, subgradient, and proximal methods for
00187     convex optimization: a survey, Technical report, Laboratory for Information
00188     and Decision Systems, 2010,
00189     http://web.mit.edu/dimitrib/www/Incremental_Survey_LIDS.pdf
00190 
00191 [5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro: Robust stochastic
00192     approximation approach to stochastic programming, SIAM Journal on
00193     Optimization, 19(4), 2009, http://www2.isye.gatech.edu/~nemirovs/SIOPT_RSA_2009.pdf
00194 
00195 @sa File logistic.sql_in (documenting the SQL functions)
00196 
00197 @internal
00198 @sa Namespace logistic (documenting the driver/outer loop implemented in
00199     Python), Namespace
00200     \ref madlib::modules::regress documenting the implementation in C++
00201 @endinternal
00202 
00203 */
00204 
00205 DROP TYPE IF EXISTS MADLIB_SCHEMA.__logregr_result;
00206 CREATE TYPE MADLIB_SCHEMA.__logregr_result AS (
00207     coef DOUBLE PRECISION[],
00208     log_likelihood DOUBLE PRECISION,
00209     std_err DOUBLE PRECISION[],
00210     z_stats DOUBLE PRECISION[],
00211     p_values DOUBLE PRECISION[],
00212     odds_ratios DOUBLE PRECISION[],
00213     condition_no DOUBLE PRECISION,
00214     status      INTEGER,
00215     num_iterations INTEGER
00216 );
00217 
00218 ------------------------------------------------------------------------
00219 
00220 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_transition(
00221     DOUBLE PRECISION[],
00222     BOOLEAN,
00223     DOUBLE PRECISION[],
00224     DOUBLE PRECISION[])
00225 RETURNS DOUBLE PRECISION[]
00226 AS 'MODULE_PATHNAME', 'logregr_cg_step_transition'
00227 LANGUAGE C IMMUTABLE;
00228 
00229 ------------------------------------------------------------------------
00230 
00231 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_transition(
00232     DOUBLE PRECISION[],
00233     BOOLEAN,
00234     DOUBLE PRECISION[],
00235     DOUBLE PRECISION[])
00236 RETURNS DOUBLE PRECISION[]
00237 AS 'MODULE_PATHNAME', 'logregr_irls_step_transition'
00238 LANGUAGE C IMMUTABLE;
00239 
00240 ------------------------------------------------------------------------
00241 
00242 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_transition(
00243     DOUBLE PRECISION[],
00244     BOOLEAN,
00245     DOUBLE PRECISION[],
00246     DOUBLE PRECISION[])
00247 RETURNS DOUBLE PRECISION[]
00248 AS 'MODULE_PATHNAME', 'logregr_igd_step_transition'
00249 LANGUAGE C IMMUTABLE;
00250 
00251 ------------------------------------------------------------------------
00252 
00253 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_merge_states(
00254     state1 DOUBLE PRECISION[],
00255     state2 DOUBLE PRECISION[])
00256 RETURNS DOUBLE PRECISION[]
00257 AS 'MODULE_PATHNAME', 'logregr_cg_step_merge_states'
00258 LANGUAGE C IMMUTABLE STRICT;
00259 
00260 ------------------------------------------------------------------------
00261 
00262 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_merge_states(
00263     state1 DOUBLE PRECISION[],
00264     state2 DOUBLE PRECISION[])
00265 RETURNS DOUBLE PRECISION[]
00266 AS 'MODULE_PATHNAME', 'logregr_irls_step_merge_states'
00267 LANGUAGE C IMMUTABLE STRICT;
00268 
00269 ------------------------------------------------------------------------
00270 
00271 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_merge_states(
00272     state1 DOUBLE PRECISION[],
00273     state2 DOUBLE PRECISION[])
00274 RETURNS DOUBLE PRECISION[]
00275 AS 'MODULE_PATHNAME', 'logregr_igd_step_merge_states'
00276 LANGUAGE C IMMUTABLE STRICT;
00277 
00278 ------------------------------------------------------------------------
00279 
00280 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_final(
00281     state DOUBLE PRECISION[])
00282 RETURNS DOUBLE PRECISION[]
00283 AS 'MODULE_PATHNAME', 'logregr_cg_step_final'
00284 LANGUAGE C IMMUTABLE STRICT;
00285 
00286 ------------------------------------------------------------------------
00287 
00288 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_final(
00289     state DOUBLE PRECISION[])
00290 RETURNS DOUBLE PRECISION[]
00291 AS 'MODULE_PATHNAME', 'logregr_irls_step_final'
00292 LANGUAGE C IMMUTABLE STRICT;
00293 
00294 ------------------------------------------------------------------------
00295 
00296 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_final(
00297     state DOUBLE PRECISION[])
00298 RETURNS DOUBLE PRECISION[]
00299 AS 'MODULE_PATHNAME', 'logregr_igd_step_final'
00300 LANGUAGE C IMMUTABLE STRICT;
00301 
00302 ------------------------------------------------------------------------
00303 
00304 /**
00305  * @internal
00306  * @brief Perform one iteration of the conjugate-gradient method for computing
00307  *        logistic regression
00308  */
00309 CREATE AGGREGATE MADLIB_SCHEMA.__logregr_cg_step(
00310     /*+ y */ BOOLEAN,
00311     /*+ x */ DOUBLE PRECISION[],
00312     /*+ previous_state */ DOUBLE PRECISION[]) (
00313 
00314     STYPE=DOUBLE PRECISION[],
00315     SFUNC=MADLIB_SCHEMA.__logregr_cg_step_transition,
00316     m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_cg_step_merge_states,')
00317     FINALFUNC=MADLIB_SCHEMA.__logregr_cg_step_final,
00318     INITCOND='{0,0,0,0,0,0}'
00319 );
00320 
00321 ------------------------------------------------------------------------
00322 
00323 /**
00324  * @internal
00325  * @brief Perform one iteration of the iteratively-reweighted-least-squares
00326  *        method for computing logistic regression
00327  */
00328 CREATE AGGREGATE MADLIB_SCHEMA.__logregr_irls_step(
00329     /*+ y */ BOOLEAN,
00330     /*+ x */ DOUBLE PRECISION[],
00331     /*+ previous_state */ DOUBLE PRECISION[]) (
00332 
00333     STYPE=DOUBLE PRECISION[],
00334     SFUNC=MADLIB_SCHEMA.__logregr_irls_step_transition,
00335     m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_irls_step_merge_states,')
00336     FINALFUNC=MADLIB_SCHEMA.__logregr_irls_step_final,
00337     INITCOND='{0,0,0,0}'
00338 );
00339 
00340 ------------------------------------------------------------------------
00341 
00342 /**
00343  * @internal
00344  * @brief Perform one iteration of the incremental gradient
00345  *        method for computing logistic regression
00346  */
00347 CREATE AGGREGATE MADLIB_SCHEMA.__logregr_igd_step(
00348     /*+ y */ BOOLEAN,
00349     /*+ x */ DOUBLE PRECISION[],
00350     /*+ previous_state */ DOUBLE PRECISION[]) (
00351 
00352     STYPE=DOUBLE PRECISION[],
00353     SFUNC=MADLIB_SCHEMA.__logregr_igd_step_transition,
00354     m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_igd_step_merge_states,')
00355     FINALFUNC=MADLIB_SCHEMA.__logregr_igd_step_final,
00356     INITCOND='{0,0,0,0,0}'
00357 );
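Each optimizer above is packaged as a user-defined aggregate built from a transition function (folds in one row), an optional merge function (combines partial states from parallel segments, used on Greenplum via <tt>prefunc</tt>), and a final function. A toy Python sketch of that protocol, using a running mean as a stand-in for the optimizer's packed state array (illustrative only):

```python
from functools import reduce

# State: (count, sum) -- a stand-in for the real transition-state array.

def transition(state, row):
    n, s = state
    return (n + 1, s + row)

def merge(state1, state2):
    # Combine two partial states computed on disjoint chunks of the data.
    return (state1[0] + state2[0], state1[1] + state2[1])

def final(state):
    n, s = state
    return s / n if n else None

def aggregate(chunks):
    # Fold each chunk independently, then merge and finalize -- the same
    # shape as SFUNC / prefunc / FINALFUNC in the CREATE AGGREGATE calls above.
    partials = [reduce(transition, chunk, (0, 0.0)) for chunk in chunks]
    return final(reduce(merge, partials))
```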
00358 
00359 ------------------------------------------------------------------------
00360 
00361 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_distance(
00362     /*+ state1 */ DOUBLE PRECISION[],
00363     /*+ state2 */ DOUBLE PRECISION[])
00364 RETURNS DOUBLE PRECISION AS
00365 'MODULE_PATHNAME', 'internal_logregr_cg_step_distance'
00366 LANGUAGE c IMMUTABLE STRICT;
00367 
00368 ------------------------------------------------------------------------
00369 
00370 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_result(
00371     /*+ state */ DOUBLE PRECISION[])
00372 RETURNS MADLIB_SCHEMA.__logregr_result AS
00373 'MODULE_PATHNAME', 'internal_logregr_cg_result'
00374 LANGUAGE c IMMUTABLE STRICT;
00375 
00376 ------------------------------------------------------------------------
00377 
00378 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_distance(
00379     /*+ state1 */ DOUBLE PRECISION[],
00380     /*+ state2 */ DOUBLE PRECISION[])
00381 RETURNS DOUBLE PRECISION AS
00382 'MODULE_PATHNAME', 'internal_logregr_irls_step_distance'
00383 LANGUAGE c IMMUTABLE STRICT;
00384 
00385 ------------------------------------------------------------------------
00386 
00387 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_result(
00388     /*+ state */ DOUBLE PRECISION[])
00389 RETURNS MADLIB_SCHEMA.__logregr_result AS
00390 'MODULE_PATHNAME', 'internal_logregr_irls_result'
00391 LANGUAGE c IMMUTABLE STRICT;
00392 
00393 ------------------------------------------------------------------------
00394 
00395 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_distance(
00396     /*+ state1 */ DOUBLE PRECISION[],
00397     /*+ state2 */ DOUBLE PRECISION[])
00398 RETURNS DOUBLE PRECISION AS
00399 'MODULE_PATHNAME', 'internal_logregr_igd_step_distance'
00400 LANGUAGE c IMMUTABLE STRICT;
00401 
00402 ------------------------------------------------------------------------
00403 
00404 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_result(
00405     /*+ state */ DOUBLE PRECISION[])
00406 RETURNS MADLIB_SCHEMA.__logregr_result AS
00407 'MODULE_PATHNAME', 'internal_logregr_igd_result'
00408 LANGUAGE c IMMUTABLE STRICT;
00409 
00410 ------------------------------------------------------------------------
00411 
00412 /**
00413  * @brief Compute logistic-regression coefficients and diagnostic statistics
00414  *
00415  * To include an intercept in the model, set one coordinate in the
00416  * <tt>independentVariables</tt> array to 1.
00417  *
00418  * @param tbl_source Name of the source relation containing the training data
00419  * @param tbl_output Name of the output relation to store the model results
00420  *                   Columns of the output relation are as follows:
00421  *                    - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
00422  *                    - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
00423  *                    - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
00424  *                      \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
00425  *                    - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
00426  *                    - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
00427  *                    - <tt>odds_ratios FLOAT8[]</tt>: Array of odds ratios,
00428  *                      \f$ \mathit{odds}(c_1), \dots, \mathit{odds}(c_k) \f$
00429  *                    - <tt>condition_no FLOAT8</tt> - The condition number of
00430  *                          matrix \f$ X^T A X \f$ during the iteration
00431  *                          immediately <em>preceding</em> convergence
00432  *                          (i.e., \f$ A \f$ is computed using the coefficients
00433  *                          of the previous iteration)
00434  * @param dep_col Name of the dependent column (of type BOOLEAN)
00435  * @param ind_col Name of the independent column (of type DOUBLE
00436  *        PRECISION[])
00437  * @param grouping_col Comma delimited list of column names to group-by
00438  * @param max_iter The maximum number of iterations
00439  * @param optimizer The optimizer to use (either
00440  *        <tt>'irls'</tt>/<tt>'newton'</tt> for iteratively reweighted least
00441  *        squares or <tt>'cg'</tt> for conjugate gradient)
00442  * @param tolerance The difference between log-likelihood values in successive
00443  *         iterations that indicates convergence. This value must be
00444  *         non-negative; a value of zero disables the convergence criterion,
00445  *         so that execution stops only after \c max_iter iterations.
00446  * @param verbose If true, any error or warning message will be printed to the
00447  *         console (irrespective of the 'client_min_messages' set by server).
00448  *         If false, no error/warning message is printed to console.
00449  *
00450  *
00451  * @usage
00452  *  - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
00453  *    statistics:\n
00454  *    <pre>SELECT logregr_train('<em>sourceName</em>', '<em>outName</em>',
00455  *           '<em>dependentVariable</em>', '<em>independentVariables</em>');
00456  *          SELECT * from outName;
00457  *    </pre>
00458  *  - Get vector of coefficients \f$ \boldsymbol c \f$:\n
00459  *    <pre>SELECT coef from outName;</pre>
00460  *  - Get a subset of the output columns, e.g., only the array of coefficients
00461  *    \f$ \boldsymbol c \f$, the log-likelihood
00462  *    \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
00463  *    <pre>SELECT coef, log_likelihood, p_values FROM outName;</pre>
00464  *
00465  * @note This function starts an iterative algorithm. It is not an aggregate
00466  *       function. Source, output, and column names have to be passed as strings
00467  *       (due to limitations of the SQL syntax).
00468  *
00469  * @internal
00470  * @sa This function is a wrapper for logistic::compute_logregr(), which
00471  *     sets the default values.
00472  */
00473 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_train (
00474     tbl_source          VARCHAR,
00475     tbl_output          VARCHAR,
00476     dep_col             VARCHAR,
00477     ind_col             VARCHAR,
00478     grouping_col        VARCHAR,
00479     max_iter            INTEGER,
00480     optimizer           VARCHAR,
00481     tolerance           DOUBLE PRECISION,
00482     verbose             BOOLEAN
00483 ) RETURNS VOID AS $$
00484 PythonFunction(regress, logistic, logregr_train)
00485 $$ LANGUAGE plpythonu;
00486 
00487 ------------------------------------------------------------------------
00488 
00489 CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
00490     tbl_source          VARCHAR,
00491     tbl_output          VARCHAR,
00492     dep_col             VARCHAR,
00493     ind_col             VARCHAR)
00494 RETURNS VOID AS $$
00495     SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, NULL::VARCHAR, 20, 'irls', 0.0001, False);
00496 $$ LANGUAGE sql VOLATILE;
00497 
00498 ------------------------------------------------------------------------
00499 
00500 CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
00501     tbl_source          VARCHAR,
00502     tbl_output          VARCHAR,
00503     dep_col             VARCHAR,
00504     ind_col             VARCHAR,
00505     grouping_col        VARCHAR)
00506 RETURNS VOID AS $$
00507     SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, 20, 'irls', 0.0001, False);
00508 $$ LANGUAGE sql VOLATILE;
00509 
00510 ------------------------------------------------------------------------
00511 
00512 CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
00513     tbl_source          VARCHAR,
00514     tbl_output          VARCHAR,
00515     dep_col             VARCHAR,
00516     ind_col             VARCHAR,
00517     grouping_col        VARCHAR,
00518     max_iter            INTEGER)
00519 RETURNS VOID AS $$
00520     SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, 'irls', 0.0001, False);
00521 $$ LANGUAGE sql VOLATILE;
00522 
00523 ------------------------------------------------------------------------
00524 
00525 CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
00526     tbl_source          VARCHAR,
00527     tbl_output          VARCHAR,
00528     dep_col             VARCHAR,
00529     ind_col             VARCHAR,
00530     grouping_col        VARCHAR,
00531     max_iter            INTEGER,
00532     optimizer           VARCHAR)
00533 RETURNS VOID AS $$
00534     SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, 0.0001, False);
00535 $$ LANGUAGE sql VOLATILE;
00536 
00537 ------------------------------------------------------------------------
00538 
00539 CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
00540     tbl_source          VARCHAR,
00541     tbl_output          VARCHAR,
00542     dep_col             VARCHAR,
00543     ind_col             VARCHAR,
00544     grouping_col        VARCHAR,
00545     max_iter            INTEGER,
00546     optimizer           VARCHAR,
00547     tolerance           DOUBLE PRECISION)
00548 RETURNS VOID AS $$
00549     SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, $8, False);
00550 $$ LANGUAGE sql VOLATILE;
00551 
00552 ------------------------------------------------------------------------
00553 
00554 /**
00555  * @brief Evaluate the usual logistic function in an under-/overflow-safe way
00556  *
00557  * @param x
00558  * @returns \f$ \frac{1}{1 + \exp(-x)} \f$
00559  *
00560  * Evaluating this expression directly can lead to under- or overflows.
00561  * This function performs the evaluation in a safe manner, making use of the
00562  * following observations:
00563  *
00564  * In order for the outcome of \f$ \exp(x) \f$ to be within the range of the
00565  * minimum positive double-precision number (i.e., \f$ 2^{-1074} \f$) and the
00566  * maximum positive double-precision number (i.e.,
00567  * \f$ (1 + (1 - 2^{-52})) \cdot 2^{1023} \f$), \f$ x \f$ has to be within the
00568  * natural logarithm of these numbers, so roughly in between -744 and 709.
00569  * However, \f$ 1 + \exp(x) \f$ will just evaluate to 1 if \f$ \exp(x) \f$ is
00570  * less than the machine epsilon (i.e., \f$ 2^{-52} \f$) or, equivalently, if
00571  * \f$ x \f$ is less than the natural logarithm of that; i.e., in any case if
00572  * \f$ x \f$ is less than -37.
00573  * Note that taking the reciprocal of the largest double-precision number will
00574  * not cause an underflow. Hence, no further checks are necessary.
00575  */
00576 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logistic(x DOUBLE PRECISION)
00577 RETURNS DOUBLE PRECISION
00578 LANGUAGE sql
00579 AS $$
00580    SELECT CASE WHEN -$1 < -37 THEN 1
00581                WHEN -$1 > 709 THEN 0
00582                ELSE 1 / (1 + exp(-$1))
00583           END;
00584 $$;
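The same guards can be sketched in Python (illustrative only; the function name mirrors the SQL one above):

```python
import math

def logistic(x):
    """Overflow-safe 1 / (1 + exp(-x)), mirroring the SQL CASE guards."""
    if x > 37.0:        # exp(-x) < 2^-52, so 1 + exp(-x) rounds to 1
        return 1.0
    if x < -709.0:      # exp(-x) would overflow a double
        return 0.0
    return 1.0 / (1.0 + math.exp(-x))
```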
00585