MADlib 0.7 User Documentation
/* ----------------------------------------------------------------------- *//**
 *
 * @file logistic.sql_in
 *
 * @brief SQL functions for logistic regression
 * @date January 2011
 *
 * @sa For a brief introduction to logistic regression, see the
 *     module description \ref grp_logreg.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4') --'

/**
@addtogroup grp_logreg

@about

(Binomial) Logistic regression refers to a stochastic model in which the
conditional mean of the dependent dichotomous variable (usually denoted
\f$ Y \in \{ 0,1 \} \f$) is the logistic function of an affine function of the
vector of independent variables (usually denoted \f$ \boldsymbol x \f$). That
is,
\f[
    E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x)
\f]
for some unknown vector of coefficients \f$ \boldsymbol c \f$ and where
\f$ \sigma(x) = \frac{1}{1 + \exp(-x)} \f$ is the logistic function. Logistic
regression finds the vector of coefficients \f$ \boldsymbol c \f$ that maximizes
the likelihood of the observations.

Let
- \f$ \boldsymbol y \in \{ 0,1 \}^n \f$ denote the vector of observed dependent
  variables, with \f$ n \f$ rows, containing the observed values of the
  dependent variable,
- \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
  columns and \f$ n \f$ rows, containing all observed vectors of independent
  variables \f$ \boldsymbol x_i \f$ as rows.

By definition,
\f[
    P[Y = y_i \mid \boldsymbol x_i]
    =   \sigma((-1)^{y_i + 1} \cdot \boldsymbol c^T \boldsymbol x_i)
    \,.
\f]
Maximizing the likelihood
\f$ \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \f$
is equivalent to maximizing the log-likelihood
\f$ \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \f$, which simplifies to
\f[
    l(\boldsymbol c) =
        -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i}
            \cdot \boldsymbol c^T \boldsymbol x_i))
    \,.
\f]
The Hessian of this objective is \f$ H = -X^T A X \f$ where
\f$ A = \text{diag}(a_1, \dots, a_n) \f$ is the diagonal matrix with
\f$
    a_i = \sigma(\boldsymbol c^T \boldsymbol x_i)
          \cdot
          \sigma(-\boldsymbol c^T \boldsymbol x_i)
    \,.
\f$
Since \f$ H \f$ is negative semi-definite, \f$ l(\boldsymbol c) \f$ is concave,
and maximizing it is a convex optimization problem. There are many techniques
for solving convex optimization problems. Currently, logistic regression in
MADlib can use one of three algorithms:
- Iteratively Reweighted Least Squares
- A conjugate-gradient approach, also known as the Fletcher-Reeves method in the
  literature, where we use the Hestenes-Stiefel rule for calculating the step
  size.
- Incremental gradient descent, also known as incremental gradient methods or
  stochastic gradient descent in the literature.

We estimate the standard error for coefficient \f$ i \f$ as
\f[
    \mathit{se}(c_i) = \sqrt{ \left( (X^T A X)^{-1} \right)_{ii} }
    \,.
\f]
The Wald z-statistic is
\f[
    z_i = \frac{c_i}{\mathit{se}(c_i)}
    \,.
\f]

The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
the assumptions inherent in the Wald test) of seeing a value at least as extreme
as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
true. Letting \f$ F \f$ denote the cumulative distribution function of a
standard normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$
is therefore
\f[
    p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
\f]
where \f$ Z \f$ is a standard normally distributed random variable.

The odds ratio for coefficient \f$ i \f$ is estimated as \f$ \exp(c_i) \f$.

The condition number is computed as \f$ \kappa(X^T A X) \f$ during the iteration
immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
the coefficients of the previous iteration). A large condition number (say, more
than 1000) indicates the presence of significant multicollinearity.


@input

The training data is expected to be of the following form:\n
<pre>{TABLE|VIEW} <em>sourceName</em> (
    ...
    <em>dependentVariable</em> BOOLEAN,
    <em>independentVariables</em> FLOAT8[],
    ...
)</pre>

@usage
- Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
  statistics:\n
  <pre>SELECT \ref logregr_train(
    '<em>sourceName</em>', '<em>outName</em>', '<em>dependentVariable</em>',
    '<em>independentVariables</em>'[, '<em>grouping_columns</em>'
    [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em>
    [, <em>verbose</em> ]] ] ] ]
);</pre>
  Output table:
  <pre>coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
-----+----------------+---------+---------+----------+-------------+--------------+---------------
...
</pre>
- Get vector of coefficients \f$ \boldsymbol c \f$:\n
  <pre>SELECT coef FROM outName;</pre>
- Get a subset of the output columns, e.g., only the array of coefficients
  \f$ \boldsymbol c \f$, the log-likelihood
  \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
  <pre>SELECT coef, log_likelihood, p_values FROM outName;</pre>
- By default, the option <em>verbose</em> is False. If it is set to True,
  warning messages will be output to the SQL client for groups that failed.

@examp

-# Create the sample data set:
@verbatim
sql> SELECT * FROM data;
                     r1                      | val
---------------------------------------------+-----
 {1,3.01789340097457,0.454183579888195}      | t
 {1,-2.59380532894284,0.602678326424211}     | f
 {1,-1.30643094424158,0.151587064377964}     | t
 {1,3.60722299199551,0.963550757616758}      | t
 {1,-1.52197745628655,0.0782248834148049}    | t
 {1,-4.8746574902907,0.345104880165309}      | f
...
@endverbatim
-# Run the logistic regression function:
@verbatim
sql> \x on
Expanded display is on.
sql> SELECT logregr_train('data', 'out_tbl', 'val', 'r1', Null, 100, 'irls', 0.001);
sql> SELECT * FROM out_tbl;
coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
log_likelihood | -467.214718489873
std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
condition_no   | 179.186118573205
num_iterations | 9
@endverbatim

@literature

A somewhat random selection of nice write-ups, with valuable pointers into
further literature:

[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November
    2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf

[2] Thomas P. Minka: A comparison of numerical optimizers for logistic
    regression, 2003 (revised Mar 26, 2007),
    http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf

[3] Paul Komarek, Andrew W. Moore: Making Logistic Regression A Core Data Mining
    Tool With TR-IRLS, IEEE International Conference on Data Mining 2005,
    pp. 685-688, http://komarix.org/ac/papers/tr-irls.short.pdf

[4] D. P. Bertsekas: Incremental gradient, subgradient, and proximal methods for
    convex optimization: a survey, Technical report, Laboratory for Information
    and Decision Systems, 2010,
    http://web.mit.edu/dimitrib/www/Incremental_Survey_LIDS.pdf

[5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro: Robust stochastic
    approximation approach to stochastic programming, SIAM Journal on
    Optimization, 19(4), 2009,
    http://www2.isye.gatech.edu/~nemirovs/SIOPT_RSA_2009.pdf

@sa File logistic.sql_in (documenting the SQL functions)

@internal
@sa Namespace logistic (documenting the driver/outer loop implemented in
    Python), Namespace
    \ref madlib::modules::regress documenting the implementation in C++
@endinternal

*/

DROP TYPE IF EXISTS MADLIB_SCHEMA.__logregr_result;
CREATE TYPE MADLIB_SCHEMA.__logregr_result AS (
    coef            DOUBLE PRECISION[],
    log_likelihood  DOUBLE PRECISION,
    std_err         DOUBLE PRECISION[],
    z_stats         DOUBLE PRECISION[],
    p_values        DOUBLE PRECISION[],
    odds_ratios     DOUBLE PRECISION[],
    condition_no    DOUBLE PRECISION,
    status          INTEGER,
    num_iterations  INTEGER
);

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_transition(
    DOUBLE PRECISION[],
    BOOLEAN,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_cg_step_transition'
LANGUAGE C IMMUTABLE;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_transition(
    DOUBLE PRECISION[],
    BOOLEAN,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_irls_step_transition'
LANGUAGE C IMMUTABLE;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_transition(
    DOUBLE PRECISION[],
    BOOLEAN,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_igd_step_transition'
LANGUAGE C IMMUTABLE;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_cg_step_merge_states'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_irls_step_merge_states'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_igd_step_merge_states'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_cg_step_final'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_irls_step_final'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_igd_step_final'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

/**
 * @internal
 * @brief Perform one iteration of the conjugate-gradient method for computing
 *        logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_cg_step(
    /*+ y */ BOOLEAN,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.__logregr_cg_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_cg_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.__logregr_cg_step_final,
    INITCOND='{0,0,0,0,0,0}'
);

------------------------------------------------------------------------

/**
 * @internal
 * @brief Perform one iteration of the iteratively-reweighted-least-squares
 *        method for computing logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_irls_step(
    /*+ y */ BOOLEAN,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.__logregr_irls_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_irls_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.__logregr_irls_step_final,
    INITCOND='{0,0,0,0}'
);

------------------------------------------------------------------------

/**
 * @internal
 * @brief Perform one iteration of the incremental gradient
 *        method for computing logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_igd_step(
    /*+ y */ BOOLEAN,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.__logregr_igd_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_igd_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.__logregr_igd_step_final,
    INITCOND='{0,0,0,0,0}'
);

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', 'internal_logregr_cg_step_distance'
LANGUAGE c IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.__logregr_result AS
'MODULE_PATHNAME', 'internal_logregr_cg_result'
LANGUAGE c IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', 'internal_logregr_irls_step_distance'
LANGUAGE c IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.__logregr_result AS
'MODULE_PATHNAME', 'internal_logregr_irls_result'
LANGUAGE c IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', 'internal_logregr_igd_step_distance'
LANGUAGE c IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.__logregr_result AS
'MODULE_PATHNAME', 'internal_logregr_igd_result'
LANGUAGE c IMMUTABLE STRICT;

------------------------------------------------------------------------

/**
 * @brief Compute logistic-regression coefficients and diagnostic statistics
 *
 * To include an intercept in the model, set one coordinate in the
 * <tt>independentVariables</tt> array to 1.
 *
 * @param tbl_source Name of the source relation containing the training data
 * @param tbl_output Name of the output relation to store the model results.
 *        Columns of the output relation are as follows:
 *        - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
 *        - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
 *        - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
 *          \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
 *        - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
 *        - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
 *        - <tt>odds_ratios FLOAT8[]</tt> - Array of odds ratios,
 *          \f$ \mathit{odds}(c_1), \dots, \mathit{odds}(c_k) \f$
 *        - <tt>condition_no FLOAT8</tt> - The condition number of
 *          matrix \f$ X^T A X \f$ during the iteration
 *          immediately <em>preceding</em> convergence
 *          (i.e., \f$ A \f$ is computed using the coefficients
 *          of the previous iteration)
 * @param dep_col Name of the dependent column (of type BOOLEAN)
 * @param ind_col Name of the independent column (of type DOUBLE PRECISION[])
 * @param grouping_col Comma-delimited list of column names to group by
 * @param max_iter The maximum number of iterations
 * @param optimizer The optimizer to use (either
 *        <tt>'irls'</tt>/<tt>'newton'</tt> for iteratively reweighted least
 *        squares or <tt>'cg'</tt> for conjugate gradient)
 * @param tolerance The difference between log-likelihood values in successive
 *        iterations that indicates convergence. This value should be
 *        non-negative. A value of zero disables the convergence criterion,
 *        so that execution stops only after \c max_iter iterations.
 * @param verbose If true, any error or warning message will be printed to the
 *        console (irrespective of the 'client_min_messages' set by the server).
 *        If false, no error/warning message is printed to the console.
 *
 * @usage
 *  - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
 *    statistics:\n
 *    <pre>SELECT logregr_train('<em>sourceName</em>', '<em>outName</em>',
 *        '<em>dependentVariable</em>', '<em>independentVariables</em>');
 *    SELECT * FROM outName;
 *    </pre>
 *  - Get vector of coefficients \f$ \boldsymbol c \f$:\n
 *    <pre>SELECT coef FROM outName;</pre>
 *  - Get a subset of the output columns, e.g., only the array of coefficients
 *    \f$ \boldsymbol c \f$, the log-likelihood
 *    \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
 *    <pre>SELECT coef, log_likelihood, p_values FROM outName;</pre>
 *
 * @note This function starts an iterative algorithm. It is not an aggregate
 *       function. Source, output, and column names have to be passed as strings
 *       (due to limitations of the SQL syntax).
 *
 * @internal
 * @sa This function is a wrapper for logistic::compute_logregr(), which
 *     sets the default values.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER,
    optimizer       VARCHAR,
    tolerance       DOUBLE PRECISION,
    verbose         BOOLEAN
) RETURNS VOID AS $$
PythonFunction(regress, logistic, logregr_train)
$$ LANGUAGE plpythonu;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR)
RETURNS VOID AS $$
SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, NULL::VARCHAR, 20, 'irls', 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR)
RETURNS VOID AS $$
SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, 20, 'irls', 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER)
RETURNS VOID AS $$
SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, 'irls', 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER,
    optimizer       VARCHAR)
RETURNS VOID AS $$
SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER,
    optimizer       VARCHAR,
    tolerance       DOUBLE PRECISION)
RETURNS VOID AS $$
SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, $8, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

/**
 * @brief Evaluate the usual logistic function in an under-/overflow-safe way
 *
 * @param x
 * @returns \f$ \frac{1}{1 + \exp(-x)} \f$
 *
 * Evaluating this expression directly can lead to under- or overflows.
 * This function performs the evaluation in a safe manner, making use of the
 * following observations:
 *
 * In order for the outcome of \f$ \exp(x) \f$ to be within the range of the
 * minimum positive double-precision number (i.e., \f$ 2^{-1074} \f$) and the
 * maximum positive double-precision number (i.e.,
 * \f$ (1 + (1 - 2^{-52})) \cdot 2^{1023} \f$), \f$ x \f$ has to be within the
 * natural logarithm of these numbers, so roughly in between -744 and 709.
 * However, \f$ 1 + \exp(x) \f$ will just evaluate to 1 if \f$ \exp(x) \f$ is
 * less than the machine epsilon (i.e., \f$ 2^{-52} \f$) or, equivalently, if
 * \f$ x \f$ is less than the natural logarithm of that; i.e., in any case if
 * \f$ x \f$ is less than -37.
 * Note that taking the reciprocal of the largest double-precision number will
 * not cause an underflow. Hence, no further checks are necessary.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logistic(x DOUBLE PRECISION)
RETURNS DOUBLE PRECISION
LANGUAGE sql
AS $$
   SELECT CASE WHEN -$1 < -37 THEN 1
               WHEN -$1 > 709 THEN 0
               ELSE 1 / (1 + exp(-$1))
          END;
$$;
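The guard values used by the SQL function above can be checked with a short
Python sketch (illustration only, not part of the MADlib code; the function
name `logistic` simply mirrors the SQL function):

```python
import math

def logistic(x: float) -> float:
    """Under-/overflow-safe evaluation of 1 / (1 + exp(-x)).

    Mirrors the two guards of the SQL implementation:
    - if -x < -37, exp(-x) is below machine epsilon, so the sum
      1 + exp(-x) rounds to 1 and the result is exactly 1;
    - if -x > 709, exp(-x) would overflow a double, but the true
      value of the quotient is indistinguishable from 0.
    """
    if -x < -37:
        return 1.0
    if -x > 709:
        return 0.0
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))      # 0.5
print(logistic(1000.0))   # 1.0, no overflow in exp
print(logistic(-1000.0))  # 0.0, no overflow in exp
```

Without the second guard, evaluating `math.exp(1000.0)` would raise an
`OverflowError` in Python, just as `exp(-$1)` would overflow in SQL.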