/* ----------------------------------------------------------------------- *//**
 *
 * @file logistic.sql_in
 *
 * @brief SQL functions for logistic regression
 * @date January 2011
 *
 * @sa For a brief introduction to logistic regression, see the
 *     module description \ref grp_logreg.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4') --'

/**
@addtogroup grp_logreg

@about

(Binomial) Logistic regression refers to a stochastic model in which the
conditional mean of the dependent dichotomous variable (usually denoted
\f$ Y \in \{ 0,1 \} \f$) is the logistic function of an affine function of the
vector of independent variables (usually denoted \f$ \boldsymbol x \f$). That
is,
\f[
    E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x)
\f]
for some unknown vector of coefficients \f$ \boldsymbol c \f$ and where
\f$ \sigma(x) = \frac{1}{1 + \exp(-x)} \f$ is the logistic function. Logistic
regression finds the vector of coefficients \f$ \boldsymbol c \f$ that
maximizes the likelihood of the observations.

Let
- \f$ \boldsymbol y \in \{ 0,1 \}^n \f$ denote the vector of the \f$ n \f$
  observed values of the dependent variable,
- \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
  columns and \f$ n \f$ rows, containing all observed vectors of independent
  variables \f$ \boldsymbol x_i \f$ as rows.
By definition,
\f[
    P[Y = y_i \mid \boldsymbol x_i]
    =   \sigma((-1)^{1 - y_i} \cdot \boldsymbol c^T \boldsymbol x_i)
    \,.
\f]
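As a sanity check of the sign convention, evaluating the definition at both
values of \f$ y_i \f$ recovers the two familiar cases:
\f[
    P[Y = 1 \mid \boldsymbol x_i] = \sigma(\boldsymbol c^T \boldsymbol x_i),
    \qquad
    P[Y = 0 \mid \boldsymbol x_i] = \sigma(-\boldsymbol c^T \boldsymbol x_i)
    = 1 - \sigma(\boldsymbol c^T \boldsymbol x_i)
    \,.
\f]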
Maximizing the likelihood
\f$ \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \f$
is equivalent to maximizing the log-likelihood
\f$ \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \f$, which simplifies to
\f[
    l(\boldsymbol c) =
        -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i}
            \cdot \boldsymbol c^T \boldsymbol x_i))
    \,.
\f]
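The simplification is the identity
\f$ \log \sigma(z) = -\log(1 + \exp(-z)) \f$ applied term by term:
\f[
    \log \Pr(Y = y_i \mid \boldsymbol x_i)
    = \log \sigma((-1)^{1 - y_i} \cdot \boldsymbol c^T \boldsymbol x_i)
    = -\log(1 + \exp((-1)^{y_i} \cdot \boldsymbol c^T \boldsymbol x_i))
    \,.
\f]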
The Hessian of this objective is \f$ H = -X^T A X \f$ where
\f$ A = \text{diag}(a_1, \dots, a_n) \f$ is the diagonal matrix with
\f[
    a_i = \sigma(\boldsymbol c^T \boldsymbol x_i)
          \cdot
          \sigma(-\boldsymbol c^T \boldsymbol x_i)
    \,.
\f]
Since \f$ H \f$ is negative semi-definite, \f$ l(\boldsymbol c) \f$ is concave,
so maximizing it is a convex optimization problem.
There are many techniques for solving convex optimization problems. Currently,
logistic regression in MADlib can use one of three algorithms:
- Iteratively reweighted least squares
- A conjugate-gradient approach, also known as the Fletcher-Reeves method in
  the literature, where we use the Hestenes-Stiefel rule for calculating the
  step size.
- Incremental gradient descent, also known as incremental gradient methods or
  stochastic gradient descent in the literature.

We estimate the standard error for coefficient \f$ i \f$ as
\f[
    \mathit{se}(c_i) = \sqrt{\left( (X^T A X)^{-1} \right)_{ii}}
    \,.
\f]
The Wald z-statistic is
\f[
    z_i = \frac{c_i}{\mathit{se}(c_i)}
    \,.
\f]

The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
the assumptions inherent in the Wald test) of seeing a value at least as extreme
as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
true. Letting \f$ F \f$ denote the cumulative distribution function of a
standard normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$
is therefore
\f[
    p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
\f]
where \f$ Z \f$ is a standard normally distributed random variable.

The odds ratio for coefficient \f$ i \f$ is estimated as \f$ \exp(c_i) \f$.
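For instance, plugging in the third coefficient of the example run shown below
(\f$ c_3 \approx -0.2373 \f$, \f$ \mathit{se}(c_3) \approx 0.2945 \f$) gives
\f[
    z_3 = \frac{-0.2373}{0.2945} \approx -0.806,
    \qquad
    p_3 = 2 \cdot (1 - F(0.806)) \approx 0.420,
    \qquad
    \exp(c_3) \approx 0.789
    \,,
\f]
matching the <tt>z_stats</tt>, <tt>p_values</tt>, and <tt>odds_ratios</tt>
entries reported there.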

The condition number is computed as \f$ \kappa(X^T A X) \f$ during the iteration
immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
the coefficients of the previous iteration). A large condition number (say, more
than 1000) indicates the presence of significant multicollinearity.
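Since \f$ X^T A X \f$ is symmetric and (in the non-degenerate case) positive
definite, its condition number in the spectral norm is simply the ratio of its
extreme eigenvalues:
\f[
    \kappa(X^T A X) = \frac{\lambda_{\max}(X^T A X)}{\lambda_{\min}(X^T A X)}
    \,.
\f]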
@input

The training data for logistic regression is
expected to be of the following form:\n
<pre>{TABLE|VIEW} <em>sourceName</em> (
    ...
    <em>dependentVariable</em> BOOLEAN,
    <em>independentVariables</em> FLOAT8[],
    ...
)</pre>
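
A minimal sketch of a conforming table with two rows of data (all names here
are hypothetical, and the leading 1 in each array provides an intercept term):
@verbatim
-- Hypothetical example; table and column names are illustrative only.
CREATE TABLE patients (
    id            SERIAL,
    second_attack BOOLEAN,  -- dependentVariable
    features      FLOAT8[]  -- independentVariables; features[1] = 1 for intercept
);

INSERT INTO patients (second_attack, features) VALUES
    (TRUE,  ARRAY[1, 60.0, 1.0]),
    (FALSE, ARRAY[1, 52.0, 0.0]);
@endverbatim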

@usage
Logistic regression training offers several options for what it can return:
- Get the vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
  statistics:\n
  <pre>SELECT \ref logregr_train(
    '<em>sourceName</em>', '<em>outName</em>', '<em>dependentVariable</em>',
    '<em>independentVariables</em>'[, '<em>grouping_columns</em>'
    [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em>
    [, <em>verbose</em> ]]]]]
);</pre>
  Output table:
  <pre>coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
-----+----------------+---------+---------+----------+-------------+--------------+---------------
 ...
</pre>
- Get the vector of coefficients \f$ \boldsymbol c \f$:\n
  <pre>SELECT coef FROM outName;</pre>
- Get a subset of the output columns, e.g., only the array of coefficients
  \f$ \boldsymbol c \f$, the log-likelihood
  \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
  <pre>SELECT coef, log_likelihood, p_values FROM outName;</pre>
- By default, the option <em>verbose</em> is False. If it is set to True,
  warning messages will be output to the SQL client for groups that failed.

@examp
-# Create the sample data set:
@verbatim
sql> SELECT * FROM data;
                     r1                      | val
---------------------------------------------+-----
 {1,3.01789340097457,0.454183579888195}      | t
 {1,-2.59380532894284,0.602678326424211}     | f
 {1,-1.30643094424158,0.151587064377964}     | t
 {1,3.60722299199551,0.963550757616758}      | t
 {1,-1.52197745628655,0.0782248834148049}    | t
 {1,-4.8746574902907,0.345104880165309}      | f
...
@endverbatim
-# Run the logistic regression function:
@verbatim
sql> \x on
Expanded display is on.
sql> SELECT logregr_train('data', 'out_tbl', 'val', 'r1', Null, 100, 'irls', 0.001);
sql> SELECT * FROM out_tbl;
coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
log_likelihood | -467.214718489873
std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
condition_no   | 179.186118573205
num_iterations | 9
@endverbatim
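
As a hypothetical follow-up, the fitted coefficients can be turned into
predicted probabilities with the helper function \ref logistic defined at the
end of this file (the <tt>madlib</tt> schema name is an assumption here):
@verbatim
sql> SELECT madlib.logistic(  coef[1]*r1[1]
                            + coef[2]*r1[2]
                            + coef[3]*r1[3]) AS predicted_prob
     FROM data, out_tbl;
@endverbatim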

@literature

A somewhat random selection of nice write-ups, with valuable pointers into
further literature.

[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November
    2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf

[2] Thomas P. Minka: A comparison of numerical optimizers for logistic
    regression, 2003 (revised Mar 26, 2007),
    http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf

[3] Paul Komarek, Andrew W. Moore: Making Logistic Regression A Core Data Mining
    Tool With TR-IRLS, IEEE International Conference on Data Mining 2005,
    pp. 685-688, http://komarix.org/ac/papers/tr-irls.short.pdf

[4] D. P. Bertsekas: Incremental gradient, subgradient, and proximal methods for
    convex optimization: a survey, Technical report, Laboratory for Information
    and Decision Systems, 2010,
    http://web.mit.edu/dimitrib/www/Incremental_Survey_LIDS.pdf

[5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro: Robust stochastic
    approximation approach to stochastic programming, SIAM Journal on
    Optimization, 19(4), 2009,
    http://www2.isye.gatech.edu/~nemirovs/SIOPT_RSA_2009.pdf

@sa File logistic.sql_in (documenting the SQL functions)

@internal
@sa Namespace logistic (documenting the driver/outer loop implemented in
    Python), Namespace
    \ref madlib::modules::regress documenting the implementation in C++
@endinternal
*/

DROP TYPE IF EXISTS MADLIB_SCHEMA.__logregr_result;
CREATE TYPE MADLIB_SCHEMA.__logregr_result AS (
    coef            DOUBLE PRECISION[],
    log_likelihood  DOUBLE PRECISION,
    std_err         DOUBLE PRECISION[],
    z_stats         DOUBLE PRECISION[],
    p_values        DOUBLE PRECISION[],
    odds_ratios     DOUBLE PRECISION[],
    condition_no    DOUBLE PRECISION,
    status          INTEGER,
    num_iterations  INTEGER
);

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_transition(
    DOUBLE PRECISION[],
    BOOLEAN,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_cg_step_transition'
LANGUAGE C IMMUTABLE;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_transition(
    DOUBLE PRECISION[],
    BOOLEAN,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_irls_step_transition'
LANGUAGE C IMMUTABLE;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_transition(
    DOUBLE PRECISION[],
    BOOLEAN,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_igd_step_transition'
LANGUAGE C IMMUTABLE;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_cg_step_merge_states'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_irls_step_merge_states'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_igd_step_merge_states'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_cg_step_final'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_irls_step_final'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'logregr_igd_step_final'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

/**
 * @internal
 * @brief Perform one iteration of the conjugate-gradient method for computing
 *        logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_cg_step(
    /*+ y */ BOOLEAN,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.__logregr_cg_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_cg_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.__logregr_cg_step_final,
    INITCOND='{0,0,0,0,0,0}'
);
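
-- Illustrative sketch (not part of the module): the Python driver issues one
-- aggregate call per outer-loop iteration, feeding back the state array that
-- the previous pass returned. Table and column names here are hypothetical:
--
--   SELECT MADLIB_SCHEMA.__logregr_cg_step(
--       dependent_col,   -- BOOLEAN response
--       independent_col, -- FLOAT8[] features
--       _previous_state  -- state returned by the preceding iteration
--   ) FROM source_table;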


/**
 * @internal
 * @brief Perform one iteration of the iteratively-reweighted-least-squares
 *        method for computing logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_irls_step(
    /*+ y */ BOOLEAN,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.__logregr_irls_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_irls_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.__logregr_irls_step_final,
    INITCOND='{0,0,0,0}'
);

------------------------------------------------------------------------

/**
 * @internal
 * @brief Perform one iteration of the incremental gradient
 *        method for computing logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.__logregr_igd_step(
    /*+ y */ BOOLEAN,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.__logregr_igd_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__logregr_igd_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.__logregr_igd_step_final,
    INITCOND='{0,0,0,0,0}'
);

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', 'internal_logregr_cg_step_distance'
LANGUAGE C IMMUTABLE STRICT;

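-- Illustrative sketch (an assumption about the Python driver, not verified
-- here): the distance between successive states is compared against the
-- 'tolerance' argument of logregr_train to decide convergence, e.g.
--
--   SELECT MADLIB_SCHEMA.__logregr_cg_step_distance(_old_state, _new_state)
--          < tolerance;
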
------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_cg_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.__logregr_result AS
'MODULE_PATHNAME', 'internal_logregr_cg_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', 'internal_logregr_irls_step_distance'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_irls_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.__logregr_result AS
'MODULE_PATHNAME', 'internal_logregr_irls_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', 'internal_logregr_igd_step_distance'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__logregr_igd_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.__logregr_result AS
'MODULE_PATHNAME', 'internal_logregr_igd_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

/**
 * @brief Compute logistic-regression coefficients and diagnostic statistics
 *
 * To include an intercept in the model, set one coordinate in the
 * <tt>independentVariables</tt> array to 1.
 *
 * @param tbl_source Name of the source relation containing the training data
 * @param tbl_output Name of the output relation to store the model results
 *
 *        Columns of the output relation are as follows:
 *        - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
 *        - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
 *        - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
 *          \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
 *        - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
 *        - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
 *        - <tt>odds_ratios FLOAT8[]</tt> - Array of odds ratios,
 *          \f$ \mathit{odds}(c_1), \dots, \mathit{odds}(c_k) \f$
 *        - <tt>condition_no FLOAT8</tt> - The condition number of the
 *          matrix \f$ X^T A X \f$ during the iteration
 *          immediately <em>preceding</em> convergence
 *          (i.e., \f$ A \f$ is computed using the coefficients
 *          of the previous iteration)
 * @param dep_col Name of the dependent column (of type BOOLEAN)
 * @param ind_col Name of the independent column (of type DOUBLE
 *        PRECISION[])
 * @param grouping_col Comma-delimited list of column names to group by
 * @param max_iter The maximum number of iterations
 * @param optimizer The optimizer to use (<tt>'irls'</tt>/<tt>'newton'</tt> for
 *        iteratively reweighted least squares, <tt>'cg'</tt> for conjugate
 *        gradient, or <tt>'igd'</tt> for incremental gradient descent)
 * @param tolerance The difference between log-likelihood values in successive
 *        iterations that should indicate convergence. This value should be
 *        non-negative. A zero value disables the convergence criterion, so
 *        execution stops only after \c max_iter iterations.
 * @param verbose If true, any error or warning message will be printed to the
 *        console (irrespective of the 'client_min_messages' setting of the
 *        server). If false, no error/warning message is printed to the console.
 *
 * @usage
 *  - Get the vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
 *    statistics:\n
 *    <pre>SELECT logregr_train('<em>sourceName</em>', '<em>outName</em>',
 *    '<em>dependentVariable</em>', '<em>independentVariables</em>');
 *    SELECT * FROM outName;
 *    </pre>
 *  - Get the vector of coefficients \f$ \boldsymbol c \f$:\n
 *    <pre>SELECT coef FROM outName;</pre>
 *  - Get a subset of the output columns, e.g., only the array of coefficients
 *    \f$ \boldsymbol c \f$, the log-likelihood
 *    \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
 *    <pre>SELECT coef, log_likelihood, p_values FROM outName;</pre>
 *
 * @note This function starts an iterative algorithm. It is not an aggregate
 *       function. Source, output, and column names have to be passed as strings
 *       (due to limitations of the SQL syntax).
 *
 * @internal
 * @sa This function is a wrapper for logistic::compute_logregr(), which
 *     sets the default values.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER,
    optimizer       VARCHAR,
    tolerance       DOUBLE PRECISION,
    verbose         BOOLEAN
) RETURNS VOID AS $$
PythonFunction(regress, logistic, logregr_train)
$$ LANGUAGE plpythonu;
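
-- Hypothetical call with grouping (table and column names are illustrative):
--
--   SELECT MADLIB_SCHEMA.logregr_train(
--       'patients', 'patients_logregr', 'second_attack', 'features',
--       'gender', 20, 'irls', 0.0001, FALSE);
--
-- This fits one model per distinct value of the grouping column 'gender'.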

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, NULL::VARCHAR, 20, 'irls', 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, 20, 'irls', 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, 'irls', 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER,
    optimizer       VARCHAR)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, 0.0001, False);
$$ LANGUAGE sql VOLATILE;

------------------------------------------------------------------------

CREATE FUNCTION MADLIB_SCHEMA.logregr_train (
    tbl_source      VARCHAR,
    tbl_output      VARCHAR,
    dep_col         VARCHAR,
    ind_col         VARCHAR,
    grouping_col    VARCHAR,
    max_iter        INTEGER,
    optimizer       VARCHAR,
    tolerance       DOUBLE PRECISION)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.logregr_train($1, $2, $3, $4, $5, $6, $7, $8, False);
$$ LANGUAGE sql VOLATILE;


------------------------------------------------------------------------

/**
 * @brief Evaluate the usual logistic function in an under-/overflow-safe way
 *
 * @param x
 * @returns \f$ \frac{1}{1 + \exp(-x)} \f$
 *
 * Evaluating this expression directly can lead to under- or overflows.
 * This function performs the evaluation in a safe manner, making use of the
 * following observations:
 *
 * In order for the outcome of \f$ \exp(x) \f$ to be within the range of the
 * minimum positive double-precision number (i.e., \f$ 2^{-1074} \f$) and the
 * maximum positive double-precision number (i.e.,
 * \f$ (1 + (1 - 2^{-52})) \cdot 2^{1023} \f$), \f$ x \f$ has to be within the
 * natural logarithm of these numbers, so roughly in between -744 and 709.
 * However, \f$ 1 + \exp(x) \f$ will just evaluate to 1 if \f$ \exp(x) \f$ is
 * less than the machine epsilon (i.e., \f$ 2^{-52} \f$) or, equivalently, if
 * \f$ x \f$ is less than the natural logarithm of that; i.e., in any case if
 * \f$ x \f$ is less than -37.
 * Note that taking the reciprocal of the largest double-precision number will
 * not cause an underflow. Hence, no further checks are necessary.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.logistic(x DOUBLE PRECISION)
RETURNS DOUBLE PRECISION
LANGUAGE sql
AS $$
    SELECT CASE
        -- -x < -37, i.e., x > 37: exp(-x) is below machine epsilon,
        -- so the result rounds to 1
        WHEN -$1 < -37 THEN 1
        -- -x > 709, i.e., x < -709: exp(-x) would overflow;
        -- the result is 0 to double precision
        WHEN -$1 > 709 THEN 0
        ELSE 1 / (1 + exp(-$1))
        END;
$$;
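
-- Example of the safe evaluation at extreme arguments (expected: 0, 0.5, 1):
--
--   SELECT MADLIB_SCHEMA.logistic(x)
--   FROM (VALUES (-1000::DOUBLE PRECISION), (0), (1000)) AS t(x);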