User Documentation
 All Files Functions Groups
multilogistic.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file multilogistic.sql_in
4  *
5  * @brief SQL functions for multinomial logistic regression
6  * @date July 2012
7  *
8  * @sa For a brief introduction to multinomial logistic regression, see the
9  * module description \ref grp_mlogreg.
10  *
11  *//* ----------------------------------------------------------------------- */
12 
13 m4_include(`SQLCommon.m4')
14 
15 /**
16 @addtogroup grp_mlogreg
17 
18 @about
19 
20 Multinomial logistic regression is a widely used regression analysis tool that models the outcomes of categorical dependent
21 random variables (denoted \f$ Y \in \{ 0,1,2 \ldots k \} \f$). The models assumes that the conditional mean of the
22 dependant categorical variables is the logistic function of an affine combination of independent
23 variables (usually denoted \f$ \boldsymbol x \f$). That is,
24 \f[
25  E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x)
26 \f]
27 for some unknown vector of coefficients \f$ \boldsymbol c \f$ and where
28 \f$ \sigma(x) = \frac{1}{1 + \exp(-x)} \f$ is the logistic function. Multinomial Logistic
29 regression finds the vector of coefficients \f$ \boldsymbol c \f$ that maximizes
30 the likelihood of the observations.
31 
32 Let
33 - \f$ \boldsymbol y \in \{ 0,1 \}^{n \times k} \f$ denote the vector of observed dependent
34  variables, with \f$ n \f$ rows and \f$ k \f$ columns, containing the observed values of the
35  dependent variable,
36 - \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
37  columns and \f$ n \f$ rows, containing all observed vectors of independent
38  variables \f$ \boldsymbol x_i \f$ as rows.
39 
40 By definition,
41 \f[
42  P[Y = y_i | \boldsymbol x_i]
43  = \sigma((-1)^{y_i} \cdot \boldsymbol c^T \boldsymbol x_i)
44  \,.
45 \f]
46 Maximizing the likelihood
47 \f$ \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \f$
48 is equivalent to maximizing the log-likelihood
49 \f$ \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \f$, which simplifies to
50 \f[
51  l(\boldsymbol c) =
52  -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i}
53  \cdot \boldsymbol c^T \boldsymbol x_i))
54  \,.
55 \f]
56 The Hessian of this objective is \f$ H = -X^T A X \f$ where
57 \f$ A = \text{diag}(a_1, \dots, a_n) \f$ is the diagonal matrix with
58 \f$
59  a_i = \sigma(\boldsymbol c^T \boldsymbol x)
60  \cdot
61  \sigma(-\boldsymbol c^T \boldsymbol x)
62  \,.
63 \f$
64 Since \f$ H \f$ is non-positive definite, \f$ l(\boldsymbol c) \f$ is convex.
65 There are many techniques for solving convex optimization problems. Currently,
66 logistic regression in MADlib can use:
67 - Iteratively Reweighted Least Squares
68 
69 We estimate the standard error for coefficient \f$ i \f$ as
70 \f[
71  \mathit{se}(c_i) = \left( (X^T A X)^{-1} \right)_{ii}
72  \,.
73 \f]
74 The Wald z-statistic is
75 \f[
76  z_i = \frac{c_i}{\mathit{se}(c_i)}
77  \,.
78 \f]
79 
80 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
81 the assumptions inherent in the Wald test) of seeing a value at least as extreme
82 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
83 true. Letting \f$ F \f$ denote the cumulative density function of a standard
84 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is
85 therefore
86 \f[
87  p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
88 \f]
89 where \f$ Z \f$ is a standard normally distributed random variable.
90 
91 The odds ratio for coefficient \f$ i \f$ is estimated as \f$ \exp(c_i) \f$.
92 
93 The condition number is computed as \f$ \kappa(X^T A X) \f$ during the iteration
94 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
95 the coefficients of the previous iteration). A large condition number (say, more
96 than 1000) indicates the presence of significant multicollinearity.
97 
98 The multinomial logistic regression uses a default reference category of zero,
99 and the regression coefficients in the output are in the order described below.
100 For a problem with
101 \f$ K \f$ dependent variables \f$ (1, ..., K) \f$ and \f$ J \f$ categories \f$ (0, ..., J-1)
102 \f$, let \f$ {m_{k,j}} \f$ denote the coefficient for dependent variable \f$ k
103 \f$ and category \f$ j \f$. The output is \f$ {m_{k_1, j_0}, m_{k_1, j_1}
104 \ldots m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1}, \ldots m_{k_2, j_{J-1}} \ldots m_{k_K, j_{J-1}}} \f$.
105 The order is NOT CONSISTENT with the multinomial regression marginal effect
106 calculation with function <em>marginal_mlogregr</em>. This is deliberate
107 because the interfaces of all multinomial regressions (robust, clustered, ...)
108 will be moved to match that used in marginal.
109 
110 
111 @input
112 
113 The training data is expected to be of the following form:\n
114 <pre>{TABLE|VIEW} <em>sourceName</em> (
115  ...
116  <em>dependentVariable</em> INTEGER,
117  <em>independentVariables</em> FLOAT8[],
118  ...
119 )</pre>
120 
121 @usage
122 - The number of independent variables cannot exceed 65535.
123 - The reference category ranges from [0, numCategories-1]. The default reference
124 category is zero.
125 - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
126  statistics:\n
127  <pre>SELECT * FROM \ref mlogregr(
128  '<em>sourceName</em>',
129  '<em>dependentVariable</em>',
130  '<em>independentVariables</em>' [,
131  <em>numberOfIterations</em> [,
132  '<em>optimizer</em>' [,
133  <em>precision</em>, [,
134  <em>ref_category</em>]]]]
135 );</pre>
136  Output:
137 <pre>
138 ref_category | coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations\n ----+------+----------------+---------+---------+----------+-------------+--------------+---------------
139  ...
140 </pre>
141 - Get vector of coefficients \f$ \boldsymbol c \f$:\n
142  <pre>SELECT (\ref mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre>
143 - Get a subset of the output columns, e.g., only the array of coefficients
144  \f$ \boldsymbol c \f$, the log-likelihood of determination
145  \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
146  <pre>SELECT coef, log_likelihood, p_values
147 FROM \ref mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>');</pre>
148 
149 Note that the categories are encoded as integers with values from {0, 1, 2,..., numCategories-1}
150 @examp
151 
152 -# Create the sample data set:
153 @verbatim
154 sql> SELECT * FROM data;
155  r1 | val
156 ---------------------------------------------+-----
157  {1,3.01789340097457,0.454183579888195} | 1
158  {1,-2.59380532894284,0.602678326424211} | 0
159  {1,-1.30643094424158,0.151587064377964} | 1
160  {1,3.60722299199551,0.963550757616758} | 1
161  {1,-1.52197745628655,0.0782248834148049} | 1
162  {1,-4.8746574902907,0.345104880165309} | 0
163 ...
164 @endverbatim
165 -# Run the multi-logistic regression function:
166 @verbatim
167 sql> \x on
168 Expanded display is off.
169 sql> SELECT * FROM mlogregr('data', 'val', '2', 'r1', 100, 'irls', 0.001);
170 -[ RECORD 1 ]--+--------------------------------------------------------------
171 coef | {5.59049410898112,2.11077546770772,-0.237276684606453}
172 log_likelihood | -467.214718489873
173 std_err | {0.318943457652178,0.101518723785383,0.294509929481773}
174 z_stats | {17.5281667482197,20.7919819024719,-0.805666162169712}
175 p_values | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
176 odds_ratios | {267.867942976278,8.2546400100702,0.788773016471171}
177 condition_no | 179.186118573205
178 num_iterations | 9
179 
180 @endverbatim
181 
182 @literature
183 
184 A collection of nice write-ups, with valuable pointers into
185 further literature:
186 
187 [1] Annette J . Dobson: An Introduction to Generalized Linear Models, Second Edition. Nov 2001
188 
189 [2] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November
190  2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf
191 
192 [3] Srikrishna Sridhar, Mark Wellons, Caleb Welton: Multilogistic Regression:
193  Notes and References, Jul 12 2012, http://www.cs.wisc.edu/~srikris/mlogit.pdf
194 
195 [4] Scott A. Czepiel: Maximum Likelihood Estimation
196  of Logistic Regression Models: Theory and Implementation,
197  Retrieved Jul 12 2012, http://czep.net/stat/mlelr.pdf
198 
199 
200 
201 @sa File multilogistic.sql_in (documenting the SQL functions)
202 
203 @internal
204 @sa Namespace multilogistic (documenting the driver/outer loop implemented in
205  Python), Namespace
206  \ref madlib::modules::regress documenting the implementation in C++
207 @endinternal
208 
209 */
210 
211 
212 DROP TYPE IF EXISTS MADLIB_SCHEMA.mlogregr_result;
213 CREATE TYPE MADLIB_SCHEMA.mlogregr_result AS
214 (
215  ref_category INTEGER,
216  coef DOUBLE PRECISION[],
217  log_likelihood DOUBLE PRECISION,
218  std_err DOUBLE PRECISION[],
219  z_stats DOUBLE PRECISION[],
220  p_values DOUBLE PRECISION[],
221  odds_ratios DOUBLE PRECISION[],
222  condition_no DOUBLE PRECISION,
223  num_iterations INTEGER
224 );
225 
226 
227 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__mlogregr_irls_step_transition
228 (
229  state DOUBLE PRECISION[],
230  y INTEGER,
231  num_categories INTEGER,
232  ref_category INTEGER,
233  x DOUBLE PRECISION[],
234  prev_state DOUBLE PRECISION[]
235 )
236 RETURNS DOUBLE PRECISION[]
237 AS 'MODULE_PATHNAME'
238 LANGUAGE C IMMUTABLE;
239 
240 
241 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__mlogregr_irls_step_merge_states
242 (
243  state1 DOUBLE PRECISION[],
244  state2 DOUBLE PRECISION[]
245 )
246 RETURNS DOUBLE PRECISION[]
247 AS 'MODULE_PATHNAME'
248 LANGUAGE C IMMUTABLE STRICT;
249 
250 
251 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__mlogregr_irls_step_final
252 (
253  state DOUBLE PRECISION[]
254 )
255 RETURNS DOUBLE PRECISION[]
256 AS 'MODULE_PATHNAME'
257 LANGUAGE C IMMUTABLE STRICT;
258 
259 
260 /**
261  * @internal
262  * @brief Perform one iteration of the iteratively-reweighted-least-squares
263  * method for computing linear regression
264  */
265 CREATE AGGREGATE MADLIB_SCHEMA.__mlogregr_irls_step(
266  /*+ y */ INTEGER,
267  /*+ numCategories */ INTEGER,
268  /*+ ref_category */ INTEGER,
269  /*+ x */ DOUBLE PRECISION[],
270  /*+ previous_state */ DOUBLE PRECISION[]) (
271 
272  STYPE=DOUBLE PRECISION[],
273  SFUNC=MADLIB_SCHEMA.__mlogregr_irls_step_transition,
274  m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.__mlogregr_irls_step_merge_states,')
275  FINALFUNC=MADLIB_SCHEMA.__mlogregr_irls_step_final,
276  INITCOND='{0,0,0,0,0}'
277 );
278 
279 
280 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__internal_mlogregr_irls_step_distance(
281  /*+ state1 */ DOUBLE PRECISION[],
282  /*+ state2 */ DOUBLE PRECISION[])
283 RETURNS DOUBLE PRECISION AS
284 'MODULE_PATHNAME'
285 LANGUAGE c IMMUTABLE STRICT;
286 
287 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__internal_mlogregr_irls_result(
288  /*+ state */ DOUBLE PRECISION[])
289 RETURNS MADLIB_SCHEMA.mlogregr_result AS
290 'MODULE_PATHNAME'
291 LANGUAGE c IMMUTABLE STRICT;
292 
293 
294 -- We only need to document the last one (unfortunately, in Greenplum we have to
295 -- use function overloading instead of default arguments).
296 CREATE FUNCTION MADLIB_SCHEMA.__compute_mlogregr
297 (
298  source VARCHAR,
299  depvar VARCHAR,
300  indepvar VARCHAR,
301  numcategories INTEGER,
302  maxnumiterations INTEGER,
303  optimizer VARCHAR,
304  "precision" DOUBLE PRECISION,
305  ref_category INTEGER
306 )
307 RETURNS INTEGER
308 AS $$
309  PythonFunction(regress, multilogistic, compute_mlogregr)
310 $$ LANGUAGE plpythonu VOLATILE STRICT;
311 
312 /**
313  * @brief Compute logistic-regression coefficients and diagnostic statistics
314  *
315  * To include an intercept in the model, set one coordinate in the
316  * <tt>independentVariables</tt> array to 1.
317  *
318  * @param source Name of the source relation containing the training data
319  * @param depvar Name of the dependent column (of type INTEGER < numcategories)
320  * @param indepvar Name of the independent column (of type DOUBLE PRECISION[])
321  * @param maxnumiterations The maximum number of iterations
322  * @param optimizer The optimizer to use (
323  * <tt>'irls'</tt>/<tt>'newton'</tt> for iteratively reweighted least
324  * squares)
325  * @param precision The difference between log-likelihood values in successive
326  * iterations that should indicate convergence. Note that a non-positive
327  * value here disables the convergence criterion, and execution will only
328  * stop after \ maxNumIterations iterations.
329  * @param ref_category The reference category specified by the user
330  *
331  * @return A composite value:
332  * - <tt>ref_category INTEGER</tt> - Reference category
333  * - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
334  * - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
335  * - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
336  * \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
337  * - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
338  * - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
339  * - <tt>odds_ratios FLOAT8[]</tt>: Array of odds ratios,
340  * \f$ \mathit{odds}(c_1), \dots, \mathit{odds}(c_k) \f$
341  * - <tt>condition_no FLOAT8</tt> - The condition number of matrix
342  * \f$ X^T A X \f$ during the iteration immediately <em>preceding</em>
343  * convergence (i.e., \f$ A \f$ is computed using the coefficients of the
344  * previous iteration)
345  * - <tt>num_iterations INTEGER</tt> - The number of iterations before the
346  * algorithm terminated
347  *
348  * @usage
349  * - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
350  * statistics:\n
351  * <pre>SELECT * FROM mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
352  * '<em>numCategories</em>', '<em>independentVariables</em>');</pre>
353  * - Get vector of coefficients \f$ \boldsymbol c \f$:\n
354  * <pre>SELECT (mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
355  * '<em>numCategories</em>', '<em>independentVariables</em>')).coef;</pre>
356  * - Get a subset of the output columns, e.g., only the array of coefficients
357  * \f$ \boldsymbol c \f$, the log-likelihood of determination
358  * \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
359  * <pre>SELECT coef, log_likelihood, p_values
360  * FROM mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
361  * '<em>numCategories</em>', '<em>independentVariables</em>');</pre>
362  *
363  * @note This function starts an iterative algorithm. It is not an aggregate
364  * function. Source and column names have to be passed as strings (due to
365  * limitations of the SQL syntax).
366  *
367  * @internal
368  * @sa This function is a wrapper for multilogistic::__compute_mlogregr(), which
369  * sets the default values.
370  */
371 CREATE FUNCTION MADLIB_SCHEMA.mlogregr
372 (
373  source VARCHAR,
374  depvar VARCHAR,
375  indepvar VARCHAR,
376  maxnumiterations INTEGER /*+ DEFAULT 20 */,
377  optimizer VARCHAR /*+ DEFAULT 'irls' */,
378  "precision" DOUBLE PRECISION /*+ DEFAULT 0.0001 */,
379  ref_category INTEGER
380 )
381 RETURNS MADLIB_SCHEMA.mlogregr_result AS $$
382 DECLARE
383  observed_count INTEGER;
384  theIteration INTEGER;
385  fnName VARCHAR;
386  theResult MADLIB_SCHEMA.mlogregr_result;
387  numcategories INTEGER;
388  min_category INTEGER;
389  max_category INTEGER;
390 BEGIN
391  IF (source IS NULL OR trim(source) = '') THEN
392  RAISE EXCEPTION 'Invalid source table given';
393  END IF;
394 
395  IF (depvar IS NULL OR trim(depvar) = '') THEN
396  RAISE EXCEPTION 'Invalid depvar given';
397  END IF;
398  IF (indepvar IS NULL OR trim(indepvar) = '') THEN
399  RAISE EXCEPTION 'Invalid indepvar given';
400  END IF;
401  IF (maxnumiterations IS NULL OR maxnumiterations < 1) THEN
402  RAISE EXCEPTION 'Number of max iterations must be positive';
403  END IF;
404  IF (optimizer IS NULL OR trim(optimizer) = '') THEN
405  RAISE EXCEPTION 'Invalid optimizer given';
406  END IF;
407  IF (precision IS NULL) THEN
408  RAISE EXCEPTION 'Invalid precision given.';
409  END IF;
410  IF (ref_category IS NULL OR ref_category < 0) THEN
411  RAISE EXCEPTION 'Invalid ref_category given.';
412  END IF;
413 
414  IF (SELECT atttypid::regtype <> 'INTEGER'::regtype
415  FROM pg_attribute
416  WHERE attrelid = source::regclass AND attname = depvar) THEN
417  RAISE EXCEPTION 'The dependent variable column should be of type INTEGER';
418  END IF;
419 
420  EXECUTE $sql$ SELECT count(DISTINCT $sql$ || depvar || $sql$ )
421  FROM $sql$ || textin(regclassout(source))
422  INTO observed_count;
423  numcategories := observed_count;
424 
425  EXECUTE $sql$ SELECT max($sql$ || depvar || $sql$ )
426  FROM $sql$ || textin(regclassout(source))
427  INTO max_category;
428  EXECUTE $sql$ SELECT min($sql$ || depvar || $sql$ )
429  FROM $sql$ || textin(regclassout(source))
430  INTO min_category;
431 
432  IF max_category != numcategories - 1 OR min_category != 0 THEN
433  RAISE EXCEPTION 'The value of the dependent variable should be in the
434  range of [0, %)', numcategories;
435  END IF;
436 
437  IF ref_category > numcategories -1 OR ref_category < 0 THEN
438  RAISE EXCEPTION 'The value of the reference category should be in the
439  range of [0, "%")', numcategories;
440  END IF;
441 
442  IF optimizer = 'irls' OR optimizer = 'newton' THEN
443  fnName := '__internal_mlogregr_irls_result';
444  ELSE
445  RAISE EXCEPTION 'Unknown optimizer (''%''). Must be "newton" or "irls"', optimizer;
446  END IF;
447 
448  theIteration := (
449  SELECT MADLIB_SCHEMA.__compute_mlogregr(
450  $1, $2, $3, numcategories, $4, $5, $6, $7)
451  );
452 
453  -- Because of Greenplum bug MPP-10050, we have to use dynamic SQL (using
454  -- EXECUTE) in the following
455  -- Because of Greenplum bug MPP-6731, we have to hide the tuple-returning
456  -- function in a subquery
457  EXECUTE
458  $sql$
459  SELECT (result).*
460  FROM (
461  SELECT
462  MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result
463  FROM _madlib_iterative_alg
464  WHERE _madlib_iteration = $sql$ || theIteration || $sql$
465  ) subq
466  $sql$
467  INTO theResult;
468  -- The number of iterations are not updated in the C++ code. We do it here.
469  IF NOT (theResult IS NULL) THEN
470  theResult.num_iterations = theIteration;
471  END IF;
472  RETURN theResult;
473 END;
474 $$ LANGUAGE plpgsql VOLATILE;
475 
476 
477 CREATE FUNCTION MADLIB_SCHEMA.mlogregr
478 (
479  source VARCHAR,
480  depvar VARCHAR,
481  indepvar VARCHAR
482 )
483 RETURNS MADLIB_SCHEMA.mlogregr_result AS
484 $$
485  SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, 20, 'irls', 0.0001, 0);
486 $$ LANGUAGE sql VOLATILE;
487 
488 CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
489  source VARCHAR,
490  depvar VARCHAR,
491  indepvar VARCHAR,
492  maxnumiterations INTEGER
493 )
494 RETURNS MADLIB_SCHEMA.mlogregr_result AS
495 $$
496  SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, 'irls', 0.0001, 0);
497 $$ LANGUAGE sql VOLATILE;
498 
499 CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
500  source VARCHAR,
501  depvar VARCHAR,
502  indepvar VARCHAR,
503  maxbumiterations INTEGER,
504  optimizer VARCHAR
505 )
506 RETURNS MADLIB_SCHEMA.mlogregr_result AS
507 $$
508  SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, $5, 0.0001, 0);
509 $$ LANGUAGE sql VOLATILE;