User Documentation
 All Files Functions Groups
marginal.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file marginal.sql_in
4  *
5  * @brief SQL functions for linear regression
6  * @date January 2011
7  *
8  * @sa Calculates marginal effects for various regression models.
9  *
10  *//* ----------------------------------------------------------------------- */
11 
12 m4_include(`SQLCommon.m4')
13 
14 
15 /**
16 @addtogroup grp_marginal
17 
18 @about
19 
20 A marginal effect (ME) or partial effect measures the effect on the
21 conditional mean of \f$ y \f$ of a change in one of the regressors, say
22 \f$X_k\f$. In the linear regression model, the ME equals the
23 relevant slope coefficient, greatly simplifying analysis. For nonlinear models,
24 we require specialized algorithms for calculating ME.
25 
26 
27 The standard approach to modeling dichotomous/binary variables
28 (so \f$y \in \{0, 1\} \f$) is to estimate a generalized linear model under the
29 assumption that \f$ y \f$ follows some form of Bernoulli distribution. Thus the expected
30 value of \f$ y \f$ becomes,
31 \f[
32  y = G(X' \beta),
33 \f]
34 
35 where G is the specified binomial distribution. For
36 logistic regression, the funciton \f$ G \f$ represents the inverse logit function.
37 
38 In logistic regression:
39 \f[
40 P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots \beta_j x_j)}}
41  = \frac{1}{1 + e^{-z}}
42  \implies \frac{\partial P}{\partial X_k} = \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot
43  \frac{e^{-z}}{1 + e^{-z}} \\
44  = \beta_k \cdot P \cdot (1-P)
45 \f]
46 
47 There are several methods for calculating the marginal effects for dichotomous
48 dependent variables. This package uses the average of the marginal effects at
49 every sample observation.
50 
51 This is calculated as follows:
52 \f[
53  \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^n P(y_i = 1)(1-P(y_i = 1))}{n}, \\
54  \text{where}, P(y_i=1) = g(X^{(i)}\beta)
55 \f]
56 
57 We use the delta method for calculating standard errors on the marginal effects.
58 
59 @input
60 
61 The input parameters are expected to be of the following form:
62 <pre> <em>margins_logregr</em> (
63  <em>sourceTable</em> VARCHAR,
64  <em>outputTable</em> VARCHAR,
65  <em>dependentVariable</em> VARCHAR,
66  <em>independentVariable</em> VARCHAR,
67  [<em>groupingCol</em> VARCHAR,
68  <em>marginal_vars</em> INTEGER[],
69  <em>max_iter</em> INTEGER,
70  <em>optimizer</em> VARCHAR,
71  <em>tolerance</em> DOUBLE PRECISION
72  ]
73 )</pre>
74 
75 For multinomial logistic regression the input parameters are expected to be of the following form:
76 <pre> <em>margins_mlogregr</em> (
77  <em>sourceTable</em> VARCHAR,
78  <em>outputTable</em> VARCHAR,
79  <em>dependentVariable</em> VARCHAR,
80  <em>independentVariable</em> VARCHAR,
81  [<em>groupingCol</em> VARCHAR,
82  <em>referenceCategory</em> INTEGER,
83  <em>marginal_vars</em> INTEGER[],
84  <em>max_iter</em> INTEGER,
85  <em>optimizer</em> VARCHAR,
86  <em>tolerance</em> DOUBLE PRECISION
87  ]
88 )</pre>
89 
90 \warning The \b 'groupingCol' input parameter for \e margins_logregr and \e margins_mlogregr
91 is a placeholder in the Madlib V1.0. These inputs will be implemented in a future release.
92 @usage
93 
94 <b> The Interface</b>
95 
96 For logistic regression, one can call the following function
97 <pre>
98 SELECT madlib.\ref margins_logregr(
99  <em>'source_table'</em>, -- name of input table, VARCHAR
100  <em>'out_table'</em>, -- name of output table, VARCHAR
101  <em>'dependent_varname'</em>, -- dependent variable, VARCHAR
102  <em>'independent_varname'</em>, -- independent variable, VARCHAR
103  [ <em>'grouping_cols'</em>, -- comma separated list of grouping vars, VARCHAR (Default NULL)
104  <em>'marginal_effect_variables'</em>, -- Index list (base 1) with positions representing which marginal variable to calculate, INTEGER[] (Default NULL)
105  <em> max_iter</em>, -- Maximum number of iterations to run the logistic regression INTEGER (Default 20)
106  <em>'optimizer'</em>, -- Optimizer used for logistic regression VARCHAR (Default & recommended 'irls')
107  <em>tolerance</em>, -- Tolerance for the logistic regression DOUBLE PRECISION (default 1e-4)
108  ]
109 );
110 </pre>
111 
112 For multinomial logistic regression, one can call the following function
113 <pre>
114 SELECT madlib.\refmargins_mlogregr(
115  <em>'source_table'</em>, -- name of input table, VARCHAR
116  <em>'out_table'</em>, -- name of output table, VARCHAR
117  <em>'dependent_varname'</em>, -- dependent variable, VARCHAR
118  <em>'independent_varname'</em>, -- independent variable, VARCHAR
119  [ <em>'reference_category'</em>, -- Reference category for multinomial logistic regression INTEGER
120  <em>'grouping_cols'</em>, -- comma separated list of grouping vars, VARCHAR (Default NULL)
121  <em>'marginal_effect_variables'</em>, -- Index list (base 1) with positions representing which marginal variable to calculate, INTEGER[] (Default NULL)
122  <em> max_iter</em>, -- Maximum number of iterations to run the logistic regression INTEGER (Default 20)
123  <em>'optimizer'</em>, -- Optimizer used for logistic regression VARCHAR (Default & recommended 'irls')
124  <em>tolerance</em>, -- Tolerance for the logistic regression DOUBLE PRECISION (default 1e-4)
125  ]
126 );
127 </pre>
128 
129 Output is stored in the <em>out_table</em>:
130 <pre>
131 [ margins | std_err | t_stats | p_values |
132 +------+---------+-------+----------+
133 </pre>
134 
135 <b> Multinomial-Logistic Regression Notes</b>
136 
137 - The reference category ranges from [0, numCategories-1]. The default reference
138 category is zero.
139 
140 - The marginal effects and supporting statistics are output in the following order.
141 For a problem with K dependent variables (1, ..., K) and J categories (0,..., J-1).
142 The output is \f$ \{ m_{k_1, j_0} \ldots m_{k_K, j_{0}},
143 m_{k_1, j_1} \ldots m_{k_K, j_{1}}, m_{k_1, j_{J-1}} \ldots m_{k_K, j_{J-1}} \} \f$.
144 The order is NOT CONSISTENT with mlogregr, robust_variance_mlogregr and clustered_variance_mlogregr.
145 This is deliberate because the interfaces of all the other functions will be moved
146 to match that used in marginal.
147 
148 - Selectively picking the marginal effects of the independent variables will return
149 the marginal effects of ALL categories associated with the independent variable.
150 
151 - Here the <em>'independent_varname'</em> can be the name of a column, which contains
152 array of numeric values. It can also have a format of string 'array[1, x1, x2, x3]',
153 where <em>x1</em>, <em>x2</em> and <em>x3</em> are all column names.
154 
155 - Here the <em>'vars'</em> is an index list (base 1) with the set of indices
156 amongst the independent variables that must be selected.
157 One can also chose <em>NULL<\em> to pick all indepependent variables.
158 
159 @examp
160 -# Create the sample data set:
161 @verbatim
162 sql> SELECT * FROM data;
163  id | second_attack | treatment | trait_anxiety
164 ----+---------------+-----------+---------------
165  1 | 1 | 1 | 70
166  3 | 1 | 1 | 50
167  5 | 1 | 0 | 40
168  7 | 1 | 0 | 75
169  9 | 1 | 0 | 70
170  11 | 0 | 1 | 65
171  13 | 0 | 1 | 45
172  15 | 0 | 1 | 40
173  17 | 0 | 0 | 55
174 ...
175 @endverbatim
176 -# For function summary information. Run
177 @verbatim
178 sql> select margins_logregr('help');
179 OR
180 sql> select margins_logregr();
181 OR
182 sql> select margins_logregr('');
183 @endverbatim
184 
185 -# Run the logistic regression function and then compute the marginal effects of all variables in the regression:
186 @verbatim
187 sql> select margins_logregr('patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]');
188 sql> select * from result_table;
189 margins | {-0.970665392796,-0.156214190168,0.0181587690137}
190 coef | {-6.36346994178179,-1.02410605239327,0.119044916668605}
191 std_err | {0.802871454422,0.292691682191,0.0137459874022}
192 t_stats | {-1.2089922832,-0.533715850748,1.32102325446}
193 p_values | {0.243212810329,0.600447858606,0.204000202116}
194 
195 
196 -# Alternate Syntax: Run the logistic regression function and then compute the marginal effects if all variables in the regression:
197 sql> select margins_logregr('patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]', NULL, NULL);
198 sql> select * from result_table;
199 margins | {-0.970665392796,-0.156214190168,0.0181587690137}
200 coef | {-6.36346994178179,-1.02410605239327,0.119044916668605}
201 std_err | {0.802871454422,0.292691682191,0.0137459874022}
202 t_stats | {-1.2089922832,-0.533715850748,1.32102325446}
203 p_values | {0.243212810329,0.600447858606,0.204000202116}
204 
205 
206 -# Run the logistic regression function and then compute the marginal effects of the first variable in the regression
207 sql> select margins_logregr('patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]', NULL, ARRAY[1]);
208 sql> select * from result_table;
209 margins | {-0.970665392796}
210 coef | {-6.36346994178179}
211 std_err | {0.802871454422}
212 t_stats | {-1.2089922832}
213 p_values | {0.243212810329}
214 @endverbatim
215 
216 @examp
217 -# Create the sample data set (the full dataset has 3 categories):
218 @verbatim
219 sql> SELECT * FROM data;
220  id | feature_1 | feature_2 | category
221 ----+---------------+-----------+---------------
222  1 | 1 | 35 | 1
223  3 | 2 | 33 | 0
224  5 | 3 | 39 | 1
225  7 | 1 | 37 | 1
226  9 | 2 | 31 | 1
227  11 | 3 | 36 | 1
228  13 | 2 | 36 | 1
229  15 | 2 | 36 | 0
230  17 | 2 | 31 | 5
231 ...
232 @endverbatim
233 -# For function summary information. Run
234 @verbatim
235 sql> select margins_mlogregr('help');
236 OR
237 sql> select margins_mlogregr();
238 OR
239 sql> select margins_mlogregr('');
240 @endverbatim
241 
242 -# Run the regression function and then compute the marginal effects of all variables in the regression (see docs for detailed order)
243 @verbatim
244 sql> select margins_mlogregr('test_data', 'result_table', 'category', 'ARRAY[1, feature_1, feature_2]');
245 sql> select * from result_table;
246 margins | {0.741613239156,-0.032868883552,-0.0144502990691,-0.972055011831,0.112337273885,0.0172621628253}
247 std_err | {0.183172236055,0.044184899499,0.00332608999704,0.263532615748,0.0555196094594,0.00457999429836}
248 t_stats | {4.04872078394,-0.743894043547,-4.34453038911,-3.68855676202,2.02338011702,3.76903587663}
249 p_values | {7.43784735554e-05,0.457840607871,2.24855476205e-05,0.000292799037776,0.0444060346517,0.000217384008015}
250 @endverbatim
251 
252 -# Run the regression and compute the marginals effects for the first dependent variable (all categories)
253 @verbatim
254 sql> select margins_mlogregr('test_data', 'result_table', 'category', 'ARRAY[1, feature_1, feature_2]', 0, NULL, ARRAY[1]);
255 sql> select * from result_table;
256 margins | {0.741613239156,-0.972055011831}
257 std_err | {0.183172236055,0.263532615748}
258 t_stats | {4.04872078394,-3.68855676202}
259 p_values | {7.43784735554e-05,0.000292799037776}
260 @endverbatim
261 
262 @verbatim
263 -# For function usage information.
264 sql> select margins_mlogregr('usage');
265 @endverbatim
266 
267 @literature
268 
269 
270 [1] mfx function in STATA: http://www.stata.com/help.cgi?mfx_option
271 
272 @sa File marginal.sql_in documenting the SQL functions.
273 
274 @internal
275 @sa Namespace \ref madlib::modules::regress
276  documenting the implementation in C++
277 @endinternal
278 */
279 
280 
281 
282 
283 ------------------ Marginal Logistic Regression ------------------------------
284 
285 CREATE TYPE MADLIB_SCHEMA.marginal_logregr_result AS (
286  margins DOUBLE PRECISION[],
287  coef DOUBLE PRECISION[],
288  std_err DOUBLE PRECISION[],
289  t_stats DOUBLE PRECISION[],
290  p_values DOUBLE PRECISION[]
291 );
292 
293 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.marginal_logregr_step_transition(
294  state DOUBLE PRECISION[],
295  y BOOLEAN,
296  x DOUBLE PRECISION[],
297  coef DOUBLE PRECISION[])
298 RETURNS DOUBLE PRECISION[]
299 AS 'MODULE_PATHNAME'
300 LANGUAGE C IMMUTABLE;
301 
302 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.marginal_logregr_step_merge_states(
303  state1 DOUBLE PRECISION[],
304  state2 DOUBLE PRECISION[])
305 RETURNS DOUBLE PRECISION[]
306 AS 'MODULE_PATHNAME'
307 LANGUAGE C IMMUTABLE STRICT;
308 
309 
310 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.marginal_logregr_step_final(
311  state DOUBLE PRECISION[])
312 RETURNS MADLIB_SCHEMA.marginal_logregr_result
313 AS 'MODULE_PATHNAME'
314 LANGUAGE C IMMUTABLE STRICT;
315 
316 
317 /**
318  * @brief Compute marginal effects for logistic regression.
319  *
320  * @param dependentVariable Column containing the dependent variable
321  * @param independentVariables Column containing the array of independent variables
322  * @param coef Column containing the array of the coefficients (as obtained by logregr)
323  *
324  * @par
325  * To include an intercept in the model, set one coordinate in the
326  * <tt>independentVariables</tt> array to 1.
327  *
328  * @return A composite value:
329  * - <tt>margins FLOAT8[] </tt> - Array of marginal effects
330  * - <tt>coefFLOAT8[] </tt> - The coefficients for the regression
331  * - <tt>std_err FLOAT8[]</tt> - Array of standard-errors (calculated by the delta method),
332  * - <tt>t_stats FLOAT8[]</tt> - Array of t-statistics
333  * - <tt>p_values FLOAT8[]</tt> - Array of p-values
334  *
335  * @usage
336  * - Get all the diagnostic statistics:\n
337  *
338  * <pre> SELECT marginal_logregr(<em>dependentVariable</em>,
339  * <em>independentVariables</em>, coef)
340  * FROM <em>dataTable</em>;
341  * </pre>
342  */
343 
344 CREATE AGGREGATE MADLIB_SCHEMA.marginal_logregr(
345  /*+ "dependentVariable" */ BOOLEAN,
346  /*+ "independentVariables" */ DOUBLE PRECISION[],
347  /*+ "coef" */ DOUBLE PRECISION[] )(
348  STYPE=DOUBLE PRECISION[],
349  SFUNC=MADLIB_SCHEMA.marginal_logregr_step_transition,
350  m4_ifdef(`__GREENPLUM__',`PREFUNC=MADLIB_SCHEMA.marginal_logregr_step_merge_states,')
351  FINALFUNC=MADLIB_SCHEMA.marginal_logregr_step_final,
352  INITCOND='{0,0,0,0,0,0}'
353 );
354 
355 
356 --------------------------- Interface ----------------------------------
357 
358 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
359  usage_string VARCHAR -- usage string
360 )
361 RETURNS VARCHAR AS $$
362 DECLARE
363 insert_string VARCHAR;
364 BEGIN
365  IF (usage_string = '' OR usage_string = 'help' OR usage_string = '?') THEN
366  insert_string := '' ||
367  E'Summary \n' ||
368  E'-----------------------------------------------------------------------------------------\n' ||
369  E' Functionality: Calculate marginal effects for logistic regression\n' ||
370  E' The funciton first runs the regression and then calculates the \n' ||
371  E' coefficients and uses them to calculate the marginal effects \n' ||
372  E' SELECT {schema_madlib}.margins_logregr(''source_table'' \n' ||
373  E' ,''output_table'' \n' ||
374  E' ,''dependent_variable'' \n' ||
375  E' ,''independent_variable'' \n' ||
376  E' [,''grouping_cols'' \n' ||
377  E' , marginal_variables \n' ||
378  E' , max_iter \n' ||
379  E' ,''optimizer'' \n' ||
380  E' , tolerance \n' ||
381  E' , verbose_mode \n' ||
382  E' ]' ||
383  E' );\n' ||
384  E'For more details on function usage: \n' ||
385  E'SELECT {schema_madlib}.margins_logregr(''usage'') \n' ||
386  E'';
387  ElSIF (usage_string = 'usage') THEN
388  insert_string := '' ||
389  E'-----------------------------------------------------------------------------------------\n' ||
390  E'Usage\n' ||
391  E'-----------------------------------------------------------------------------------------\n' ||
392  E' SELECT {schema_madlib}.margins_logregr( \n' ||
393  E' ''source_table'', -- Name of data table \n' ||
394  E' ''output_table'', -- Name of result table (overwrites if exists) \n' ||
395  E' ''dependent_variable'', -- Name of column for dependent variables\n' ||
396  E' ''independent_variable'', -- Name of column for independent variables\n' ||
397  E' (can be any SQL expression that evaluates to an array)\n' ||
398  E'[ ''group_cols'', -- Comma separated string with columns to group by\n' ||
399  E' (Optional : Default NULL => no grouping) \n' ||
400  E' ''margins_vars'', -- Index list (base 1) representing the independent variables to compute marginal effects on \n' ||
401  E' (Optional : Default NULL which computes marginal effects on all variables) \n' ||
402  E' max_iter, -- Maximum number of iterations for the logistic regression\n' ||
403  E' (Optional : Default 20) \n' ||
404  E' ''optimizer'', -- Optimizer for logistic regression (newton/irls, cg or igd)\n' ||
405  E' (Optional : Default irls) \n' ||
406  E' tolerance, -- Termination criterion for logistic regression (relative)\n' ||
407  E' (Optional : Default 1e-4) \n' ||
408  E' verbose_mode, -- Verbose mode \n' ||
409  E' (Optional : Default False) \n' ||
410  E']' ||
411  E' );\n' ||
412  E'\n' ||
413  E'-----------------------------------------------------------------------------------------\n' ||
414  E'Output:\n' ||
415  E'-----------------------------------------------------------------------------------------\n' ||
416  E' The output table (''output_table'' above) has the following columns\n' ||
417  E' ''margins'' DOUBLE PRECISION[], -- Marginal effects \n' ||
418  E' ''std_err'' DOUBLE PRECISION[], -- Standard errors using the delta method\n' ||
419  E' ''t_stats'' DOUBLE PRECISION[], -- T-stats of the marginal effects\n' ||
420  E' ''p_values'' DOUBLE PRECISION[], -- p-values of the marginal effects\n' ||
421  E'\n' ||
422  E'';
423  ELSE
424  insert_string := 'No such option. Run SELECT {schema_madlib}.margins_logregr()';
425  END IF;
426  RETURN insert_string;
427 END;
428 $$ LANGUAGE plpgsql VOLATILE;
429 
430 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr()
431 RETURNS VARCHAR AS $$
432 BEGIN
433  RETURN MADLIB_SCHEMA.margins_logregr('');
434 END;
435 $$ LANGUAGE plpgsql VOLATILE;
436 
437 
438 /**
439  @brief A wrapper function for the various marginal regression analyzes.
440  *
441  * @param source_table String identifying the input table
442  * @param out_table String identifying the output table to be created
443  * @param dependent_varname Column containing the dependent variable
444  * @param input_independent_varname Column containing the array of independent variables
445  * @param grouping_cols Set of columns to group by.
446  * @param marginal_vars Subset of independent variables to calculate marginal effects for.
447  * @param max_iter Maximum number of iterations
448  * @param optimzer Optimizer to be used (newton/irls, cg or idg)
449  * @param tolerance Resiual tolerance
450  * @param verbose_mode Verbose mode (on/off)
451  *
452  * @par
453  * To include an intercept in the model, set one coordinate in the
454  * <tt>independentVariables</tt> array to 1.
455  *
456  * @return void
457  *
458  * @usage
459  * For function summary information. Run
460  * sql> select margins_logregr('help');
461  * OR
462  * sql> select margins_logregr();
463  * OR
464  * sql> select margins_logregr('?');
465  * For function usage information. Run
466  * sql> select margins_logregr('usage');
467  * - Compute the coefficients, and the get the marginal diagnostic statistics:
468  * <pre>
469  * select margins_logregr(source_table, out_table, dependentVariable, independentVariables);
470  * </pre>
471  *
472  */
473 
474 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
475  source_table VARCHAR -- name of input table
476  , out_table VARCHAR -- name of output table
477  , dependent_varname VARCHAR -- name of dependent variable
478  , input_independent_varname VARCHAR -- name of independent variable
479  , input_group_cols VARCHAR -- names of columns to group by
480  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
481  , max_iter INTEGER -- Max iterations for the logstic regression inner call
482  , optimizer VARCHAR -- Logistic regression optimizer
483  , tolerance DOUBLE PRECISION -- Tolerance
484  , verbose_mode BOOLEAN -- Verbose mode
485  )
486 RETURNS VOID AS $$
487 PythonFunction(regress, marginal, margins_logregr)
488 $$ LANGUAGE plpythonu;
489 
490 
491 
492 -- Default Variable calls for margins_logregr
493 ------------------------------------------------------------------------------
494 
495 /**
496  * @brief Marginal effects with default variables
497  **/
498 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
499  source_table VARCHAR -- name of input table
500  , out_table VARCHAR -- name of output table
501  , dependent_variable VARCHAR -- name of dependent variable
502  , independent_variable VARCHAR -- name of independent variable
503  )
504 RETURNS VOID AS $$
505 BEGIN
506  PERFORM MADLIB_SCHEMA.margins_logregr(
507  source_table,
508  out_table,
509  dependent_variable,
510  independent_variable,
511  NULL,
512  NULL,
513  20,
514  'irls',
515  0.0001,
516  False);
517 END;
518 $$ LANGUAGE plpgsql VOLATILE;
519 
520 /**
521  * @brief Marginal effects with default variable_names
522  **/
523 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
524  source_table VARCHAR -- name of input table
525  , out_table VARCHAR -- name of output table
526  , dependent_variable VARCHAR -- name of dependent variable
527  , independent_variable VARCHAR -- name of independent variable
528  , grouping_cols VARCHAR -- name of grouping cols
529  )
530 RETURNS VOID AS $$
531 BEGIN
532  PERFORM MADLIB_SCHEMA.margins_logregr(
533  source_table,
534  out_table,
535  dependent_variable,
536  independent_variable,
537  grouping_cols,
538  NULL,
539  20,
540  'irls',
541  0.0001,
542  False);
543 END;
544 $$ LANGUAGE plpgsql VOLATILE;
545 
546 /**
547  * @brief Marginal effects with default variable_names
548  **/
549 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
550  source_table VARCHAR -- name of input table
551  , out_table VARCHAR -- name of output table
552  , dependent_variable VARCHAR -- name of dependent variable
553  , independent_variable VARCHAR -- name of independent variable
554  , grouping_cols VARCHAR -- name of grouping cols
555  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
556  )
557 RETURNS VOID AS $$
558 BEGIN
559  PERFORM MADLIB_SCHEMA.margins_logregr(
560  source_table,
561  out_table,
562  dependent_variable,
563  independent_variable,
564  grouping_cols,
565  marginal_vars,
566  20,
567  'irls',
568  0.0001,
569  False);
570 END;
571 $$ LANGUAGE plpgsql VOLATILE;
572 
573 /**
574  * @brief Marginal effects with default variable_names
575  **/
576 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
577  source_table VARCHAR -- name of input table
578  , out_table VARCHAR -- name of output table
579  , dependent_variable VARCHAR -- name of dependent variable
580  , independent_variable VARCHAR -- name of independent variable
581  , grouping_cols VARCHAR -- name of grouping cols
582  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
583  , max_iter INTEGER -- Max iterations for the logstic regression inner call
584  )
585 RETURNS VOID AS $$
586 BEGIN
587  PERFORM MADLIB_SCHEMA.margins_logregr(
588  source_table,
589  out_table,
590  dependent_variable,
591  independent_variable,
592  grouping_cols,
593  marginal_vars,
594  max_iter,
595  'irls',
596  0.0001,
597  False);
598 END;
599 $$ LANGUAGE plpgsql VOLATILE;
600 
601 
602 /**
603  * @brief Marginal effects with default variable_names
604  **/
605 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
606  source_table VARCHAR -- name of input table
607  , out_table VARCHAR -- name of output table
608  , dependent_variable VARCHAR -- name of dependent variable
609  , independent_variable VARCHAR -- name of independent variable
610  , grouping_cols VARCHAR -- name of grouping cols
611  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
612  , max_iter INTEGER -- Max iterations for the logstic regression inner call
613  , optimizer VARCHAR -- Logistic regression optimizer
614  )
615 RETURNS VOID AS $$
616 BEGIN
617  PERFORM MADLIB_SCHEMA.margins_logregr(
618  source_table,
619  out_table,
620  dependent_variable,
621  independent_variable,
622  grouping_cols,
623  marginal_vars,
624  max_iter,
625  optimizer,
626  0.0001,
627  False);
628 END;
629 $$ LANGUAGE plpgsql VOLATILE;
630 
631 /**
632  * @brief Marginal effects with default variable_names
633  **/
634 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_logregr(
635  source_table VARCHAR -- name of input table
636  , out_table VARCHAR -- name of output table
637  , dependent_variable VARCHAR -- name of dependent variable
638  , independent_variable VARCHAR -- name of independent variable
639  , grouping_cols VARCHAR -- name of grouping cols
640  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
641  , max_iter INTEGER -- Max iterations for the logstic regression inner call
642  , optimizer VARCHAR -- Logistic regression optimizer
643  , tolerance DOUBLE PRECISION -- Tolerance
644  )
645 RETURNS VOID AS $$
646 BEGIN
647  PERFORM MADLIB_SCHEMA.margins_logregr(
648  source_table,
649  out_table,
650  dependent_variable,
651  independent_variable,
652  grouping_cols,
653  marginal_vars,
654  max_iter,
655  optimizer,
656  tolerance,
657  False);
658 END;
659 $$ LANGUAGE plpgsql VOLATILE;
660 -- End of Default Variable calls for margins_logregr
661 ------------------------------------------------------------------------------
662 
663 
664 
665 ------------------ Marginal Multi-Logistic Regression ------------------------------
666 
667 CREATE TYPE MADLIB_SCHEMA.marginal_mlogregr_result AS (
668  margins DOUBLE PRECISION[],
669  coef DOUBLE PRECISION[],
670  std_err DOUBLE PRECISION[],
671  t_stats DOUBLE PRECISION[],
672  p_values DOUBLE PRECISION[]
673 );
674 
675 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_marginal_step_transition(
676  state DOUBLE PRECISION[],
677  y INTEGER,
678  num_categories INTEGER,
679  ref_category INTEGER,
680  x DOUBLE PRECISION[],
681  coef DOUBLE PRECISION[])
682 RETURNS DOUBLE PRECISION[]
683 AS 'MODULE_PATHNAME'
684 LANGUAGE C IMMUTABLE;
685 
686 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_marginal_step_merge_states(
687  state1 DOUBLE PRECISION[],
688  state2 DOUBLE PRECISION[])
689 RETURNS DOUBLE PRECISION[]
690 AS 'MODULE_PATHNAME'
691 LANGUAGE C IMMUTABLE STRICT;
692 
693 
694 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_marginal_step_final(
695  state DOUBLE PRECISION[])
696 RETURNS MADLIB_SCHEMA.marginal_mlogregr_result
697 AS 'MODULE_PATHNAME'
698 LANGUAGE C IMMUTABLE STRICT;
699 
700 
701 /**
702  * @brief Compute marginal effects for multinomial logistic regression.
703  *
704  * @param dependentVariable Column containing the dependent variable
705  * @param independentVariables Column containing the array of independent variables
706  * @param num_categories Number of categories
707  * @param ref_category Reference category
708  * @param coef Column containing the array of the coefficients (as obtained by mlogregr)
709  *
710  * @par
711  * To include an intercept in the model, set one coordinate in the
712  * <tt>independentVariables</tt> array to 1.
713  *
714  * @return A composite value:
715  * - <tt>margins FLOAT8[] </tt> - Array of marginal effects
716  * - <tt>coef FLOAT8[] </tt> - The coefficients for the regression
717  * - <tt>std_err FLOAT8[]</tt> - Array of standard-errors (using the delta method),
718  * - <tt>t_stats FLOAT8[]</tt> - Array of t-statistics
719  * - <tt>p_values FLOAT8[]</tt> - Array of p-values
720  *
721  * @usage
722  * - Get all the diagnostic statistics:\n
723  *
724  * <pre> SELECT marginal_mlogregr(<em>dependentVariable</em>,
725  * <em>independentVariables</em>, coef)
726  * FROM <em>dataTable</em>;
727  * </pre>
728  */
729 
730 CREATE AGGREGATE MADLIB_SCHEMA.marginal_mlogregr(
731  /*+ "dependentVariable" */ INTEGER,
732  /*+ "num_categories" */ INTEGER,
733  /*+ "ref_category" */ INTEGER,
734  /*+ "independentVariables" */ DOUBLE PRECISION[],
735  /*+ "coef" */ DOUBLE PRECISION[] )(
736  STYPE=DOUBLE PRECISION[],
737  SFUNC=MADLIB_SCHEMA.mlogregr_marginal_step_transition,
738  m4_ifdef(`__GREENPLUM__',`PREFUNC=MADLIB_SCHEMA.mlogregr_marginal_step_merge_states,')
739  FINALFUNC=MADLIB_SCHEMA.mlogregr_marginal_step_final,
740  INITCOND='{0,0,0,0,0,0}'
741 );
742 
743 
744 --------------------------- Interface ----------------------------------
745 
746 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
747  usage_string VARCHAR -- usage string
748 )
749 RETURNS VARCHAR AS $$
750 DECLARE
751 insert_string VARCHAR;
752 BEGIN
753  IF (usage_string = '' OR usage_string = 'help' OR usage_string = '?') THEN
754  insert_string := '' ||
755  E'Summary \n' ||
756  E'-----------------------------------------------------------------------------------------\n' ||
757  E' Functionality: Calculate marginal effects for logistic regression\n' ||
758  E' The funciton first runs the regression and then calculates the \n' ||
759  E' coefficients and uses them to calculate the marginal effects \n' ||
760  E' SELECT {schema_madlib}.margins_mlogregr(''source_table'' \n' ||
761  E' ,''output_table'' \n' ||
762  E' ,''dependent_variable'' \n' ||
763  E' ,''independent_variable'' \n' ||
764  E' [, ref_category \n' ||
765  E' ,''grouping_cols'' \n' ||
766  E' ,''marginal_variables'' \n' ||
767  E' , max_iter \n' ||
768  E' ,''optimizer'' \n' ||
769  E' , tolerance \n' ||
770  E' , verbose_mode \n' ||
771  E' ]' ||
772  E' );\n' ||
773  E'For more details on function usage: \n' ||
774  E'SELECT {schema_madlib}.margins_mlogregr(''usage'') \n' ||
775  E'';
776  ElSIF (usage_string = 'usage') THEN
777  insert_string := '' ||
778  E'-----------------------------------------------------------------------------------------\n' ||
779  E'Usage\n' ||
780  E'-----------------------------------------------------------------------------------------\n' ||
781  E' SELECT {schema_madlib}.margins_mlogregr( \n' ||
782  E' ''source_table'', -- Name of data table \n' ||
783  E' ''output_table'', -- Name of result table (overwrites if exists) \n' ||
784  E' ''dependent_variable'', -- Name of column for dependent variables\n' ||
785  E' ''independent_variable'', -- Name of column for independent variables\n' ||
786  E' (can be any SQL expression that evaluates to an array)\n' ||
787  E'[ ref_category, -- Reference category for the multinomial logisitic regressionby\n' ||
788  E' (Optional : Default 0) \n' ||
789  E' ''group_cols'', -- Comma separated string with columns to group by\n' ||
790  E' (Optional : Default NULL => no grouping) \n' ||
791  E' margins_vars, -- Index list (base 1) representing the independent variables to compute marginal effects on \n' ||
792  E' (Optional : Default NULL which computes marginal effects on all variables) \n' ||
793  E' max_iter, -- Maximum number of iterations for the logistic regression\n' ||
794  E' (Optional : Default 20) \n' ||
795  E' ''optimizer'', -- Optimizer for logistic regression (newton/irls, cg or igd)\n' ||
796  E' (Optional : Default irls) \n' ||
797  E' tolerance, -- Termination criterion for logistic regression (relative)\n' ||
798  E' (Optional : Default 1e-4) \n' ||
799  E' verbose_mode, -- Verbose mode \n' ||
800  E' (Optional : Default False) \n' ||
801  E']' ||
802  E' );\n' ||
803  E'\n' ||
804  E'-----------------------------------------------------------------------------------------\n' ||
805  E'Output:\n' ||
806  E'-----------------------------------------------------------------------------------------\n' ||
807  E' The output table (''output_table'' above) has the following columns\n' ||
808  E' ''margins'' DOUBLE PRECISION[], -- Marginal effects \n' ||
809  E' ''std_err'' DOUBLE PRECISION[], -- Standard errors using the delta method\n' ||
810  E' ''t_stats'' DOUBLE PRECISION[], -- T-stats of the standard errors\n' ||
811  E' ''p_values'' DOUBLE PRECISION[], -- p-values of the standard errors\n' ||
812  E'\n' ||
813  E'';
814  ELSE
815  insert_string := 'No such option. Run SELECT {schema_madlib}.margins_mlogregr()';
816  END IF;
817  RETURN insert_string;
818 END;
819 $$ LANGUAGE plpgsql VOLATILE;
820 
821 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr()
822 RETURNS VARCHAR AS $$
823 BEGIN
824  RETURN MADLIB_SCHEMA.margins_mlogregr('');
825 END;
826 $$ LANGUAGE plpgsql VOLATILE;
827 
828 
829 /**
830  @brief A wrapper function for the various marginal multinomial regression
831  *
832  * @param source_table String identifying the input table
833  * @param out_table String identifying the output table to be created
834  * @param dependent_varname Column containing the dependent variable
835  * @param input_independent_varname Column containing the array of independent variables
836  * @param ref_category Reference category for the multinomial logistic regression
837  * @param grouping_cols Set of columns to group by.
838  * @param marginal_vars Subset of independent variables to calculate marginal effects for.
839  * @param max_iter Maximum number of iterations
840  * @param optimzer Optimizer to be used (newton/irls, cg or idg)
841  * @param tolerance Resiual tolerance
842  * @param verbose_mode Verbose mode (on/off)
843  *
844  * @par
845  * To include an intercept in the model, set one coordinate in the
846  * <tt>independentVariables</tt> array to 1.
847  *
848  * @return void
849  *
850  * @usage
851  * For function summary information. Run
852  * sql> select margins_mlogregr('help');
853  * OR
854  * sql> select margins_mlogregr();
855  * OR
856  * sql> select margins_mlogregr('?');
857  * For function usage information. Run
858  * sql> select margins_mlogregr('usage');
859  *
860  */
861 
862 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
863  source_table VARCHAR -- name of input table
864  , out_table VARCHAR -- name of output table
865  , dependent_varname VARCHAR -- name of dependent variable
866  , input_independent_varname VARCHAR -- name of independent variable
867  , ref_category INTEGER -- reference category
868  , input_group_cols VARCHAR -- names of columns to group by
869  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
870  , max_iter INTEGER -- Max iterations for the logstic regression inner call
871  , optimizer VARCHAR -- Logistic regression optimizer
872  , tolerance DOUBLE PRECISION -- Tolerance
873  , verbose_mode BOOLEAN -- Verbose mode
874  )
875 RETURNS VOID AS $$
876 PythonFunction(regress, marginal, margins_mlogregr)
877 $$ LANGUAGE plpythonu;
878 
879 
880 
881 -- Default Variable calls for margins_mlogregr
882 ------------------------------------------------------------------------------
883 
884 /**
885  * @brief Marginal effects with default variables
886  **/
887 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
888  source_table VARCHAR -- name of input table
889  , out_table VARCHAR -- name of output table
890  , dependent_variable VARCHAR -- name of dependent variable
891  , independent_variable VARCHAR -- name of independent variable
892  )
893 RETURNS VOID AS $$
894 BEGIN
895  PERFORM MADLIB_SCHEMA.margins_mlogregr(
896  source_table,
897  out_table,
898  dependent_variable,
899  independent_variable,
900  0,
901  NULL,
902  NULL,
903  20,
904  'irls',
905  0.0001,
906  False);
907 END;
908 $$ LANGUAGE plpgsql VOLATILE;
909 
910 /**
911  * @brief Marginal effects with default variables
912  **/
913 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
914  source_table VARCHAR -- name of input table
915  , out_table VARCHAR -- name of output table
916  , dependent_variable VARCHAR -- name of dependent variable
917  , independent_variable VARCHAR -- name of independent variable
918  , ref_category INTEGER -- reference category
919  )
920 RETURNS VOID AS $$
921 BEGIN
922  PERFORM MADLIB_SCHEMA.margins_mlogregr(
923  source_table,
924  out_table,
925  dependent_variable,
926  independent_variable,
927  ref_category,
928  NULL,
929  NULL,
930  20,
931  'irls',
932  0.0001,
933  False);
934 END;
935 $$ LANGUAGE plpgsql VOLATILE;
936 
937 /**
938  * @brief Marginal effects with default variable_names
939  **/
940 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
941  source_table VARCHAR -- name of input table
942  , out_table VARCHAR -- name of output table
943  , dependent_variable VARCHAR -- name of dependent variable
944  , independent_variable VARCHAR -- name of independent variable
945  , ref_category INTEGER -- reference category
946  , grouping_cols VARCHAR -- name of grouping cols
947  )
948 RETURNS VOID AS $$
949 BEGIN
950  PERFORM MADLIB_SCHEMA.margins_mlogregr(
951  source_table,
952  out_table,
953  dependent_variable,
954  independent_variable,
955  ref_category,
956  grouping_cols,
957  NULL,
958  20,
959  'irls',
960  0.0001,
961  False);
962 END;
963 $$ LANGUAGE plpgsql VOLATILE;
964 
965 /**
966  * @brief Marginal effects with default variable_names
967  **/
968 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
969  source_table VARCHAR -- name of input table
970  , out_table VARCHAR -- name of output table
971  , dependent_variable VARCHAR -- name of dependent variable
972  , independent_variable VARCHAR -- name of independent variable
973  , ref_category INTEGER -- reference category
974  , grouping_cols VARCHAR -- name of grouping cols
975  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
976  )
977 RETURNS VOID AS $$
978 BEGIN
979  PERFORM MADLIB_SCHEMA.margins_mlogregr(
980  source_table,
981  out_table,
982  dependent_variable,
983  independent_variable,
984  ref_category,
985  grouping_cols,
986  marginal_vars,
987  20,
988  'irls',
989  0.0001,
990  False);
991 END;
992 $$ LANGUAGE plpgsql VOLATILE;
993 
994 /**
995  * @brief Marginal effects with default variable_names
996  **/
997 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
998  source_table VARCHAR -- name of input table
999  , out_table VARCHAR -- name of output table
1000  , dependent_variable VARCHAR -- name of dependent variable
1001  , independent_variable VARCHAR -- name of independent variable
1002  , ref_category INTEGER -- reference category
1003  , grouping_cols VARCHAR -- name of grouping cols
1004  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
1005  , max_iter INTEGER -- Max iterations for the logstic regression inner call
1006  )
1007 RETURNS VOID AS $$
1008 BEGIN
1009  PERFORM MADLIB_SCHEMA.margins_mlogregr(
1010  source_table,
1011  out_table,
1012  dependent_variable,
1013  independent_variable,
1014  ref_category,
1015  grouping_cols,
1016  marginal_vars,
1017  max_iter,
1018  'irls',
1019  0.0001,
1020  False);
1021 END;
1022 $$ LANGUAGE plpgsql VOLATILE;
1023 
1024 
1025 /**
1026  * @brief Marginal effects with default variable_names
1027  **/
1028 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
1029  source_table VARCHAR -- name of input table
1030  , out_table VARCHAR -- name of output table
1031  , dependent_variable VARCHAR -- name of dependent variable
1032  , independent_variable VARCHAR -- name of independent variable
1033  , ref_category INTEGER -- reference category
1034  , grouping_cols VARCHAR -- name of grouping cols
1035  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
1036  , max_iter INTEGER -- Max iterations for the logstic regression inner call
1037  , optimizer VARCHAR -- Logistic regression optimizer
1038  )
1039 RETURNS VOID AS $$
1040 BEGIN
1041  PERFORM MADLIB_SCHEMA.margins_mlogregr(
1042  source_table,
1043  out_table,
1044  dependent_variable,
1045  independent_variable,
1046  ref_category,
1047  grouping_cols,
1048  marginal_vars,
1049  max_iter,
1050  optimizer,
1051  0.0001,
1052  False);
1053 END;
1054 $$ LANGUAGE plpgsql VOLATILE;
1055 
1056 /**
1057  * @brief Marginal effects with default variable_names
1058  **/
1059 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.margins_mlogregr(
1060  source_table VARCHAR -- name of input table
1061  , out_table VARCHAR -- name of output table
1062  , dependent_variable VARCHAR -- name of dependent variable
1063  , independent_variable VARCHAR -- name of independent variable
1064  , ref_category INTEGER -- reference category
1065  , grouping_cols VARCHAR -- name of grouping cols
1066  , marginal_vars INTEGER[] -- indices of variables to calculate marginal effects on
1067  , max_iter INTEGER -- Max iterations for the logstic regression inner call
1068  , optimizer VARCHAR -- Logistic regression optimizer
1069  , tolerance DOUBLE PRECISION -- Tolerance
1070  )
1071 RETURNS VOID AS $$
1072 BEGIN
1073  PERFORM MADLIB_SCHEMA.margins_mlogregr(
1074  source_table,
1075  out_table,
1076  dependent_variable,
1077  independent_variable,
1078  ref_category,
1079  grouping_cols,
1080  marginal_vars,
1081  max_iter,
1082  optimizer,
1083  tolerance,
1084  False);
1085 END;
1086 $$ LANGUAGE plpgsql VOLATILE;
1087 -- End of Default Variable calls for margins_mlogregr
1088 ------------------------------------------------------------------------------