1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file elastic_net.sql_in
4  *
5  * @brief SQL functions for elastic net regularization
6  * @date July 2012
7  *
8  * @sa For a brief introduction to elastic net, see the module
9  * description \ref grp_lasso.
10  *
11  *//* ----------------------------------------------------------------------- */
12 
13 m4_include(`SQLCommon.m4') --'
14 
15 /**
16 @addtogroup grp_elasticnet
17 
18 @about
19 
20 This module implements the elastic net regularization for regression problems.
21 
22 This method seeks to find a weight vector that, for any given training example set, minimizes:
23 \f[\min_{w \in R^N} L(w) + \lambda \left(\frac{(1-\alpha)}{2} \|w\|_2^2 + \alpha \|w\|_1 \right)\f]
24 where \f$L\f$ is the loss function that the user wants to minimize. Here \f$ \alpha \in [0,1] \f$
25 and \f$ \lambda \geq 0 \f$. If \f$ \alpha = 0 \f$, we have the ridge regularization (also known as Tikhonov regularization), and if \f$ \alpha = 1 \f$, we have the LASSO regularization.
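To make the penalty term concrete, here is a minimal Python sketch of the regularization term above. This is illustrative only, with hypothetical names, and is not the module's internal implementation:

```python
def elastic_net_penalty(w, alpha, lam):
    # lambda * ( (1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1 )
    l1 = sum(abs(wi) for wi in w)          # LASSO part
    l2_sq = sum(wi * wi for wi in w)       # ridge part (squared L2 norm)
    return lam * ((1.0 - alpha) / 2.0 * l2_sq + alpha * l1)
```

With alpha = 1 only the L1 term survives, and with alpha = 0 only the (halved) squared L2 term, matching the two limiting cases described above.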
26 
27 For the Gaussian response family (or linear model), we have
28 \f[L(\vec{w}) = \frac{1}{2}\left[\frac{1}{M} \sum_{m=1}^M (w^{t} x_m + w_{0} - y_m)^2 \right]
29 \f]
30 
31 For the Binomial response family (or logistic model), we have
32 \f[
33 L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
34  \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
35  \vec{w}\cdot\vec{x}_m}\right)\right]\ ,
36 \f]
37 where \f$y_m \in {0,1}\f$.
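The binomial loss above can be evaluated term by term; a small illustrative sketch (not the module's C implementation):

```python
import math

def binomial_loss(w0, w, X, y):
    # L(w) = sum_m [ y_m*log(1 + e^{-(w0 + w.x_m)})
    #              + (1 - y_m)*log(1 + e^{w0 + w.x_m}) ],  y_m in {0, 1}
    total = 0.0
    for xm, ym in zip(X, y):
        t = w0 + sum(wi * xi for wi, xi in zip(w, xm))
        total += ym * math.log(1 + math.exp(-t)) + (1 - ym) * math.log(1 + math.exp(t))
    return total
```

At w = 0 every example contributes log(2), and the loss decreases as the linear score agrees more strongly with the label.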
38 
39 To get better convergence, one can rescale the value of each element of x
40 \f[ x' \leftarrow \frac{x - \bar{x}}{\sigma_x} \f]
41 and for Gaussian case we also let
42 \f[y' \leftarrow y - \bar{y} \f]
43 and then minimize with the regularization terms.
44 At the end of the calculation, the original scales are restored, and an
45 intercept term is obtained as a by-product.
46 
47 Note that fitting after scaling is not equivalent to fitting on the unscaled data.
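Per feature column, the rescaling above amounts to the following sketch (hypothetical helper, using the population standard deviation):

```python
def standardize_column(xs):
    # x' = (x - mean) / std for one feature column
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]
```

After this transformation the column has zero mean and unit variance, which typically improves the conditioning of the optimization.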
48 
49 Right now, two optimizers are supported. The default one is FISTA; the other is IGD.
50 Each has its own parameters, which can be specified in the <em>optimizer_params</em>
51 argument as a comma-separated string. For example, 'max_stepsize = 0.1, warmup = t, warmup_lambdas = [0.4, 0.3, 0.2]'.
52 
53 <b>(1) FISTA</b>
54 
55 Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [2] has the following optimizer-specific parameters:
56 
57  - max_stepsize - default is 4.0
58  - eta - default is 2. If a step size does not work,
59  stepsize/eta will be tried
60  - warmup - default is False
61  - warmup_lambdas - default is NULL, which means that lambda
62  values will be automatically generated
63  - warmup_lambda_no - default is 15. The number of lambda values used
64  in warm-up; overridden if warmup_lambdas
65  is not NULL
66  - warmup_tolerance - default is the same as tolerance. The value
67  of tolerance used during warmup
68  - use_active_set - default is False. Whether to use the active-set
69  method to speed up the computation
70  - activeset_tolerance - default is the same as tolerance. The
71  value of tolerance used during active-set
72  calculation
73  - random_stepsize - default is False. Whether to add some randomness
74  to the step size. Sometimes this can speed
75  up the calculation
76 
77 
78 Here, backtracking for step size is used. At each iteration, we first try the
79 <em>stepsize = max_stepsize</em>, and if it does not work out, we then try a
80 smaller step size <em>stepsize = stepsize / eta</em>, where <em>eta</em> must be
81 larger than 1. At first sight, this seems to repeat iterations within a single
82 step, but it actually greatly increases the computation speed by using a larger
83 step size and minimizing the total number of iterations. A careful choice of
84 max_stepsize can decrease the computation time by a factor of ten or more.
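The backtracking loop described above can be sketched as follows. This is a simplified, hypothetical version: the acceptance test here is just "the objective decreased", whereas FISTA's actual backtracking checks a quadratic upper-bound (majorization) condition:

```python
def backtracking_step(f, grad, w, max_stepsize=4.0, eta=2.0):
    # Try stepsize = max_stepsize first; if the step does not work
    # (simplified here to: the objective does not decrease), retry
    # with stepsize / eta, where eta > 1.
    step = max_stepsize
    g = grad(w)
    while True:
        w_new = [wi - step * gi for wi, gi in zip(w, g)]
        if f(w_new) <= f(w) or step < 1e-12:
            return w_new, step
        step /= eta
```

For f(w) = w^2 starting at w = 1 with max_stepsize = 4, the steps 4 and 2 overshoot and are rejected, and step = 1 is accepted.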
85 
86 
87 If <em>warmup</em> is <em>True</em>, a series of lambda values, which is
88 strictly decreasing and ends at the lambda value that the user wants to calculate,
89 will be used. A larger lambda gives a very sparse solution, and that sparse
90 solution is in turn used as the initial guess for the next lambda's solution,
91 which speeds up the computation for the next lambda. For larger data sets,
92 this can sometimes accelerate the whole computation and might be faster than
93 computing with only the final lambda value.
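The document does not specify how the lambda ladder is generated automatically; a geometric (strictly decreasing) schedule is one common choice, sketched here purely for illustration and assuming lambda_target < lambda_max:

```python
def warmup_lambda_ladder(lambda_target, n=15, lambda_max=1.0):
    # Strictly decreasing sequence from lambda_max down to lambda_target.
    if n == 1:
        return [lambda_target]
    ratio = (lambda_target / lambda_max) ** (1.0 / (n - 1))
    return [lambda_max * ratio ** i for i in range(n)]
```

Each solution along the ladder warm-starts the next, so the expensive final lambda begins from a good initial guess.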
94 
95 If <em>use_active_set</em> is <em>True</em>, the active-set method will be used to
96 speed up the computation. Considerable speedup is obtained by organizing the
97 iterations around the active set of features, i.e. those with nonzero coefficients.
98 After a complete cycle through all the variables, we iterate on only the active
99 set till convergence. If another complete cycle does not change the active set,
100 we are done, otherwise the process is repeated.
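The cycling strategy just described can be sketched abstractly. Here `update_coord(j, w)` stands for a hypothetical per-coordinate update that modifies `w[j]` in place and returns the magnitude of the change; it is a placeholder, not a MADlib function:

```python
def active_set_solve(update_coord, w, tol=1e-6, max_cycles=100):
    active = None
    for _ in range(max_cycles):
        for j in range(len(w)):                       # complete cycle over all variables
            update_coord(j, w)
        new_active = {j for j, wj in enumerate(w) if wj != 0.0}
        if new_active == active:                      # cycle did not change the active set
            return w
        active = new_active
        while active:                                 # iterate on the active set only
            if max(update_coord(j, w) for j in active) < tol:
                break
    return w
```

The outer loop alternates full sweeps with cheap active-set-only sweeps, terminating once a full sweep leaves the active set unchanged.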
101 
102 <b>(2) IGD</b>
103 
104 Incremental Gradient Descent (IGD) or Stochastic Gradient Descent (SGD) [3] has the following optimizer-specific parameters:
105 
106  - stepsize - default is 0.01
107  - threshold - default is 1e-10. When a coefficient is really
108  small, set it to 0
109  - warmup - default is False
110  - warmup_lambdas - default is NULL
111  - warmup_lambda_no - default is 15. The number of lambda values used
112  in warm-up; overridden if warmup_lambdas
113  is not NULL
114  - warmup_tolerance - default is the same as tolerance. The value
115  of tolerance used during warmup
116  - parallel - default is True. Whether to run the computation
117  on multiple segments
118 
119 Due to the stochastic nature of SGD, the fitted coefficients are rarely exactly
120 zero; tiny nonzero values appear instead. Therefore, <em>threshold</em> is needed at the end of
121 the computation to screen out those tiny values and hard set them to
122 zero. This is done as follows: (1) multiply each coefficient by the
123 standard deviation of the corresponding feature; (2) compute the average of the
124 absolute values of the rescaled coefficients; (3) divide each rescaled coefficient
125 by the average, and if the resulting absolute value is smaller than
126 <em>threshold</em>, set the original coefficient to zero.
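Steps (1)-(3) can be written out directly; an illustrative sketch with hypothetical names (the guard for an all-zero vector is an added assumption):

```python
def screen_coefficients(coefs, feature_stds, threshold):
    # (1) rescale each coefficient by its feature's standard deviation
    rescaled = [c * s for c, s in zip(coefs, feature_stds)]
    # (2) average absolute value of the rescaled coefficients
    avg = sum(abs(r) for r in rescaled) / len(rescaled)
    if avg == 0.0:                       # assumed guard: nothing to screen
        return list(coefs)
    # (3) zero out coefficients whose relative rescaled magnitude
    #     falls below the threshold
    return [0.0 if abs(r) / avg < threshold else c
            for c, r in zip(coefs, rescaled)]
```

A coefficient much smaller than the average rescaled magnitude is hard-set to zero, while the others keep their original (unscaled) values.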
127 
128 SGD is inherently a sequential algorithm. When running in a distributed way,
129 each segment of the data runs its own SGD model, and the models are averaged to
130 get a model for each iteration. This averaging might slow down the convergence
131 speed, although it allows processing large datasets on multiple
132 machines. So this algorithm provides an option, <em>parallel</em>, to let the user
133 choose whether to do the computation in parallel.
134 
135 <b>Stopping Criteria</b> Both optimizers compute the average difference between
136 the coefficients of two consecutive iterations, and if the difference is
137 smaller than <em>tolerance</em> or the iteration number is larger than
138 <em>max_iter</em>, the computation stops.
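The stopping rule above is simple enough to state as code; a sketch (hypothetical helper, not the internal implementation):

```python
def should_stop(w_prev, w_curr, iteration, tolerance=1e-6, max_iter=10000):
    # Average absolute difference between two consecutive coefficient vectors.
    avg_diff = sum(abs(a - b) for a, b in zip(w_prev, w_curr)) / len(w_curr)
    return avg_diff < tolerance or iteration > max_iter
```

The computation stops as soon as either condition fires, so a loose tolerance with a large max_iter typically terminates on the tolerance test.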
139 
140 <b>Online Help</b> The user can read short help messages by using any one of the following
141 \code
142 SELECT madlib.elastic_net_train();
143 SELECT madlib.elastic_net_train('usage');
144 SELECT madlib.elastic_net_train('predict');
145 SELECT madlib.elastic_net_train('gaussian');
146 SELECT madlib.elastic_net_train('binomial');
147 SELECT madlib.elastic_net_train('linear');
148 SELECT madlib.elastic_net_train('fista');
149 SELECT madlib.elastic_net_train('igd');
150 \endcode
151 
152 @input
153 
154 The <b>training examples</b> are expected to be of the following form:
155 \code
156 {TABLE|VIEW} <em>input_table</em> (
157  ...
158  <em>independentVariables</em> DOUBLE PRECISION[],
159  <em>dependentVariable</em> DOUBLE PRECISION,
160  ...
161 )
162 \endcode
163 
164 Null values are not expected.
165 
166 @usage
167 
168 <b>Pre-run </b> Usually one gets better results and faster convergence using
169 <em>standardize = True</em>.
170 <b>It is highly recommended to run the
171 <em>elastic_net_train</em> function on a subset of the data with a limited
172 <em>max_iter</em> before applying it to the full data set with a large
173 <em>max_iter</em>. In the pre-run, the user can tweak the parameters to get the
174 best performance and then apply the best set of parameters to the whole data
175 set.</b>
176 
177 - Get the fitting coefficients for a linear model:
178 
179 \code
180  SELECT {schema_madlib}.elastic_net_train (
181  'tbl_source', -- Data table
182  'tbl_result', -- Result table
183  'col_dep_var', -- Dependent variable, can be an expression
184  'col_ind_var', -- Independent variable, can be an expression or '*'
185  'regress_family', -- Response family: 'gaussian' (or 'linear'),
186  -- or 'binomial' (or 'logistic')
187  alpha, -- Elastic net control parameter, value in [0, 1]
188  lambda_value, -- Regularization parameter, positive
189  standardize, -- Whether to normalize the data. Default: True
190  'grouping_col', -- Group by which columns. Default: NULL
191  'optimizer', -- Name of optimizer. Default: 'fista'
192  'optimizer_params',-- Optimizer parameters, delimited by comma. Default: NULL
193  'excluded', -- Column names excluded from '*'. Default: NULL
194  max_iter, -- Maximum iteration number. Default: 10000
195  tolerance -- Stopping criteria. Default: 1e-6
196  );
197 \endcode
198 
199 If <em>col_ind_var</em> is '*', then all columns of <em>tbl_source</em> will be
200 used as features except those listed in the <em>excluded</em> string. If the
201 dependent variable is a column name, it is then automatically excluded from the
202 features. However, if the dependent variable is a valid Postgres expression,
203 then the column names inside this expression are not excluded unless explicitly
204 put into the <em>excluded</em> list. So it is a good idea to put all column
205 names involved in the dependent variable expression into the <em>excluded</em>
206 string.
207 
208 The <em>excluded</em> string is a list of column names excluded from features
209 delimited by comma. For example, 'col1, col2'. If it is NULL or an empty string
210 '', no column is excluded.
211 
212 If <em>col_ind_var</em> is a single column name of array type, one
213 can still use <em>excluded</em>. For example, if <em>x</em> is a column
214 holding arrays of size 1000, and the user wants to exclude the 100-th, 200-th
215 and 301-st elements of the array, <em>excluded</em> can be set to '100, 200,
216 301'.
217 
218 Both <em>col_dep_var</em> and <em>col_ind_var</em> can be valid Postgres
219 expressions. For example, <em>col_dep_var = 'log(y+1)'</em>, and <em>col_ind_var
220 = 'array[exp(x[1]), x[2], 1/(1+x[3])]'</em> etc. In the binomial case, one can
221 set <em>col_dep_var = 'y < 0'</em> etc.
222 
223  Output:
224  \code
225  family | features | features_selected | coef_nonzero | coef_all | intercept | log_likelihood | standardize | iteration_run
226  ------------------+------------+------------+------------+--------------+-------------+--------+--------+-----------
227  ...
228  \endcode
229 
230 where <em>log_likelihood</em> is just the negative value of the first equation above (up to a constant depending on the data set).
231 
232 - Get the <b>prediction</b> on a data set using a linear model:
233 \code
234 SELECT madlib.elastic_net_predict(
235  '<em>regress_family</em>', -- Response type, 'gaussian' ('linear') or 'binomial' ('logistic')
236  <em>coefficients</em>, -- fitting coefficients
237  <em>intercept</em>, -- fitting intercept
238  <em>independent Variables</em>
239 ) from tbl_data, tbl_train_result;
240 \endcode
241 The above function returns a double value for each data point.
242 When predicting with binomial models, the return value is 1
243 if the predicted result is True, and 0 if the prediction is
244 False.
245 
246 <b>Or</b>
247 
248 (1)
249 \code
250 SELECT madlib.elastic_net_gaussian_predict (
251  coefficients, intercept, ind_var
252 ) FROM tbl_result, tbl_new_source LIMIT 10;
253 \endcode
254 
255 (2)
256 \code
257 SELECT madlib.elastic_net_binomial_predict (
258  coefficients, intercept, ind_var
259 ) FROM tbl_result, tbl_new_source LIMIT 10;
260 \endcode
261 
262 This returns 10 BOOLEAN values.
263 
264 (3)
265 \code
266 SELECT madlib.elastic_net_binomial_prob (
267  coefficients, intercept, ind_var
268 ) FROM tbl_result, tbl_new_source LIMIT 10;
269 \endcode
270 
271 This returns 10 probability values for the True class.
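In math terms, the values returned by these prediction functions behave like the following sketch (illustrative Python, not the C implementation behind them):

```python
import math

def gaussian_predict(coefficients, intercept, ind_var):
    # linear model: intercept + w . x
    return intercept + sum(c * x for c, x in zip(coefficients, ind_var))

def binomial_prob(coefficients, intercept, ind_var):
    # probability of the True class under the logistic model
    return 1.0 / (1.0 + math.exp(-gaussian_predict(coefficients, intercept, ind_var)))

def binomial_predict(coefficients, intercept, ind_var):
    # 1 if the predicted class is True, 0 otherwise
    return 1 if binomial_prob(coefficients, intercept, ind_var) > 0.5 else 0
```

A zero linear score corresponds to probability 0.5, the boundary between the True and False predictions.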
272 
273 <b>Or</b> the user can use another prediction function, which stores the prediction
274 result in a table. This is useful if the user wants to use elastic net together with the general cross-validation function.
275 \code
276 SELECT madlib.elastic_net_predict(
277  'tbl_train_result',
278  'tbl_data',
279  'col_id', -- ID associated with each row
280  'tbl_predict' -- Prediction result
281 );
282 \endcode
283 
284 @examp
285 
286 -# Prepare an input table/view:
287 \code
288 CREATE TABLE en_data (
289  ind_var DOUBLE PRECISION[],
290  dep_var DOUBLE PRECISION
291 );
292 \endcode
293 -# Populate the input table with some data, which should be well-conditioned, e.g.:
294 \code
295 INSERT INTO en_data VALUES (array[1, 1], 0.89);
296 INSERT INTO en_data VALUES (array[0.67, -0.06], 0.3);
297 ...
298 INSERT INTO en_data VALUES (array[0.15, -1.3], -1.3);
299 \endcode
300 -# Learn the coefficients, e.g.:
301 \code
302 SELECT madlib.elastic_net_train('en_data', 'en_model', 'dep_var', 'ind_var',
303  'linear', 0.5, 0.1, True, NULL, 'igd',
304  'stepsize = 0.1, warmup = t, warmup_lambda_no = 3, warmup_lambdas = [0.4, 0.3, 0.2, 0.1], parallel = t',
305  '1', 10000, 1e-6);
306 \endcode
307 \code
308 SELECT madlib.elastic_net_predict(family, coef_all, intercept, ind_var)
309 FROM en_data, en_model;
310 \endcode
311 
312 @literature
313 
314 [1] Elastic net regularization. http://en.wikipedia.org/wiki/Elastic_net_regularization
315 
316 [2] Beck, A. and M. Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sciences 2(1), 183-202.
317 
318 [3] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1 Regularized Loss Minimization. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.
319 
320 @sa File elastic_net.sql_in documenting the SQL functions.
321 
322 */
323 
324 ------------------------------------------------------------------------
325 
326 /**
327  * @brief Interface for elastic net
328  *
329  * @param tbl_source Name of data source table
330  * @param tbl_result Name of the table to store the results
331  * @param col_ind_var Name of independent variable column, independent variable is an array
332  * @param col_dep_var Name of dependent variable column
333  * @param regress_family Response type (gaussian or binomial)
334  * @param alpha The elastic net parameter, [0, 1]
335  * @param lambda_value The regularization parameter
336  * @param standardize Whether to normalize the variables (default True)
337  * @param grouping_col List of columns on which to apply grouping
338  * (currently only a placeholder)
339  * @param optimizer The optimization algorithm, 'fista' or 'igd'. Default is 'fista'
340  * @param optimizer_params Parameters of the above optimizer,
341  * the format is 'arg = value, ...'. Default is NULL
342  * @param exclude Which columns to exclude? Default is NULL
343  * (applicable only if col_ind_var is set as * or a column of array,
344  * column names as 'col1, col2, ...' if col_ind_var is '*';
345  * element indices as '1,2,3, ...' if col_ind_var is a column of array)
346  * @param max_iter Maximum number of iterations to run the algorithm
347  * (default value of 10000)
348  * @param tolerance Iteration stopping criteria. Default is 1e-6
349  */
350 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
351  tbl_source TEXT,
352  tbl_result TEXT,
353  col_dep_var TEXT,
354  col_ind_var TEXT,
355  regress_family TEXT,
356  alpha DOUBLE PRECISION,
357  lambda_value DOUBLE PRECISION,
358  standardize BOOLEAN,
359  grouping_col TEXT,
360  optimizer TEXT,
361  optimizer_params TEXT,
362  excluded TEXT,
363  max_iter INTEGER,
364  tolerance DOUBLE PRECISION
365 ) RETURNS VOID AS $$
366 PythonFunction(elastic_net, elastic_net, elastic_net_train)
367 $$ LANGUAGE plpythonu;
368 
369 ------------------------------------------------------------------------
370 -- Overloaded functions
371 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
372  tbl_source TEXT,
373  tbl_result TEXT,
374  col_ind_var TEXT,
375  col_dep_var TEXT,
376  regress_family TEXT,
377  alpha DOUBLE PRECISION,
378  lambda_value DOUBLE PRECISION,
379  standardization BOOLEAN,
380  grouping_columns TEXT,
381  optimizer TEXT,
382  optimizer_params TEXT,
383  excluded TEXT,
384  max_iter INTEGER
385 ) RETURNS VOID AS $$
386 BEGIN
387  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
388  $9, $10, $11, $12, $13, 1e-6);
389 END;
390 $$ LANGUAGE plpgsql VOLATILE;
391 
392 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
393  tbl_source TEXT,
394  tbl_result TEXT,
395  col_ind_var TEXT,
396  col_dep_var TEXT,
397  regress_family TEXT,
398  alpha DOUBLE PRECISION,
399  lambda_value DOUBLE PRECISION,
400  standardization BOOLEAN,
401  grouping_columns TEXT,
402  optimizer TEXT,
403  optimizer_params TEXT,
404  excluded TEXT
405 ) RETURNS VOID AS $$
406 BEGIN
407  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
408  $9, $10, $11, $12, 10000);
409 END;
410 $$ LANGUAGE plpgsql VOLATILE;
411 
412 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
413  tbl_source TEXT,
414  tbl_result TEXT,
415  col_ind_var TEXT,
416  col_dep_var TEXT,
417  regress_family TEXT,
418  alpha DOUBLE PRECISION,
419  lambda_value DOUBLE PRECISION,
420  standardization BOOLEAN,
421  grouping_columns TEXT,
422  optimizer TEXT,
423  optimizer_params TEXT
424 ) RETURNS VOID AS $$
425 BEGIN
426  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
427  $9, $10, $11, NULL);
428 END;
429 $$ LANGUAGE plpgsql VOLATILE;
430 
431 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
432  tbl_source TEXT,
433  tbl_result TEXT,
434  col_ind_var TEXT,
435  col_dep_var TEXT,
436  regress_family TEXT,
437  alpha DOUBLE PRECISION,
438  lambda_value DOUBLE PRECISION,
439  standardization BOOLEAN,
440  grouping_columns TEXT,
441  optimizer TEXT
442 ) RETURNS VOID AS $$
443 BEGIN
444  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
445  $9, $10, NULL::TEXT);
446 END;
447 $$ LANGUAGE plpgsql VOLATILE;
448 
449 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
450  tbl_source TEXT,
451  tbl_result TEXT,
452  col_ind_var TEXT,
453  col_dep_var TEXT,
454  regress_family TEXT,
455  alpha DOUBLE PRECISION,
456  lambda_value DOUBLE PRECISION,
457  standardization BOOLEAN,
458  grouping_columns TEXT
459 ) RETURNS VOID AS $$
460 BEGIN
461  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
462  $9, 'FISTA');
463 END;
464 $$ LANGUAGE plpgsql VOLATILE;
465 
466 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
467  tbl_source TEXT,
468  tbl_result TEXT,
469  col_ind_var TEXT,
470  col_dep_var TEXT,
471  regress_family TEXT,
472  alpha DOUBLE PRECISION,
473  lambda_value DOUBLE PRECISION,
474  standardization BOOLEAN
475 ) RETURNS VOID AS $$
476 BEGIN
477  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
478  NULL);
479 END;
480 $$ LANGUAGE plpgsql VOLATILE;
481 
482 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
483  tbl_source TEXT,
484  tbl_result TEXT,
485  col_ind_var TEXT,
486  col_dep_var TEXT,
487  regress_family TEXT,
488  alpha DOUBLE PRECISION,
489  lambda_value DOUBLE PRECISION
490 ) RETURNS VOID AS $$
491 BEGIN
492  PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, True);
493 END;
494 $$ LANGUAGE plpgsql VOLATILE;
495 
496 ------------------------------------------------------------------------
497 
498 /**
499  * @brief Help function, to print out the supported families
500  */
501 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train ()
502 RETURNS TEXT AS $$
503 PythonFunction(elastic_net, elastic_net, elastic_net_help)
504 $$ LANGUAGE plpythonu;
505 
506 ------------------------------------------------------------------------
507 
508 /**
509  * @brief Help function, to print out the supported optimizer for a family
510  * or print out the parameter list for an optimizer
511  *
512  * @param family_or_optimizer Response type, 'gaussian' or 'binomial', or
513  * optimizer type
514  */
515 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
516  family_or_optimizer TEXT
517 ) RETURNS TEXT AS $$
518 PythonFunction(elastic_net, elastic_net, elastic_net_help)
519 $$ LANGUAGE plpythonu;
520 
521 ------------------------------------------------------------------------
522 ------------------------------------------------------------------------
523 ------------------------------------------------------------------------
524 
525 /**
526  * @brief Prediction, with the result stored in a table;
527  * can be used together with General-CV
528  * @param tbl_model The result from elastic_net_train
529  * @param tbl_new_source Data table
530  * @param col_id Unique ID associated with each row
531  * @param tbl_predict Prediction result
532  */
533 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_predict (
534  tbl_model TEXT,
535  tbl_new_source TEXT,
536  col_id TEXT,
537  tbl_predict TEXT
538 ) RETURNS VOID AS $$
539 PythonFunction(elastic_net, elastic_net, elastic_net_predict_all)
540 $$ LANGUAGE plpythonu;
541 
542 ------------------------------------------------------------------------
543 
544 /**
545  * @brief Prediction using learned coefficients for a given example
546  *
547  * @param regress_family model family
548  * @param coefficients The fitting coefficients
549  * @param intercept The fitting intercept
550  * @param ind_var Features (independent variables)
551  *
552  * returns a double value. When regress_family is 'binomial' or 'logistic',
553  * this function returns 1 for True and 0 for False
554  */
555 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_predict (
556  regress_family TEXT,
557  coefficients DOUBLE PRECISION[],
558  intercept DOUBLE PRECISION,
559  ind_var DOUBLE PRECISION[]
560 ) RETURNS DOUBLE PRECISION AS $$
561 DECLARE
562  family_name TEXT;
563  binomial_result BOOLEAN;
564 BEGIN
565  family_name := lower(regress_family);
566 
567  IF family_name = 'gaussian' OR family_name = 'linear' THEN
568  RETURN MADLIB_SCHEMA.elastic_net_gaussian_predict(coefficients, intercept, ind_var);
569  END IF;
570 
571  IF family_name = 'binomial' OR family_name = 'logistic' THEN
572  binomial_result := MADLIB_SCHEMA.elastic_net_binomial_predict(coefficients, intercept, ind_var);
573  IF binomial_result THEN
574  return 1;
575  ELSE
576  return 0;
577  END IF;
578  END IF;
579 
580  RAISE EXCEPTION 'This regression family is not supported!';
581 END;
582 $$ LANGUAGE plpgsql IMMUTABLE STRICT;
583 
584 ------------------------------------------------------------------------
585 
586  /**
587  * @brief Prediction for linear models using learned coefficients for a given example
588  *
589  * @param coefficients Linear fitting coefficients
590  * @param intercept Linear fitting intercept
591  * @param ind_var Features (independent variables)
592  *
593  * returns a double value
594  */
595 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_gaussian_predict (
596  coefficients DOUBLE PRECISION[],
597  intercept DOUBLE PRECISION,
598  ind_var DOUBLE PRECISION[]
599 ) RETURNS DOUBLE PRECISION AS
600 'MODULE_PATHNAME', '__elastic_net_gaussian_predict'
601 LANGUAGE C IMMUTABLE STRICT;
602 
603 ------------------------------------------------------------------------
604 /**
605  * @brief Prediction for logistic models using learned coefficients for a given example
606  *
607  * @param coefficients Logistic fitting coefficients
608  * @param intercept Logistic fitting intercept
609  * @param ind_var Features (independent variables)
610  *
611  * returns a boolean value
612  */
613 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_binomial_predict (
614  coefficients DOUBLE PRECISION[],
615  intercept DOUBLE PRECISION,
616  ind_var DOUBLE PRECISION[]
617 ) RETURNS BOOLEAN AS
618 'MODULE_PATHNAME', '__elastic_net_binomial_predict'
619 LANGUAGE C IMMUTABLE STRICT;
620 
621 ------------------------------------------------------------------------
622 /**
623  * @brief Compute the probability of belonging to the True class for a given observation
624  *
625  * @param coefficients Logistic fitting coefficients
626  * @param intercept Logistic fitting intercept
627  * @param ind_var Features (independent variables)
628  *
629  * returns a double value, which is the probability of this data point being True class
630  */
631 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_binomial_prob (
632  coefficients DOUBLE PRECISION[],
633  intercept DOUBLE PRECISION,
634  ind_var DOUBLE PRECISION[]
635 ) RETURNS DOUBLE PRECISION AS
636 'MODULE_PATHNAME', '__elastic_net_binomial_prob'
637 LANGUAGE C IMMUTABLE STRICT;
638 
639 ------------------------------------------------------------------------
640 /* Compute the log-likelihood for one data point */
641 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__elastic_net_binomial_loglikelihood (
642  coefficients DOUBLE PRECISION[],
643  intercept DOUBLE PRECISION,
644  dep_var BOOLEAN,
645  ind_var DOUBLE PRECISION[]
646 ) RETURNS DOUBLE PRECISION AS
647 'MODULE_PATHNAME', '__elastic_net_binomial_loglikelihood'
648 LANGUAGE C IMMUTABLE STRICT;
649 
650 ------------------------------------------------------------------------
651 -- Compute the solution for just one step ------------------------------
652 ------------------------------------------------------------------------
654 CREATE TYPE MADLIB_SCHEMA.__elastic_net_result AS (
655  intercept DOUBLE PRECISION,
656  coefficients DOUBLE PRECISION[],
657  lambda_value DOUBLE PRECISION
658 );
659 
660 ------------------------------------------------------------------------
661 
662 /* IGD */
663 
664 CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_transition (
665  state DOUBLE PRECISION[],
666  ind_var DOUBLE PRECISION[],
667  dep_var DOUBLE PRECISION,
668  pre_state DOUBLE PRECISION[],
669  lambda DOUBLE PRECISION,
670  alpha DOUBLE PRECISION,
671  dimension INTEGER,
672  stepsize DOUBLE PRECISION,
673  total_rows INTEGER,
674  xmean DOUBLE PRECISION[],
675  ymean DOUBLE PRECISION,
676  step_decay DOUBLE PRECISION
677 ) RETURNS DOUBLE PRECISION[]
678 AS 'MODULE_PATHNAME', 'gaussian_igd_transition'
679 LANGUAGE C IMMUTABLE;
680 
681 --
682 
683 CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_merge (
684  state1 DOUBLE PRECISION[],
685  state2 DOUBLE PRECISION[]
686 ) RETURNS DOUBLE PRECISION[] AS
687 'MODULE_PATHNAME', 'gaussian_igd_merge'
688 LANGUAGE C IMMUTABLE STRICT;
689 
690 --
691 
692 CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_final (
693  state DOUBLE PRECISION[]
694 ) RETURNS DOUBLE PRECISION[] AS
695 'MODULE_PATHNAME', 'gaussian_igd_final'
696 LANGUAGE C IMMUTABLE STRICT;
697 
698 /*
699  * Perform one iteration step of IGD for linear models
700  */
701 CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_igd_step(
702  /* ind_var */ DOUBLE PRECISION[],
703  /* dep_var */ DOUBLE PRECISION,
704  /* pre_state */ DOUBLE PRECISION[],
705  /* lambda */ DOUBLE PRECISION,
706  /* alpha */ DOUBLE PRECISION,
707  /* dimension */ INTEGER,
708  /* stepsize */ DOUBLE PRECISION,
709  /* total_rows */ INTEGER,
710  /* xmeans */ DOUBLE PRECISION[],
711  /* ymean */ DOUBLE PRECISION,
712  /* step_decay */ DOUBLE PRECISION
713 ) (
714  SType = DOUBLE PRECISION[],
715  SFunc = MADLIB_SCHEMA.__gaussian_igd_transition,
716  m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_igd_merge,')
717  FinalFunc = MADLIB_SCHEMA.__gaussian_igd_final,
718  InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
719 );
720 
721 CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_igd_step_single_seg (
722  /* ind_var */ DOUBLE PRECISION[],
723  /* dep_var */ DOUBLE PRECISION,
724  /* pre_state */ DOUBLE PRECISION[],
725  /* lambda */ DOUBLE PRECISION,
726  /* alpha */ DOUBLE PRECISION,
727  /* dimension */ INTEGER,
728  /* stepsize */ DOUBLE PRECISION,
729  /* total_rows */ INTEGER,
730  /* xmeans */ DOUBLE PRECISION[],
731  /* ymean */ DOUBLE PRECISION,
732  /* step_decay */ DOUBLE PRECISION
733 ) (
734  SType = DOUBLE PRECISION[],
735  SFunc = MADLIB_SCHEMA.__gaussian_igd_transition,
736  -- m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_igd_merge,')
737  FinalFunc = MADLIB_SCHEMA.__gaussian_igd_final,
738  InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
739 );
740 
741 --
742 
743 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_igd_state_diff (
744  state1 DOUBLE PRECISION[],
745  state2 DOUBLE PRECISION[]
746 ) RETURNS DOUBLE PRECISION AS
747 'MODULE_PATHNAME', '__gaussian_igd_state_diff'
748 LANGUAGE C IMMUTABLE STRICT;
749 
750 --
751 
752 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_igd_result (
753  in_state DOUBLE PRECISION[],
754  feature_sq DOUBLE PRECISION[],
755  threshold DOUBLE PRECISION,
756  tolerance DOUBLE PRECISION
757 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
758 'MODULE_PATHNAME', '__gaussian_igd_result'
759 LANGUAGE C IMMUTABLE STRICT;
760 
761 ------------------------------------------------------------------------
762 
763 /* FISTA */
764 
765 CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_transition (
766  state DOUBLE PRECISION[],
767  ind_var DOUBLE PRECISION[],
768  dep_var DOUBLE PRECISION,
769  pre_state DOUBLE PRECISION[],
770  lambda DOUBLE PRECISION,
771  alpha DOUBLE PRECISION,
772  dimension INTEGER,
773  total_rows INTEGER,
774  max_stepsize DOUBLE PRECISION,
775  eta DOUBLE PRECISION,
776  use_active_set INTEGER,
777  is_active INTEGER,
778  random_stepsize INTEGER
779 ) RETURNS DOUBLE PRECISION[]
780 AS 'MODULE_PATHNAME', 'gaussian_fista_transition'
781 LANGUAGE C IMMUTABLE;
782 
783 --
784 
785 CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_merge (
786  state1 DOUBLE PRECISION[],
787  state2 DOUBLE PRECISION[]
788 ) RETURNS DOUBLE PRECISION[] AS
789 'MODULE_PATHNAME', 'gaussian_fista_merge'
790 LANGUAGE C IMMUTABLE STRICT;
791 
792 --
793 
794 CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_final (
795  state DOUBLE PRECISION[]
796 ) RETURNS DOUBLE PRECISION[] AS
797 'MODULE_PATHNAME', 'gaussian_fista_final'
798 LANGUAGE C IMMUTABLE STRICT;
799 
800 /*
801  Perform one iteration step of FISTA for linear models
802  */
803 CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_fista_step(
804  /* ind_var */ DOUBLE PRECISION[],
805  /* dep_var */ DOUBLE PRECISION,
806  /* pre_state */ DOUBLE PRECISION[],
807  /* lambda */ DOUBLE PRECISION,
808  /* alpha */ DOUBLE PRECISION,
809  /* dimension */ INTEGER,
810  /* total_rows */ INTEGER,
811  /* max_stepsize */ DOUBLE PRECISION,
812  /* eta */ DOUBLE PRECISION,
813  /* use_active_set */ INTEGER,
814  /* is_active */ INTEGER,
815  /* random_stepsize */ INTEGER
816 ) (
817  SType = DOUBLE PRECISION[],
818  SFunc = MADLIB_SCHEMA.__gaussian_fista_transition,
819  m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_fista_merge,')
820  FinalFunc = MADLIB_SCHEMA.__gaussian_fista_final,
821  InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
822 );
823 
824 --
825 
826 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_fista_state_diff (
827  state1 DOUBLE PRECISION[],
828  state2 DOUBLE PRECISION[]
829 ) RETURNS DOUBLE PRECISION AS
830 'MODULE_PATHNAME', '__gaussian_fista_state_diff'
831 LANGUAGE C IMMUTABLE STRICT;
832 
833 --
834 
835 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_fista_result (
836  in_state DOUBLE PRECISION[]
837 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
838 'MODULE_PATHNAME', '__gaussian_fista_result'
839 LANGUAGE C IMMUTABLE STRICT;
840 
841 ------------------------------------------------------------------------
842 ------------------------------------------------------------------------
843 ------------------------------------------------------------------------
844 
845 /* Binomial IGD */
846 
847 CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_transition (
848  state DOUBLE PRECISION[],
849  ind_var DOUBLE PRECISION[],
850  dep_var BOOLEAN,
851  pre_state DOUBLE PRECISION[],
852  lambda DOUBLE PRECISION,
853  alpha DOUBLE PRECISION,
854  dimension INTEGER,
855  stepsize DOUBLE PRECISION,
856  total_rows INTEGER,
857  xmean DOUBLE PRECISION[],
858  ymean DOUBLE PRECISION,
859  step_decay DOUBLE PRECISION
860 ) RETURNS DOUBLE PRECISION[]
861 AS 'MODULE_PATHNAME', 'binomial_igd_transition'
862 LANGUAGE C IMMUTABLE;
863 
864 --
865 
866 CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_merge (
867  state1 DOUBLE PRECISION[],
868  state2 DOUBLE PRECISION[]
869 ) RETURNS DOUBLE PRECISION[] AS
870 'MODULE_PATHNAME', 'binomial_igd_merge'
871 LANGUAGE C IMMUTABLE STRICT;
872 
873 --
874 
875 CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_final (
876  state DOUBLE PRECISION[]
877 ) RETURNS DOUBLE PRECISION[] AS
878 'MODULE_PATHNAME', 'binomial_igd_final'
879 LANGUAGE C IMMUTABLE STRICT;
880 
881 /*
882  * Perform one iteration step of IGD for logistic models
883  */
884 CREATE AGGREGATE MADLIB_SCHEMA.__binomial_igd_step(
885  /* ind_var */ DOUBLE PRECISION[],
886  /* dep_var */ BOOLEAN,
887  /* pre_state */ DOUBLE PRECISION[],
888  /* lambda */ DOUBLE PRECISION,
889  /* alpha */ DOUBLE PRECISION,
890  /* dimension */ INTEGER,
891  /* stepsize */ DOUBLE PRECISION,
892  /* total_rows */ INTEGER,
893  /* xmeans */ DOUBLE PRECISION[],
894  /* ymean */ DOUBLE PRECISION,
895  /* step_decay */ DOUBLE PRECISION
896 ) (
897  SType = DOUBLE PRECISION[],
898  SFunc = MADLIB_SCHEMA.__binomial_igd_transition,
899  m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_igd_merge,')
900  FinalFunc = MADLIB_SCHEMA.__binomial_igd_final,
901  InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
902 );
903 
904 CREATE AGGREGATE MADLIB_SCHEMA.__binomial_igd_step_single_seg (
905  /* ind_var */ DOUBLE PRECISION[],
906  /* dep_var */ BOOLEAN,
907  /* pre_state */ DOUBLE PRECISION[],
908  /* lambda */ DOUBLE PRECISION,
909  /* alpha */ DOUBLE PRECISION,
910  /* dimension */ INTEGER,
911  /* stepsize */ DOUBLE PRECISION,
912  /* total_rows */ INTEGER,
913  /* xmeans */ DOUBLE PRECISION[],
914  /* ymean */ DOUBLE PRECISION,
915  /* step_decay */ DOUBLE PRECISION
916 ) (
917  SType = DOUBLE PRECISION[],
918  SFunc = MADLIB_SCHEMA.__binomial_igd_transition,
919  -- prefunc intentionally omitted so that IGD runs on a single segment
920  FinalFunc = MADLIB_SCHEMA.__binomial_igd_final,
921  InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
922 );
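For reference, a sketch of what one IGD transition amounts to in the binomial (logistic) case: a stochastic-gradient step on the logistic loss for a single row, followed by an elastic-net proximal update. Names are illustrative, and the real state handling (including the `xmean`/`ymean` centering) lives in the C functions; the epoch loop reads `step_decay` as a per-pass multiplicative decay of the stepsize, which is one plausible interpretation of that argument:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def igd_transition(w, x, y, lam, alpha, stepsize):
    """One IGD update for logistic elastic net: a gradient step on the
    logistic loss for a single sample (x, y), then soft-thresholding
    for the L1 term and a multiplicative shrink for the L2 term."""
    grad = (sigmoid(w @ x) - float(y)) * x
    u = w - stepsize * grad
    u = np.sign(u) * np.maximum(np.abs(u) - stepsize * lam * alpha, 0.0)
    return u / (1.0 + stepsize * lam * (1.0 - alpha))

def igd_epochs(X, ys, lam, alpha, stepsize, step_decay, n_epochs):
    """Run IGD over the data for several epochs, decaying the stepsize
    after each full pass."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x, y in zip(X, ys):
            w = igd_transition(w, x, y, lam, alpha, stepsize)
        stepsize *= step_decay
    return w
```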
923 
924 --
925 
926 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_igd_state_diff (
927  state1 DOUBLE PRECISION[],
928  state2 DOUBLE PRECISION[]
929 ) RETURNS DOUBLE PRECISION AS
930 'MODULE_PATHNAME', '__binomial_igd_state_diff'
931 LANGUAGE C IMMUTABLE STRICT;
932 
933 --
934 
935 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_igd_result (
936  in_state DOUBLE PRECISION[],
937  feature_sq DOUBLE PRECISION[],
938  threshold DOUBLE PRECISION,
939  tolerance DOUBLE PRECISION
940 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
941 'MODULE_PATHNAME', '__binomial_igd_result'
942 LANGUAGE C IMMUTABLE STRICT;
943 
944 ------------------------------------------------------------------------
945 
946 /* Binomial FISTA */
947 
948 CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_transition (
949  state DOUBLE PRECISION[],
950  ind_var DOUBLE PRECISION[],
951  dep_var BOOLEAN,
952  pre_state DOUBLE PRECISION[],
953  lambda DOUBLE PRECISION,
954  alpha DOUBLE PRECISION,
955  dimension INTEGER,
956  total_rows INTEGER,
957  max_stepsize DOUBLE PRECISION,
958  eta DOUBLE PRECISION,
959  use_active_set INTEGER,
960  is_active INTEGER,
961  random_stepsize INTEGER
962 ) RETURNS DOUBLE PRECISION[]
963 AS 'MODULE_PATHNAME', 'binomial_fista_transition'
964 LANGUAGE C IMMUTABLE;
965 
966 --
967 
968 CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_merge (
969  state1 DOUBLE PRECISION[],
970  state2 DOUBLE PRECISION[]
971 ) RETURNS DOUBLE PRECISION[] AS
972 'MODULE_PATHNAME', 'binomial_fista_merge'
973 LANGUAGE C IMMUTABLE STRICT;
974 
975 --
976 
977 CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_final (
978  state DOUBLE PRECISION[]
979 ) RETURNS DOUBLE PRECISION[] AS
980 'MODULE_PATHNAME', 'binomial_fista_final'
981 LANGUAGE C IMMUTABLE STRICT;
982 
983 /*
984  * Perform one iteration step of FISTA for logistic models
985  */
986 CREATE AGGREGATE MADLIB_SCHEMA.__binomial_fista_step(
987  /* ind_var */ DOUBLE PRECISION[],
988  /* dep_var */ BOOLEAN,
989  /* pre_state */ DOUBLE PRECISION[],
990  /* lambda */ DOUBLE PRECISION,
991  /* alpha */ DOUBLE PRECISION,
992  /* dimension */ INTEGER,
993  /* total_rows */ INTEGER,
994  /* max_stepsize */ DOUBLE PRECISION,
995  /* eta */ DOUBLE PRECISION,
996  /* use_active_set */ INTEGER,
997  /* is_active */ INTEGER,
998  /* random_stepsize */ INTEGER
999 ) (
1000  SType = DOUBLE PRECISION[],
1001  SFunc = MADLIB_SCHEMA.__binomial_fista_transition,
1002  m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_fista_merge,')
1003  FinalFunc = MADLIB_SCHEMA.__binomial_fista_final,
1004  InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
1005 );
1006 
1007 --
1008 
1009 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_fista_state_diff (
1010  state1 DOUBLE PRECISION[],
1011  state2 DOUBLE PRECISION[]
1012 ) RETURNS DOUBLE PRECISION AS
1013 'MODULE_PATHNAME', '__binomial_fista_state_diff'
1014 LANGUAGE C IMMUTABLE STRICT;
1015 
1016 --
1017 
1018 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_fista_result (
1019  in_state DOUBLE PRECISION[]
1020 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
1021 'MODULE_PATHNAME', '__binomial_fista_result'
1022 LANGUAGE C IMMUTABLE STRICT;
1023 
1024