00001 /* ----------------------------------------------------------------------- *//** 
00002  *
00003  * @file elastic_net.sql_in
00004  *
00005  * @brief SQL functions for elastic net regularization
00006  * @date July 2012
00007  *
00008  * @sa For a brief introduction to elastic net, see the module
00009  *     description \ref grp_elasticnet.
00010  *
00011  *//* ----------------------------------------------------------------------- */
00012 
00013 m4_include(`SQLCommon.m4') --'
00014 
00015 /**
00016 @addtogroup grp_elasticnet
00017 
00018 @about
00019 
00020 This module implements the elastic net regularization for regression problems.
00021 
00022 This method seeks to find a weight vector that, for any given training example set, minimizes:
00023 \f[\min_{w \in R^N} L(w) + \lambda \left(\frac{(1-\alpha)}{2} \|w\|_2^2 + \alpha \|w\|_1 \right)\f]
00024 where \f$L\f$ is the metric function that the user wants to minimize. Here \f$ \alpha \in [0,1] \f$
00025 and \f$ \lambda \geq 0 \f$. If \f$\alpha = 0\f$, we have ridge regularization (also known as Tikhonov regularization), and if \f$\alpha = 1\f$, we have LASSO regularization.
00026 
00027 For the Gaussian response family (or linear model), we have
00028 \f[L(\vec{w}) =  \frac{1}{2}\left[\frac{1}{M} \sum_{m=1}^M (\vec{w} \cdot \vec{x}_m + w_{0} - y_m)^2 \right]
00029 \f]
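As a concrete illustration, the Gaussian-family objective combines this loss with the penalty defined above. The following is a pure-Python sketch, not part of the SQL module; function and variable names are hypothetical:

```python
def elastic_net_objective(w, w0, X, y, lam, alpha):
    """Gaussian-family elastic net objective:
    L(w) + lam * ((1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1)."""
    M = len(y)
    # L(w): half the mean squared residual over the M training examples
    loss = 0.5 * sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + w0 - yi) ** 2
        for x, yi in zip(X, y)
    ) / M
    l2 = sum(wi * wi for wi in w)   # ||w||_2^2
    l1 = sum(abs(wi) for wi in w)   # ||w||_1
    return loss + lam * ((1 - alpha) / 2 * l2 + alpha * l1)
```

With lam = 0 this reduces to the plain least-squares loss; alpha interpolates between ridge (alpha = 0) and LASSO (alpha = 1).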
00030 
00031 For the Binomial response family (or logistic model), we have
00032 \f[
00033 L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
00034       \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
00035       \vec{w}\cdot\vec{x}_m}\right)\right]\ ,
00036 \f]
00037 where \f$y_m \in \{0,1\}\f$.
00038 
00039 To get better convergence, one can rescale each element of \f$x\f$,
00040 \f[ x' \leftarrow \frac{x - \bar{x}}{\sigma_x} \f]
00041 and, for the Gaussian case, also center the response,
00042 \f[y' \leftarrow y - \bar{y} \f]
00043 and then minimize with the regularization terms.
00044 At the end of the calculation, the original scales are restored, and an intercept term is obtained at the same time as a by-product.
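The rescaling and the restoration of the original scales can be sketched as follows (illustrative Python, not the module's implementation; the helper names are made up):

```python
from statistics import mean, pstdev

def standardize(xs):
    """x' = (x - mean) / std, as in the rescaling formula above."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs], m, s

def restore_scale(w_std, x_means, x_stds, y_mean):
    """Map coefficients fitted on standardized data back to the original
    scale; the intercept falls out as a by-product."""
    w = [wi / s for wi, s in zip(w_std, x_stds)]
    intercept = y_mean - sum(wi * m for wi, m in zip(w, x_means))
    return w, intercept
```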
00045 
00046 Note that fitting after scaling is not equivalent to fitting directly on the unscaled data.
00047 
00048 Two optimizers are currently supported. The default is FISTA; the other is IGD. Each has its own parameters, which can be specified in <em>optimizer_params</em> as a comma-delimited text string, for example, 'max_stepsize = 0.1, warmup = t, warmup_lambdas = [0.4, 0.3, 0.2]'.
00049 
00050 <b>(1) FISTA</b>
00051 
00052 Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [2] has the following optimizer-specific parameters:
00053 
00054         max_stepsize     - default is 4.0
00055         eta              - default is 2. If a stepsize does not work,
00056                            stepsize/eta will be tried
00057         warmup           - default is False
00058         warmup_lambdas   - default is NULL, which means that lambda
00059                            values will be automatically generated
00060         warmup_lambda_no - default is 15. The number of lambda values
00061                            used in warm-up; overridden if warmup_lambdas
00062                            is not NULL
00063         warmup_tolerance - default is the same as tolerance. The value
00064                            of tolerance used during warmup. 
00065         use_active_set   - default is False. Whether to use active-set
00066                            method to speed up the computation.
00067         activeset_tolerance - default is the same as tolerance. The
00068                               value of tolerance used during active set
00069                               calculation
00070         random_stepsize - default is False. Whether to add some
00071                           randomness to the step size. Sometimes this
00072                           can speed up the calculation.
00073 
00074 Here, backtracking for the step size is used. At each iteration, we first try <em>stepsize = max_stepsize</em>, and if it does not work out, we try a smaller step size <em>stepsize = stepsize / eta</em>, where <em>eta</em> must be larger than 1. At first sight this seems to perform repeated computations within a single step, but starting from a larger step size actually greatly increases the computation speed and minimizes the total number of iterations. A careful choice of max_stepsize can decrease the computation time by more than a factor of 10.
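The backtracking rule can be sketched as follows (illustrative Python, not the module's code; the smooth loss `f` and its gradient `grad` are supplied by the caller, and this is a simplified proximal step of the FISTA family):

```python
def soft_threshold(z, t):
    """Shrinkage operator: the proximal step for the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def backtracking_step(w, grad, f, lam, alpha, max_stepsize=4.0, eta=2.0):
    """Try stepsize = max_stepsize first; while the quadratic upper bound
    fails, shrink the step by eta (eta > 1), as described above."""
    stepsize = max_stepsize
    g = grad(w)
    while True:
        # proximal (shrinkage) step for the elastic net penalty
        w_new = [soft_threshold(wi - stepsize * gi, stepsize * lam * alpha)
                 / (1.0 + stepsize * lam * (1.0 - alpha))
                 for wi, gi in zip(w, g)]
        # backtracking test: f(w_new) must not exceed its quadratic bound at w
        bound = (f(w) + sum(gi * (wn - wi) for gi, wn, wi in zip(g, w_new, w))
                 + sum((wn - wi) ** 2 for wn, wi in zip(w_new, w)) / (2 * stepsize))
        if f(w_new) <= bound + 1e-12:
            return w_new, stepsize
        stepsize /= eta
```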
00075 
00076 If <em>warmup</em> is <em>True</em>, a series of lambda values, strictly decreasing and ending at the lambda value that the user wants to compute, will be used. A larger lambda gives a very sparse solution, and that sparse solution is used as the initial guess for the next lambda's solution, which speeds up the computation for the next lambda. For larger data sets, this can sometimes accelerate the whole computation and may be faster than computing with only one lambda value.
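A strictly decreasing warm-up schedule ending at the target lambda might look like the following sketch (an illustrative geometric schedule; the module generates its own sequence, and `lambda_max` here is an assumed starting value):

```python
def warmup_lambda_sequence(lambda_target, n=15, lambda_max=1.0):
    """n strictly decreasing lambda values ending exactly at lambda_target.
    Assumes 0 < lambda_target < lambda_max."""
    if n == 1:
        return [lambda_target]
    ratio = (lambda_target / lambda_max) ** (1.0 / (n - 1))
    return [lambda_max * ratio ** i for i in range(n)]
```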
00077 
00078 If <em>use_active_set</em> is <em>True</em>, the active-set method will be used to speed up the computation. Considerable speedup is obtained by organizing the iterations around the active set of features, i.e., those with nonzero coefficients. After a complete cycle through all the variables, we iterate only on the active set until convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated.
00079 
00080 <b>(2) IGD</b>
00081 
00082 Incremental Gradient Descent (IGD) or Stochastic Gradient Descent (SGD) [3] has the following optimizer-specific parameters:
00083 
00084         stepsize         - default is 0.01
00085         threshold        - default is 1e-10. When a coefficient is really
00086                            small, set it to be 0
00087         warmup           - default is False
00088         warmup_lambdas   - default is NULL
00089         warmup_lambda_no - default is 15. The number of lambda values
00090                            used in warm-up; overridden if warmup_lambdas
00091                            is not NULL
00092         warmup_tolerance - default is the same as tolerance. The value
00093                            of tolerance used during warmup. 
00094         parallel         - default is True. Run the computation on
00095                            multiple segments or not.
00096 
00097 Due to the stochastic nature of SGD, coefficients that should be exactly zero usually come out as very small but nonzero values. Therefore, <em>threshold</em> is needed at the end of the computation to screen out these tiny values and hard-set them to zero. This is done as follows: (1) multiply each coefficient by the standard deviation of the corresponding feature; (2) compute the average of the absolute values of the rescaled coefficients; (3) divide each rescaled coefficient by the average, and if the resulting absolute value is smaller than <em>threshold</em>, set the original coefficient to zero.
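The three screening steps can be sketched as follows (illustrative Python; `feature_stds` stands for the per-feature standard deviations, a name made up here):

```python
def screen_coefficients(coef, feature_stds, threshold=1e-10):
    """(1) rescale each coefficient by its feature's standard deviation,
    (2) average the absolute rescaled values, and (3) zero out every
    coefficient whose rescaled magnitude, divided by that average,
    falls below threshold."""
    rescaled = [c * s for c, s in zip(coef, feature_stds)]
    avg = sum(abs(r) for r in rescaled) / len(rescaled)
    if avg == 0.0:
        return list(coef)
    return [0.0 if abs(r) / avg < threshold else c
            for c, r in zip(coef, rescaled)]
```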
00098 
00099 SGD is by nature a sequential algorithm. When running in a distributed way, each segment of the data runs its own SGD model, and the models are averaged to obtain a single model for each iteration. This averaging may slow down convergence, although it gives us the ability to process large data sets on multiple machines. This algorithm therefore provides the option <em>parallel</em> to let the user choose whether to do the computation in parallel.
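The per-iteration model averaging amounts to the following sketch (illustrative; each inner list stands for the coefficient vector produced by one segment's SGD pass):

```python
def average_models(models):
    """Average the per-segment coefficient vectors into one model."""
    n = len(models)
    return [sum(m[i] for m in models) / n for i in range(len(models[0]))]
```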
00100 
00101 <b>Stopping Criteria</b> Both optimizers compute the average difference between the coefficients of two consecutive iterations, and if the difference is smaller than <em>tolerance</em> or the iteration number is larger than <em>max_iter</em>, the computation stops.
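The stopping rule amounts to the following sketch (illustrative Python, with default values matching the documented defaults):

```python
def should_stop(coef_prev, coef_curr, iteration,
                tolerance=1e-6, max_iter=10000):
    """Stop when the average coefficient change between two consecutive
    iterations drops below tolerance, or iteration reaches max_iter."""
    diff = sum(abs(a - b) for a, b in zip(coef_prev, coef_curr)) / len(coef_curr)
    return diff < tolerance or iteration >= max_iter
```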
00102 
00103 <b>Online Help</b> The user can read short help messages by using any one of the following:
00104 \code
00105 SELECT madlib.elastic_net_train();
00106 SELECT madlib.elastic_net_train('usage');
00107 SELECT madlib.elastic_net_train('predict');
00108 SELECT madlib.elastic_net_train('gaussian');
00109 SELECT madlib.elastic_net_train('binomial');
00110 SELECT madlib.elastic_net_train('linear');
00111 SELECT madlib.elastic_net_train('fista');
00112 SELECT madlib.elastic_net_train('igd');
00113 \endcode
00114 
00115 @input
00116 
00117 The <b>training data</b> is expected to be of the following form:
00118 <pre>{TABLE|VIEW} <em>input_table</em> (
00119     ...
00120     <em>independentVariables</em>   DOUBLE PRECISION[],
00121     <em>dependentVariable</em>      DOUBLE PRECISION,
00122     ...
00123 )</pre>
00124 
00125 Null values are not expected.
00126 
00127 @usage
00128 
00129 <b>Pre-run:</b> Usually one gets better results and faster convergence using <em>standardize = True</em>. <b>It is highly recommended to run the <em>elastic_net_train</em> function on a subset of the data with a limited <em>max_iter</em> before applying it to the full data set with a large <em>max_iter</em>. In the pre-run, the user can tweak the parameters to get the best performance and then apply the best set of parameters to the whole data set.</b>
00130 
00131 - Get the fitting coefficients for a linear model:
00132 
00133 <pre>
00134        SELECT {schema_madlib}.elastic_net_train (
00135             'tbl_source',     -- Data table
00136             'tbl_result',     -- Result table
00137             'col_dep_var',    -- Dependent variable, can be an expression
00138             'col_ind_var',    -- Independent variable, can be an expression
00139                                     or '*'
00140             'regress_family', -- 'gaussian' (or 'linear'), or 'binomial'
00141                                     (or 'logistic')
00142             alpha,            -- Elastic net control parameter, value in [0, 1]   
00143             lambda_value,     -- Regularization parameter, positive
00144             standardize,      -- Whether to normalize the data. Default: True
00145             'grouping_col',   -- Group by which columns. Default: NULL
00146             'optimizer',      -- Name of optimizer. Default: 'fista'
00147             'optimizer_params',-- Optimizer parameters, delimited by comma. Default: NULL
00148             'excluded',       -- Column names excluded from '*'. Default: NULL
00149             max_iter,         -- Maximum iteration number. Default: 10000
00150             tolerance         -- Stopping criteria. Default: 1e-6
00151         );
00152 </pre>
00153 
00154 If <em>col_ind_var</em> is '*', then all columns of <em>tbl_source</em> will be used as features except those listed in the <em>excluded</em> string. If the dependent variable is a column name, it is then automatically excluded from the features. However, if the dependent variable is a valid Postgres expression, then the column names inside this expression are not excluded unless explicitly put into the <em>excluded</em> list. So it is a good idea to put all column names involved in the dependent variable expression into the <em>excluded</em> string.
00155 
00156 The <em>excluded</em> string is a comma-delimited list of column names to exclude from the features, for example, 'col1, col2'. If it is NULL or an empty string '', no column is excluded.
00157 
00158 If <em>col_ind_var</em> is a single column name of array type, one can still use <em>excluded</em>. For example, if <em>x</em> is an array column of size 1000 and the user wants to exclude the 100th, 200th, and 301st elements of the array, <em>excluded</em> can be set to '100, 200, 301'.
00159 
00160 Both <em>col_dep_var</em> and <em>col_ind_var</em> can be valid Postgres expressions. For example, <em>col_dep_var = 'log(y+1)'</em> and <em>col_ind_var = 'array[exp(x[1]), x[2], 1/(1+x[3])]'</em>. In the binomial case, one can set <em>col_dep_var = 'y < 0'</em>, etc.
00161 
00162   Output:
00163   <pre> family | features | features_selected | coef_nonzero | coef_all | intercept | log_likelihood | standardize | iteration_run
00164   ------------------+------------+------------+------------+--------------+-------------+--------+--------+-----------
00165   ...
00166   </pre>
00167 
00168 where <em>log_likelihood</em> is just the negative value of the first equation above (up to a constant depending on the data set).
00169 
00170 - Get the \b prediction on a data set using a linear model:
00171 <pre>
00172 SELECT madlib.elastic_net_predict(
00173     '<em>regress_family</em>',  -- Response type, 'gaussian' ('linear') or 'binomial' ('logistic')
00174     <em>coefficients</em>,    -- fitting coefficients
00175     <em>intercept</em>,  -- fitting intercept
00176     <em>ind_var</em>          -- independent variables
00177 ) from tbl_data, tbl_train_result;
00178 </pre>
00179 The above function returns a double value for each data point.
00180 When predicting with binomial models, the return value is 1      
00181 if the predicted result is True, and 0 if the prediction is      
00182 False.   
00183 
00184 <b>Or</b>
00185 (1)
00186 <pre>
00187 SELECT madlib.elastic_net_gaussian_predict (                 
00188     coefficients, intercept, ind_var                         
00189 ) FROM tbl_result, tbl_new_source LIMIT 10;
00190 </pre>
00191 
00192 (2)
00193 <pre>
00194 SELECT madlib.elastic_net_binomial_predict (                 
00195     coefficients, intercept, ind_var                         
00196 ) FROM tbl_result, tbl_new_source LIMIT 10;
00197 </pre>
00198                                                                           
00199 This returns 10 BOOLEAN values.                              
00200                                                                           
00201 (3)
00202 <pre>
00203 SELECT madlib.elastic_net_binomial_prob (                    
00204     coefficients, intercept, ind_var                         
00205 ) FROM tbl_result, tbl_new_source LIMIT 10;
00206 </pre>
00207                                                                           
00208 This returns 10 probability values for the True class.
00209 
00210 <b>Or</b> the user can use another prediction function, which stores the prediction result in a table. This is useful if the user wants to use elastic net together with the general cross validation function.
00211 <pre>
00212 SELECT madlib.elastic_net_predict(
00213     '<em>tbl_train_result</em>',
00214     '<em>tbl_data</em>',
00215     '<em>col_id</em>',  -- ID associated with each row
00216     '<em>tbl_predict</em>'  -- Prediction result
00217 );
00218 </pre>
00219 
00220 @examp
00221 
00222 -# Prepare an input table/view:
00223 \code
00224 CREATE TABLE en_data (
00225     ind_var DOUBLE PRECISION[],
00226     dep_var DOUBLE PRECISION
00227 );
00228 \endcode     
00229 -# Populate the input table with some data, which should be well-conditioned, e.g.:
00230 \code
00231 mydb=# INSERT INTO en_data VALUES (array[1, 1], 0.89);
00232 mydb=# INSERT INTO en_data VALUES (array[0.67, -0.06], 0.3);
00233 ...
00234 mydb=# INSERT INTO en_data VALUES (array[0.15, -1.3], -1.3);
00235 \endcode
00236 -# Learn coefficients, e.g.:
00237 \code
00238 mydb=# SELECT madlib.elastic_net_train('en_data', 'en_model', 'dep_var', 'ind_var',
00239                                         'linear', 0.5, 0.1, True, NULL, 'igd',
00240                                         'stepsize = 0.1, warmup = t, warmup_lambda_no = 3,
00241                                          warmup_lambdas = [0.4, 0.3, 0.2, 0.1], parallel = t',
00242                                         '1', 10000, 1e-6);
00243 \endcode
00243 \code
00244 mydb=# SELECT madlib.elastic_net_predict(family, coef_all, intercept, ind_var)
00245 mydb-# FROM en_data, en_model;
00246 \endcode
00247 
00248 @literature
00249 
00250 [1] Elastic net regularization. http://en.wikipedia.org/wiki/Elastic_net_regularization
00251 
00252 [2] Beck, A. and M. Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sciences 2(1), 183-202.
00253 
00254 [3] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1 Regularized Loss Minimization. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. 
00255 
00256 @sa File elastic_net.sql_in documenting the SQL functions.
00257 
00258 */
00259 
00260 ------------------------------------------------------------------------
00261 
00262 /**
00263  * @brief Interface for elastic net
00264  *
00265  * @param tbl_source        Name of data source table
00266  * @param tbl_result        Name of the table to store the results
00267  * @param col_ind_var       Name of independent variable column, independent variable is an array
00268  * @param col_dep_var       Name of dependent variable column
00269  * @param regress_family    Response type (gaussian or binomial)
00270  * @param alpha             The elastic net parameter, [0, 1]
00271  * @param lambda_value            The regularization parameter
00272  * @param standardize   Whether to normalize the variables (default True)
00273  * @param grouping_col      List of columns on which to apply grouping 
00274  *                               (currently only a placeholder)
00275  * @param optimizer         The optimization algorithm, 'fista' or 'igd'. Default is 'fista'
00276  * @param optimizer_params  Parameters of the above optimizer, 
00277  *                                the format is 'arg = value, ...'. Default is NULL
00278  * @param exclude           Which columns to exclude? Default is NULL
00279  *                                 (applicable only if col_ind_var is set as * or a column of array,
00280  *                                   column names as 'col1, col2, ...' if col_ind_var is '*';
00281  *                                   element indices as '1,2,3, ...' if col_ind_var is a column of array)
00282  * @param max_iter  Maximum number of iterations to run the algorithm
00283  *                               (default value of 10000)
00284  * @param tolerance Iteration stopping criteria. Default is 1e-6
00285  */
00286 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00287     tbl_source          TEXT,
00288     tbl_result          TEXT,
00289     col_dep_var         TEXT,
00290     col_ind_var         TEXT,
00291     regress_family      TEXT,
00292     alpha               DOUBLE PRECISION,
00293     lambda_value        DOUBLE PRECISION,
00294     standardize         BOOLEAN,
00295     grouping_col        TEXT,
00296     optimizer           TEXT,
00297     optimizer_params    TEXT,
00298     excluded            TEXT,
00299     max_iter            INTEGER,
00300     tolerance           DOUBLE PRECISION
00301 ) RETURNS VOID AS $$
00302 PythonFunction(elastic_net, elastic_net, elastic_net_train)
00303 $$ LANGUAGE plpythonu;
00304 
00305 ------------------------------------------------------------------------
00306 -- Overloaded functions
00307 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00308     tbl_source          TEXT,
00309     tbl_result          TEXT,
00310     col_ind_var         TEXT,
00311     col_dep_var         TEXT,
00312     regress_family      TEXT,
00313     alpha               DOUBLE PRECISION,
00314     lambda_value        DOUBLE PRECISION,
00315     standardization     BOOLEAN,
00316     grouping_columns    TEXT,
00317     optimizer           TEXT,
00318     optimizer_params    TEXT,
00319     excluded            TEXT,
00320     max_iter            INTEGER
00321 ) RETURNS VOID AS $$
00322 BEGIN
00323     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8, 
00324         $9, $10, $11, $12, $13, 1e-6);
00325 END;
00326 $$ LANGUAGE plpgsql VOLATILE;
00327 
00328 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00329     tbl_source          TEXT,
00330     tbl_result          TEXT,
00331     col_ind_var         TEXT,
00332     col_dep_var         TEXT,
00333     regress_family      TEXT,
00334     alpha               DOUBLE PRECISION,
00335     lambda_value        DOUBLE PRECISION,
00336     standardization     BOOLEAN,
00337     grouping_columns    TEXT,
00338     optimizer           TEXT,
00339     optimizer_params    TEXT,
00340     excluded            TEXT
00341 ) RETURNS VOID AS $$
00342 BEGIN
00343     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8, 
00344         $9, $10, $11, $12, 10000);
00345 END;
00346 $$ LANGUAGE plpgsql VOLATILE;
00347 
00348 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00349     tbl_source          TEXT,
00350     tbl_result          TEXT,
00351     col_ind_var         TEXT,
00352     col_dep_var         TEXT,
00353     regress_family      TEXT,
00354     alpha               DOUBLE PRECISION,
00355     lambda_value        DOUBLE PRECISION,
00356     standardization     BOOLEAN,
00357     grouping_columns    TEXT,
00358     optimizer           TEXT,
00359     optimizer_params    TEXT
00360 ) RETURNS VOID AS $$
00361 BEGIN
00362     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8, 
00363         $9, $10, $11, NULL);
00364 END;
00365 $$ LANGUAGE plpgsql VOLATILE;
00366 
00367 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00368     tbl_source          TEXT,
00369     tbl_result          TEXT,
00370     col_ind_var         TEXT,
00371     col_dep_var         TEXT,
00372     regress_family      TEXT,
00373     alpha               DOUBLE PRECISION,
00374     lambda_value        DOUBLE PRECISION,
00375     standardization     BOOLEAN,
00376     grouping_columns    TEXT,
00377     optimizer           TEXT
00378 ) RETURNS VOID AS $$
00379 BEGIN
00380     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8, 
00381         $9, $10, NULL::TEXT);
00382 END;
00383 $$ LANGUAGE plpgsql VOLATILE;
00384 
00385 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00386     tbl_source          TEXT,
00387     tbl_result          TEXT,
00388     col_ind_var         TEXT,
00389     col_dep_var         TEXT,
00390     regress_family      TEXT,
00391     alpha               DOUBLE PRECISION,
00392     lambda_value        DOUBLE PRECISION,
00393     standardization     BOOLEAN,
00394     grouping_columns    TEXT
00395 ) RETURNS VOID AS $$
00396 BEGIN
00397     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8, 
00398         $9, 'FISTA');
00399 END;
00400 $$ LANGUAGE plpgsql VOLATILE;
00401 
00402 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00403     tbl_source          TEXT,
00404     tbl_result          TEXT,
00405     col_ind_var         TEXT,
00406     col_dep_var         TEXT,
00407     regress_family      TEXT,
00408     alpha               DOUBLE PRECISION,
00409     lambda_value        DOUBLE PRECISION,
00410     standardization     BOOLEAN
00411 ) RETURNS VOID AS $$
00412 BEGIN
00413     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8, 
00414         NULL);
00415 END;
00416 $$ LANGUAGE plpgsql VOLATILE;
00417 
00418 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00419     tbl_source          TEXT,
00420     tbl_result          TEXT,
00421     col_ind_var         TEXT,
00422     col_dep_var         TEXT,
00423     regress_family      TEXT,
00424     alpha               DOUBLE PRECISION,
00425     lambda_value        DOUBLE PRECISION
00426 ) RETURNS VOID AS $$
00427 BEGIN
00428     PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, True);
00429 END;
00430 $$ LANGUAGE plpgsql VOLATILE;
00431 
00432 ------------------------------------------------------------------------
00433 
00434 /**
00435  * @brief Help function, to print out the supported families
00436  */
00437 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train ()
00438 RETURNS TEXT AS $$
00439 PythonFunction(elastic_net, elastic_net, elastic_net_help)
00440 $$ LANGUAGE plpythonu;
00441 
00442 ------------------------------------------------------------------------
00443 
00444 /**
00445  * @brief Help function, to print out the supported optimizer for a family
00446  * or print out the parameter list for an optimizer
00447  *
00448  * @param family_or_optimizer   Response type, 'gaussian' or 'binomial', or
00449  * optimizer type
00450  */
00451 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
00452     family_or_optimizer  TEXT
00453 ) RETURNS TEXT AS $$
00454 PythonFunction(elastic_net, elastic_net, elastic_net_help)
00455 $$ LANGUAGE plpythonu;
00456 
00457 ------------------------------------------------------------------------
00458 ------------------------------------------------------------------------
00459 ------------------------------------------------------------------------
00460 
00461 /**
00462  * @brief Run prediction and store the result in a table;
00463  *        can be used together with general cross-validation
00464  * @param tbl_model The result from elastic_net_train
00465  * @param tbl_new_source Data table
00466  * @param col_id Unique ID associated with each row
00467  * @param tbl_predict Prediction result
00468  */
00469 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_predict (
00470     tbl_model       TEXT,
00471     tbl_new_source  TEXT,
00472     col_id          TEXT,
00473     tbl_predict     TEXT
00474 ) RETURNS VOID AS $$
00475 PythonFunction(elastic_net, elastic_net, elastic_net_predict_all)
00476 $$ LANGUAGE plpythonu;
00477 
00478 ------------------------------------------------------------------------
00479 
00480 /**
00481  * @brief Predict using learned coefficients for a given example
00482  *
00483  * @param regress_family    model family
00484  * @param coefficients      The fitting coefficients
00485  * @param intercept         The fitting intercept
00486  * @param ind_var           Features (independent variables)
00487  *
00488  * returns a double value. When regress_family is 'binomial' or 'logistic',
00489  * this function returns 1 for True and 0 for False
00490  */
00491 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_predict (
00492     regress_family  TEXT,
00493     coefficients    DOUBLE PRECISION[],
00494     intercept       DOUBLE PRECISION,
00495     ind_var         DOUBLE PRECISION[]
00496 ) RETURNS DOUBLE PRECISION AS $$
00497 DECLARE
00498     family_name     TEXT;
00499     binomial_result BOOLEAN;
00500 BEGIN
00501     family_name := lower(regress_family);
00502 
00503     IF family_name = 'gaussian' OR family_name = 'linear' THEN
00504         RETURN MADLIB_SCHEMA.elastic_net_gaussian_predict(coefficients, intercept, ind_var);
00505     END IF;
00506 
00507     IF family_name = 'binomial' OR family_name = 'logistic' THEN
00508         binomial_result := MADLIB_SCHEMA.elastic_net_binomial_predict(coefficients, intercept, ind_var);
00509         IF binomial_result THEN
00510             return 1;
00511         ELSE
00512             return 0;
00513         END IF;
00514     END IF;
00515 
00516     RAISE EXCEPTION 'This regression family is not supported!';
00517 END;
00518 $$ LANGUAGE plpgsql IMMUTABLE STRICT;
00519 
00520 ------------------------------------------------------------------------
00521 
00522  /**
00523  * @brief Prediction for linear models using learned coefficients for a given example
00524  *
00525  * @param coefficients      Linear fitting coefficients
00526  * @param intercept         Linear fitting intercept
00527  * @param ind_var           Features (independent variables)
00528  *
00529  * returns a double value
00530  */
00531 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_gaussian_predict (
00532     coefficients    DOUBLE PRECISION[],
00533     intercept       DOUBLE PRECISION,
00534     ind_var         DOUBLE PRECISION[]
00535 ) RETURNS DOUBLE PRECISION AS
00536 'MODULE_PATHNAME', '__elastic_net_gaussian_predict'
00537 LANGUAGE C IMMUTABLE STRICT;
00538 
00539 ------------------------------------------------------------------------
00540 /**
00541  * @brief Prediction for logistic models using learned coefficients for a given example
00542  *
00543  * @param coefficients      Logistic fitting coefficients
00544  * @param intercept         Logistic fitting intercept
00545  * @param ind_var           Features (independent variables)
00546  *
00547  * returns a boolean value
00548  */
00549 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_binomial_predict (
00550     coefficients    DOUBLE PRECISION[],
00551     intercept       DOUBLE PRECISION,
00552     ind_var         DOUBLE PRECISION[]
00553 ) RETURNS BOOLEAN AS
00554 'MODULE_PATHNAME', '__elastic_net_binomial_predict'
00555 LANGUAGE C IMMUTABLE STRICT;
00556 
00557 ------------------------------------------------------------------------
00558 /**
00559  * @brief Compute the probability of belonging to the True class for a given observation
00560  *
00561  * @param coefficients      Logistic fitting coefficients
00562  * @param intercept         Logistic fitting intercept
00563  * @param ind_var           Features (independent variables)
00564  *
00565  * returns a double value, which is the probability of this data point being True class
00566  */
00567 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_binomial_prob (
00568     coefficients    DOUBLE PRECISION[],
00569     intercept       DOUBLE PRECISION,
00570     ind_var         DOUBLE PRECISION[]
00571 ) RETURNS DOUBLE PRECISION AS
00572 'MODULE_PATHNAME', '__elastic_net_binomial_prob'
00573 LANGUAGE C IMMUTABLE STRICT;
00574 
00575 ------------------------------------------------------------------------
00576 /* Compute the log-likelihood for one data point */
00577 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__elastic_net_binomial_loglikelihood (
00578     coefficients    DOUBLE PRECISION[],
00579     intercept       DOUBLE PRECISION,
00580     dep_var         BOOLEAN,
00581     ind_var         DOUBLE PRECISION[]
00582 ) RETURNS DOUBLE PRECISION AS
00583 'MODULE_PATHNAME', '__elastic_net_binomial_loglikelihood'
00584 LANGUAGE C IMMUTABLE STRICT;
00585 
00586 ------------------------------------------------------------------------
00587 -- Compute the solution for just one step ------------------------------
00588 ------------------------------------------------------------------------
00589 
00590 CREATE TYPE MADLIB_SCHEMA.__elastic_net_result AS (
00591     intercept       DOUBLE PRECISION,
00592     coefficients    DOUBLE PRECISION[],
00593     lambda_value    DOUBLE PRECISION
00594 );

------------------------------------------------------------------------

/* IGD */

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_transition (
    state               DOUBLE PRECISION[],
    ind_var             DOUBLE PRECISION[],
    dep_var             DOUBLE PRECISION,
    pre_state           DOUBLE PRECISION[],
    lambda              DOUBLE PRECISION,
    alpha               DOUBLE PRECISION,
    dimension           INTEGER,
    stepsize            DOUBLE PRECISION,
    total_rows          INTEGER,
    xmean               DOUBLE PRECISION[],
    ymean               DOUBLE PRECISION,
    step_decay          DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'gaussian_igd_transition'
LANGUAGE C IMMUTABLE;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_merge (
    state1              DOUBLE PRECISION[],
    state2              DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_igd_merge'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_final (
    state               DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_igd_final'
LANGUAGE C IMMUTABLE STRICT;

/*
 * Perform one iteration step of IGD for linear models
 */
CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_igd_step(
    /* ind_var */           DOUBLE PRECISION[],
    /* dep_var */           DOUBLE PRECISION,
    /* pre_state */         DOUBLE PRECISION[],
    /* lambda  */           DOUBLE PRECISION,
    /* alpha */             DOUBLE PRECISION,
    /* dimension */         INTEGER,
    /* stepsize */          DOUBLE PRECISION,
    /* total_rows */        INTEGER,
    /* xmeans */            DOUBLE PRECISION[],
    /* ymean */             DOUBLE PRECISION,
    /* step_decay */        DOUBLE PRECISION
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__gaussian_igd_transition,
    m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_igd_merge,')
    FinalFunc = MADLIB_SCHEMA.__gaussian_igd_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);
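
The aggregate above performs one IGD pass over the data; an outer driver (in MADlib, a Python layer) feeds each iteration's output state back in as `pre_state` and stops once `__gaussian_igd_state_diff` between consecutive states falls below a tolerance. A hypothetical single-iteration call — table names, column names, and parameter values below are placeholders, and the state array must match the `InitCond` width above:

```sql
-- Hypothetical: one IGD iteration over a training table "tbl_train"
-- with columns x DOUBLE PRECISION[] and y DOUBLE PRECISION.
SELECT MADLIB_SCHEMA.__gaussian_igd_step(
           x,                                   -- ind_var
           y,                                   -- dep_var
           prev.state,                          -- pre_state from the previous iteration
           0.1,                                 -- lambda
           0.5,                                 -- alpha (elastic-net mixing)
           4,                                   -- dimension of x
           0.01,                                -- stepsize
           1000,                                -- total_rows
           ARRAY[0,0,0,0]::DOUBLE PRECISION[],  -- xmean (rescaling disabled here)
           0.0,                                 -- ymean
           1.0                                  -- step_decay (1 = constant stepsize)
       ) AS state
FROM tbl_train, prev_state AS prev;             -- prev_state: hypothetical one-row state table
```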

CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_igd_step_single_seg (
    /* ind_var */           DOUBLE PRECISION[],
    /* dep_var */           DOUBLE PRECISION,
    /* pre_state */         DOUBLE PRECISION[],
    /* lambda  */           DOUBLE PRECISION,
    /* alpha */             DOUBLE PRECISION,
    /* dimension */         INTEGER,
    /* stepsize */          DOUBLE PRECISION,
    /* total_rows */        INTEGER,
    /* xmeans */            DOUBLE PRECISION[],
    /* ymean */             DOUBLE PRECISION,
    /* step_decay */        DOUBLE PRECISION
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__gaussian_igd_transition,
    -- m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_igd_merge,')
    FinalFunc = MADLIB_SCHEMA.__gaussian_igd_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_igd_state_diff (
    state1          DOUBLE PRECISION[],
    state2          DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__gaussian_igd_state_diff'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_igd_result (
    in_state        DOUBLE PRECISION[],
    feature_sq      DOUBLE PRECISION[],
    threshold       DOUBLE PRECISION,
    tolerance       DOUBLE PRECISION
) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
'MODULE_PATHNAME', '__gaussian_igd_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

/* FISTA */

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_transition (
    state               DOUBLE PRECISION[],
    ind_var             DOUBLE PRECISION[],
    dep_var             DOUBLE PRECISION,
    pre_state           DOUBLE PRECISION[],
    lambda              DOUBLE PRECISION,
    alpha               DOUBLE PRECISION,
    dimension           INTEGER,
    total_rows          INTEGER,
    max_stepsize        DOUBLE PRECISION,
    eta                 DOUBLE PRECISION,
    use_active_set      INTEGER,
    is_active           INTEGER,
    random_stepsize     INTEGER
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'gaussian_fista_transition'
LANGUAGE C IMMUTABLE;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_merge (
    state1              DOUBLE PRECISION[],
    state2              DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_fista_merge'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_final (
    state               DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_fista_final'
LANGUAGE C IMMUTABLE STRICT;

/*
 * Perform one iteration step of FISTA for linear models
 */
CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_fista_step(
    /* ind_var      */      DOUBLE PRECISION[],
    /* dep_var      */      DOUBLE PRECISION,
    /* pre_state    */      DOUBLE PRECISION[],
    /* lambda       */      DOUBLE PRECISION,
    /* alpha        */      DOUBLE PRECISION,
    /* dimension    */      INTEGER,
    /* total_rows   */      INTEGER,
    /* max_stepsize */      DOUBLE PRECISION,
    /* eta          */      DOUBLE PRECISION,
    /* use_active_set */    INTEGER,
    /* is_active */         INTEGER,
    /* random_stepsize */   INTEGER
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__gaussian_fista_transition,
    m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_fista_merge,')
    FinalFunc = MADLIB_SCHEMA.__gaussian_fista_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_fista_state_diff (
    state1          DOUBLE PRECISION[],
    state2          DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__gaussian_fista_state_diff'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_fista_result (
    in_state        DOUBLE PRECISION[]
) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
'MODULE_PATHNAME', '__gaussian_fista_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------

/* Binomial IGD */

CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_transition (
    state               DOUBLE PRECISION[],
    ind_var             DOUBLE PRECISION[],
    dep_var             BOOLEAN,
    pre_state           DOUBLE PRECISION[],
    lambda              DOUBLE PRECISION,
    alpha               DOUBLE PRECISION,
    dimension           INTEGER,
    stepsize            DOUBLE PRECISION,
    total_rows          INTEGER,
    xmean               DOUBLE PRECISION[],
    ymean               DOUBLE PRECISION,
    step_decay          DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'binomial_igd_transition'
LANGUAGE C IMMUTABLE;

--

CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_merge (
    state1              DOUBLE PRECISION[],
    state2              DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'binomial_igd_merge'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_final (
    state               DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'binomial_igd_final'
LANGUAGE C IMMUTABLE STRICT;

/*
 * Perform one iteration step of IGD for logistic models
 */
CREATE AGGREGATE MADLIB_SCHEMA.__binomial_igd_step(
    /* ind_var */           DOUBLE PRECISION[],
    /* dep_var */           BOOLEAN,
    /* pre_state */         DOUBLE PRECISION[],
    /* lambda  */           DOUBLE PRECISION,
    /* alpha */             DOUBLE PRECISION,
    /* dimension */         INTEGER,
    /* stepsize */          DOUBLE PRECISION,
    /* total_rows */        INTEGER,
    /* xmeans */            DOUBLE PRECISION[],
    /* ymean */             DOUBLE PRECISION,
    /* step_decay */        DOUBLE PRECISION
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__binomial_igd_transition,
    m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_igd_merge,')
    FinalFunc = MADLIB_SCHEMA.__binomial_igd_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

CREATE AGGREGATE MADLIB_SCHEMA.__binomial_igd_step_single_seg (
    /* ind_var */           DOUBLE PRECISION[],
    /* dep_var */           BOOLEAN,
    /* pre_state */         DOUBLE PRECISION[],
    /* lambda  */           DOUBLE PRECISION,
    /* alpha */             DOUBLE PRECISION,
    /* dimension */         INTEGER,
    /* stepsize */          DOUBLE PRECISION,
    /* total_rows */        INTEGER,
    /* xmeans */            DOUBLE PRECISION[],
    /* ymean */             DOUBLE PRECISION,
    /* step_decay */        DOUBLE PRECISION
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__binomial_igd_transition,
    -- m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_igd_merge,')
    FinalFunc = MADLIB_SCHEMA.__binomial_igd_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_igd_state_diff (
    state1          DOUBLE PRECISION[],
    state2          DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__binomial_igd_state_diff'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_igd_result (
    in_state        DOUBLE PRECISION[],
    feature_sq      DOUBLE PRECISION[],
    threshold       DOUBLE PRECISION,
    tolerance       DOUBLE PRECISION
) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
'MODULE_PATHNAME', '__binomial_igd_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

/* Binomial FISTA */

CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_transition (
    state               DOUBLE PRECISION[],
    ind_var             DOUBLE PRECISION[],
    dep_var             BOOLEAN,
    pre_state           DOUBLE PRECISION[],
    lambda              DOUBLE PRECISION,
    alpha               DOUBLE PRECISION,
    dimension           INTEGER,
    total_rows          INTEGER,
    max_stepsize        DOUBLE PRECISION,
    eta                 DOUBLE PRECISION,
    use_active_set      INTEGER,
    is_active           INTEGER,
    random_stepsize     INTEGER
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'binomial_fista_transition'
LANGUAGE C IMMUTABLE;

--

CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_merge (
    state1              DOUBLE PRECISION[],
    state2              DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'binomial_fista_merge'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_final (
    state               DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'binomial_fista_final'
LANGUAGE C IMMUTABLE STRICT;

/*
 * Perform one iteration step of FISTA for logistic models
 */
CREATE AGGREGATE MADLIB_SCHEMA.__binomial_fista_step(
    /* ind_var      */      DOUBLE PRECISION[],
    /* dep_var      */      BOOLEAN,
    /* pre_state    */      DOUBLE PRECISION[],
    /* lambda       */      DOUBLE PRECISION,
    /* alpha        */      DOUBLE PRECISION,
    /* dimension    */      INTEGER,
    /* total_rows   */      INTEGER,
    /* max_stepsize */      DOUBLE PRECISION,
    /* eta          */      DOUBLE PRECISION,
    /* use_active_set */    INTEGER,
    /* is_active */         INTEGER,
    /* random_stepsize */   INTEGER
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__binomial_fista_transition,
    m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_fista_merge,')
    FinalFunc = MADLIB_SCHEMA.__binomial_fista_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_fista_state_diff (
    state1          DOUBLE PRECISION[],
    state2          DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__binomial_fista_state_diff'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_fista_result (
    in_state        DOUBLE PRECISION[]
) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
'MODULE_PATHNAME', '__binomial_fista_result'
LANGUAGE C IMMUTABLE STRICT;