MADlib
0.7
User Documentation
/* ----------------------------------------------------------------------- *//**
 *
 * @file elastic_net.sql_in
 *
 * @brief SQL functions for elastic net regularization
 * @date July 2012
 *
 * @sa For a brief introduction to elastic net, see the module
 *     description \ref grp_lasso.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4') --'

/**
@addtogroup grp_elasticnet

@about

This module implements elastic net regularization for regression problems.

The method seeks a weight vector that, for a given set of training examples, minimizes:
\f[\min_{w \in R^N} L(w) + \lambda \left(\frac{(1-\alpha)}{2} \|w\|_2^2 + \alpha \|w\|_1 \right)\f]
where \f$L\f$ is the loss function that the user wants to minimize. Here \f$ \alpha \in [0,1] \f$
and \f$ \lambda \geq 0 \f$. If \f$ \alpha = 0 \f$, we have ridge regularization (also known as Tikhonov regularization), and if \f$ \alpha = 1 \f$, we have LASSO regularization.

For the Gaussian response family (or linear model), we have
\f[L(\vec{w}) = \frac{1}{2}\left[\frac{1}{M} \sum_{m=1}^M (\vec{w}^{T} \vec{x}_m + w_{0} - y_m)^2 \right]
\f]

For the Binomial response family (or logistic model), we have
\f[
L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 +
\vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 +
\vec{w}\cdot\vec{x}_m}\right)\right]\ ,
\f]
where \f$y_m \in \{0,1\}\f$.

To get better convergence, one can rescale each element of x
\f[ x' \leftarrow \frac{x - \bar{x}}{\sigma_x} \f]
and, for the Gaussian case, also let
\f[ y' \leftarrow y - \bar{y} \f]
and then minimize with the regularization terms.
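To make the objective concrete, here is a small NumPy sketch of the Gaussian-family objective above. This is illustrative only; `enet_objective` is not a MADlib function, and the data is made up.

```python
import numpy as np

def enet_objective(w, w0, X, y, lam, alpha):
    """Gaussian-family elastic net objective:
    L(w) + lam * ((1 - alpha)/2 * ||w||_2^2 + alpha * ||w||_1),
    with L the halved mean squared error from the formula above."""
    loss = 0.5 * np.mean((X @ w + w0 - y) ** 2)
    penalty = lam * ((1 - alpha) / 2 * np.dot(w, w) + alpha * np.abs(w).sum())
    return loss + penalty

# alpha = 1 recovers the LASSO penalty, alpha = 0 the ridge penalty.
X = np.array([[1.0, 1.0], [0.67, -0.06], [0.15, -1.3]])
y = np.array([0.89, 0.3, -1.3])
w = np.array([0.5, 0.9])
print(enet_objective(w, 0.0, X, y, lam=0.1, alpha=0.5))
```

Setting `lam = 0` reduces the objective to the plain loss, which is a quick sanity check on any implementation.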
At the end of the calculation, the original scales are restored, and an intercept term is obtained at the same time as a by-product.

Note that fitting after scaling is not equivalent to fitting without scaling.

Right now, two optimizers are supported. The default one is FISTA; the other is IGD. Each has its own parameters, which can be specified in <em>optimizer_params</em> as a comma-separated string, for example 'max_stepsize = 0.1, warmup = t, warmup_lambdas = [0.4, 0.3, 0.2]'.

<b>(1) FISTA</b>

The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [2] has the following optimizer-specific parameters:

    max_stepsize        - default is 4.0
    eta                 - default is 2; if stepsize does not work,
                          stepsize/eta will be tried
    warmup              - default is False
    warmup_lambdas      - default is NULL, which means that lambda
                          values will be automatically generated
    warmup_lambda_no    - default is 15. How many lambdas are used in
                          warm-up; overridden if warmup_lambdas
                          is not NULL
    warmup_tolerance    - default is the same as tolerance. The value
                          of tolerance used during warmup.
    use_active_set      - default is False. Whether to use the active-set
                          method to speed up the computation.
    activeset_tolerance - default is the same as tolerance. The
                          value of tolerance used during active-set
                          calculation
    random_stepsize     - default is False. Whether to add some randomness
                          to the step size. Sometimes this can speed
                          up the calculation.

Here, backtracking for the step size is used. At each iteration, we first try <em>stepsize = max_stepsize</em>, and if it does not work out, we then try a smaller step size, <em>stepsize = stepsize / eta</em>, where <em>eta</em> must be larger than 1.
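The backtracking rule just described can be sketched in NumPy. This is not the C implementation MADlib uses; `soft_threshold`, `backtracking_prox_step`, and the data are illustrative, the ridge part of the penalty is folded into the smooth term, and the momentum step that makes the method FISTA rather than plain proximal gradient is omitted for brevity.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (component-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def backtracking_prox_step(w, grad_smooth, f_smooth, lam, alpha,
                           max_stepsize=4.0, eta=2.0):
    """One proximal-gradient step with backtracking: start from
    max_stepsize and divide by eta (eta > 1) until the quadratic
    upper-bound condition holds."""
    g = grad_smooth(w)
    stepsize = max_stepsize
    while True:
        w_new = soft_threshold(w - stepsize * g, stepsize * lam * alpha)
        d = w_new - w
        # Sufficient-decrease (quadratic upper bound) condition
        if f_smooth(w_new) <= f_smooth(w) + g @ d + d @ d / (2.0 * stepsize):
            return w_new, stepsize
        stepsize /= eta

# Tiny illustration on made-up data
X = np.array([[1.0, 1.0], [0.67, -0.06], [0.15, -1.3]])
y = np.array([0.89, 0.3, -1.3])
lam, alpha = 0.1, 0.9
f_smooth = lambda w: (0.5 * np.mean((X @ w - y) ** 2)
                      + lam * (1 - alpha) / 2 * (w @ w))
grad_smooth = lambda w: X.T @ (X @ w - y) / len(y) + lam * (1 - alpha) * w
w = np.zeros(2)
for _ in range(200):
    w, _ = backtracking_prox_step(w, grad_smooth, f_smooth, lam, alpha)
```

Each accepted step is guaranteed not to increase the full objective, which is why a large initial step size is safe to try first.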
At first sight, this seems to perform repeated iterations within a single step, but it actually greatly increases the computation speed: a larger step size can be used, which minimizes the total number of iterations. A careful choice of max_stepsize can decrease the computation time by more than a factor of 10.

If <em>warmup</em> is <em>True</em>, a series of lambda values, strictly decreasing and ending at the lambda value that the user wants to compute, will be used. A larger lambda gives a very sparse solution, and that sparse solution is used as the initial guess for the next lambda's solution, which speeds up the computation for the next lambda. For larger data sets, this can sometimes accelerate the whole computation and may be faster than computing with only one lambda value.

If <em>use_active_set</em> is <em>True</em>, the active-set method will be used to speed up the computation. Considerable speedup is obtained by organizing the iterations around the active set of features, that is, those with nonzero coefficients. After a complete cycle through all the variables, we iterate on only the active set until convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated.

<b>(2) IGD</b>

Incremental Gradient Descent (IGD), also known as Stochastic Gradient Descent (SGD) [3], has the following optimizer-specific parameters:

    stepsize         - default is 0.01
    threshold        - default is 1e-10. When a coefficient is really
                       small, set it to 0
    warmup           - default is False
    warmup_lambdas   - default is NULL
    warmup_lambda_no - default is 15. How many lambdas are used in
                       warm-up; overridden if warmup_lambdas
                       is not NULL
    warmup_tolerance - default is the same as tolerance. The value
                       of tolerance used during warmup.
    parallel         - default is True. Whether to run the computation on
                       multiple segments.
Due to the stochastic nature of SGD, we may obtain very small but nonzero values for the fitting coefficients. Therefore, <em>threshold</em> is needed at the end of the computation to screen out those tiny values and hard-set them to zero. This is done as follows: (1) multiply each coefficient by the standard deviation of the corresponding feature; (2) compute the average of the absolute values of the re-scaled coefficients; (3) divide each re-scaled coefficient by that average, and if the resulting absolute value is smaller than <em>threshold</em>, set the original coefficient to zero.

SGD is by nature a sequential algorithm; when run in a distributed way, each segment of the data runs its own SGD model, and the models are averaged to obtain the model for each iteration. This averaging might slow down convergence, although it makes it possible to process large data sets on multiple machines. So this algorithm provides the option <em>parallel</em> to let the user choose whether to do parallel computation.

<b>Stopping Criteria</b> Both optimizers compute the average difference between the coefficients of two consecutive iterations, and if the difference is smaller than <em>tolerance</em> or the number of iterations exceeds <em>max_iter</em>, the computation stops.
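The three-step coefficient screening described in the IGD section above can be sketched as follows. `screen_coefficients` is a hypothetical helper for illustration, not the MADlib implementation.

```python
import numpy as np

def screen_coefficients(coef, feature_std, threshold=1e-10):
    """Zero out tiny SGD coefficients: (1) rescale each coefficient by its
    feature's standard deviation, (2) take the mean absolute rescaled
    value, (3) zero out coefficients whose normalized magnitude falls
    below `threshold`."""
    rescaled = coef * feature_std                 # step (1)
    avg = np.mean(np.abs(rescaled))               # step (2)
    if avg == 0.0:
        return np.zeros_like(coef)
    keep = np.abs(rescaled / avg) >= threshold    # step (3)
    return np.where(keep, coef, 0.0)

coef = np.array([0.8, 1e-14, -0.3])
std = np.array([1.0, 2.0, 0.5])
print(screen_coefficients(coef, std, threshold=1e-10))
```

Here the middle coefficient is many orders of magnitude below the average scale, so it is set to zero while the other two survive.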
<b>Online Help</b> The user can read short help messages by using any one of the following:
\code
SELECT madlib.elastic_net_train();
SELECT madlib.elastic_net_train('usage');
SELECT madlib.elastic_net_train('predict');
SELECT madlib.elastic_net_train('gaussian');
SELECT madlib.elastic_net_train('binomial');
SELECT madlib.elastic_net_train('linear');
SELECT madlib.elastic_net_train('fista');
SELECT madlib.elastic_net_train('igd');
\endcode

@input

The <b>training examples</b> are expected to be of the following form:
<pre>{TABLE|VIEW} <em>input_table</em> (
    ...
    <em>independentVariables</em> DOUBLE PRECISION[],
    <em>dependentVariable</em> DOUBLE PRECISION,
    ...
)</pre>

Null values are not expected.

@usage

<b>Pre-run:</b> Usually one gets better results and faster convergence using <em>standardize = True</em>. <b>It is highly recommended to run the <em>elastic_net_train</em> function on a subset of the data with a limited <em>max_iter</em> before applying it to the full data set with a large <em>max_iter</em>. In the pre-run, the user can tweak the parameters to get the best performance and then apply the best set of parameters to the whole data set.</b>

- Get the fitting coefficients for a linear model:

<pre>
SELECT {schema_madlib}.elastic_net_train (
    'tbl_source',      -- Data table
    'tbl_result',      -- Result table
    'col_dep_var',     -- Dependent variable, can be an expression
    'col_ind_var',     -- Independent variable, can be an expression or '*'
    'regress_family',  -- 'gaussian' (or 'linear'). 'binomial'
                       -- (or 'logistic') will be supported
    alpha,             -- Elastic net control parameter, value in [0, 1]
    lambda_value,      -- Regularization parameter, positive
    standardize,       -- Whether to normalize the data. Default: True
    'grouping_col',    -- Columns to group by. Default: NULL
    'optimizer',       -- Name of optimizer. Default: 'fista'
    'optimizer_params',-- Optimizer parameters, comma-delimited. Default: NULL
    'excluded',        -- Column names excluded from '*'. Default: NULL
    max_iter,          -- Maximum number of iterations. Default: 10000
    tolerance          -- Stopping criterion. Default: 1e-6
);
</pre>

If <em>col_ind_var</em> is '*', then all columns of <em>tbl_source</em> will be used as features except those listed in the <em>excluded</em> string. If the dependent variable is a column name, it is automatically excluded from the features. However, if the dependent variable is a valid Postgres expression, the column names inside this expression are not excluded unless explicitly put into the <em>excluded</em> list. So it is a good idea to put all column names involved in the dependent variable expression into the <em>excluded</em> string.

The <em>excluded</em> string is a comma-delimited list of column names to exclude from the features, for example 'col1, col2'. If it is NULL or an empty string '', no column is excluded.

If <em>col_ind_var</em> is a single column name of array type, one can still use <em>excluded</em>. For example, if <em>x</em> is a column name for an array of size 1000, and the user wants to exclude the 100th, 200th and 301st elements of the array, he can set <em>excluded</em> to '100, 200, 301'.

Both <em>col_dep_var</em> and <em>col_ind_var</em> can be valid Postgres expressions. For example, <em>col_dep_var = 'log(y+1)'</em> and <em>col_ind_var = 'array[exp(x[1]), x[2], 1/(1+x[3])]'</em>. In the binomial case, one can set <em>col_dep_var = 'y < 0'</em>, etc.
Output:
<pre> family | features | features_selected | coef_nonzero | coef_all | intercept | log_likelihood | standardize | iteration_run
--------+----------+-------------------+--------------+----------+-----------+----------------+-------------+--------------
...
</pre>

where <em>log_likelihood</em> is just the negative value of the first equation above (up to a constant depending on the data set).

- Get the \b prediction on a data set using a linear model:
<pre>
SELECT madlib.elastic_net_predict(
    '<em>regress_family</em>', -- Response type, 'gaussian' ('linear') or 'binomial' ('logistic')
    <em>coefficients</em>,     -- fitted coefficients
    <em>intercept</em>,        -- fitted intercept
    <em>independent variables</em>
) FROM tbl_data, tbl_train_result;
</pre>
The above function returns a double value for each data point. When predicting with binomial models, the return value is 1 if the predicted result is True, and 0 if it is False.

<b>Or</b>

(1)
<pre>
SELECT madlib.elastic_net_gaussian_predict (
    coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source LIMIT 10;
</pre>

(2)
<pre>
SELECT madlib.elastic_net_binomial_predict (
    coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source LIMIT 10;
</pre>

This returns 10 BOOLEAN values.

(3)
<pre>
SELECT madlib.elastic_net_binomial_prob (
    coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source LIMIT 10;
</pre>

This returns 10 probability values for the True class.

<b>Or</b> the user can use another prediction function, which stores the prediction result in a table. This is useful if the user wants to use elastic net together with the general cross-validation function.
<pre>
SELECT madlib.elastic_net_predict(
    '<em>tbl_train_result</em>',
    '<em>tbl_data</em>',
    '<em>col_id</em>',      -- ID associated with each row
    '<em>tbl_predict</em>'  -- Prediction result
);
</pre>

@examp

-# Prepare an input table/view:
\code
CREATE TABLE en_data (
    ind_var DOUBLE PRECISION[],
    dep_var DOUBLE PRECISION
);
\endcode
-# Populate the input table with some data, which should be well-conditioned, e.g.:
\code
mydb=# INSERT INTO en_data VALUES ('{1, 1}', 0.89);
mydb=# INSERT INTO en_data VALUES ('{0.67, -0.06}', 0.3);
...
mydb=# INSERT INTO en_data VALUES ('{0.15, -1.3}', -1.3);
\endcode
-# Learn the coefficients, e.g.:
\code
mydb=# SELECT madlib.elastic_net_train('en_data', 'en_model', 'dep_var', 'ind_var',
                                       'linear', 0.5, 0.1, True, NULL, 'igd',
                                       'stepsize = 0.1, warmup = t,
                                        warmup_lambda_no = 3,
                                        warmup_lambdas = [0.4, 0.3, 0.2, 0.1],
                                        parallel = t', NULL, 10000, 1e-6);
\endcode
\code
mydb=# SELECT madlib.elastic_net_predict(family, coef_all, intercept, ind_var)
mydb-# FROM en_data, en_model;
\endcode

@literature

[1] Elastic net regularization. http://en.wikipedia.org/wiki/Elastic_net_regularization

[2] Beck, A. and M. Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sciences 2(1), 183-202.

[3] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1 Regularized Loss Minimization. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

@sa File elastic_net.sql_in documenting the SQL functions.
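For intuition, the arithmetic performed by the prediction functions above can be sketched in plain Python. These are illustrative helpers, not the MADlib C implementations.

```python
import math

def gaussian_predict(coef, intercept, ind_var):
    # Linear model: dot(coef, ind_var) + intercept
    return sum(c * x for c, x in zip(coef, ind_var)) + intercept

def binomial_prob(coef, intercept, ind_var):
    # Probability of the True class under the logistic model
    score = sum(c * x for c, x in zip(coef, ind_var)) + intercept
    return 1.0 / (1.0 + math.exp(-score))

def binomial_predict(coef, intercept, ind_var):
    # True when the True-class probability exceeds 0.5
    return binomial_prob(coef, intercept, ind_var) > 0.5
```

A zero score corresponds to a probability of exactly 0.5, the decision boundary of the binomial classifier.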
*/

------------------------------------------------------------------------

/**
 * @brief Interface for elastic net
 *
 * @param tbl_source       Name of data source table
 * @param tbl_result       Name of the table to store the results
 * @param col_ind_var      Name of independent variable column; the independent variable is an array
 * @param col_dep_var      Name of dependent variable column
 * @param regress_family   Response type (gaussian or binomial)
 * @param alpha            The elastic net parameter, in [0, 1]
 * @param lambda_value     The regularization parameter
 * @param standardize      Whether to normalize the variables (default True)
 * @param grouping_col     List of columns on which to apply grouping
 *                         (currently only a placeholder)
 * @param optimizer        The optimization algorithm, 'fista' or 'igd'. Default is 'fista'
 * @param optimizer_params Parameters of the above optimizer,
 *                         in the format 'arg = value, ...'. Default is NULL
 * @param excluded         Which columns to exclude. Default is NULL
 *                         (applicable only if col_ind_var is '*' or an array column:
 *                         column names as 'col1, col2, ...' if col_ind_var is '*';
 *                         element indices as '1, 2, 3, ...' if col_ind_var is an array column)
 * @param max_iter         Maximum number of iterations to run the algorithm
 *                         (default 10000)
 * @param tolerance        Iteration stopping criterion.
Default is 1e-6
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_dep_var         TEXT,
    col_ind_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardize         BOOLEAN,
    grouping_col        TEXT,
    optimizer           TEXT,
    optimizer_params    TEXT,
    excluded            TEXT,
    max_iter            INTEGER,
    tolerance           DOUBLE PRECISION
) RETURNS VOID AS $$
PythonFunction(elastic_net, elastic_net, elastic_net_train)
$$ LANGUAGE plpythonu;

------------------------------------------------------------------------
-- Overloaded functions
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardization     BOOLEAN,
    grouping_columns    TEXT,
    optimizer           TEXT,
    optimizer_params    TEXT,
    excluded            TEXT,
    max_iter            INTEGER
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
                                            $9, $10, $11, $12, $13, 1e-6);
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardization     BOOLEAN,
    grouping_columns    TEXT,
    optimizer           TEXT,
    optimizer_params    TEXT,
    excluded            TEXT
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
                                            $9, $10, $11, $12, 10000);
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardization     BOOLEAN,
    grouping_columns    TEXT,
    optimizer           TEXT,
    optimizer_params    TEXT
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
                                            $9, $10, $11, NULL);
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardization     BOOLEAN,
    grouping_columns    TEXT,
    optimizer           TEXT
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
                                            $9, $10, NULL::TEXT);
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardization     BOOLEAN,
    grouping_columns    TEXT
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
                                            $9, 'FISTA');
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION,
    standardization     BOOLEAN
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, $8,
                                            NULL);
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE OR REPLACE
FUNCTION MADLIB_SCHEMA.elastic_net_train (
    tbl_source          TEXT,
    tbl_result          TEXT,
    col_ind_var         TEXT,
    col_dep_var         TEXT,
    regress_family      TEXT,
    alpha               DOUBLE PRECISION,
    lambda_value        DOUBLE PRECISION
) RETURNS VOID AS $$
BEGIN
    PERFORM MADLIB_SCHEMA.elastic_net_train($1, $2, $3, $4, $5, $6, $7, True);
END;
$$ LANGUAGE plpgsql VOLATILE;

------------------------------------------------------------------------

/**
 * @brief Help function, to print out the supported families
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train ()
RETURNS TEXT AS $$
PythonFunction(elastic_net, elastic_net, elastic_net_help)
$$ LANGUAGE plpythonu;

------------------------------------------------------------------------

/**
 * @brief Help function, to print out the supported optimizers for a family,
 *        or to print out the parameter list for an optimizer
 *
 * @param family_or_optimizer Response type, 'gaussian' or 'binomial', or
 *                            optimizer type
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_train (
    family_or_optimizer TEXT
) RETURNS TEXT AS $$
PythonFunction(elastic_net, elastic_net, elastic_net_help)
$$ LANGUAGE plpythonu;

------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------

/**
 * @brief Run prediction and store the result in a table;
 *        can be used together with General-CV
 * @param tbl_model      The result from elastic_net_train
 * @param tbl_new_source Data table
 * @param col_id         Unique ID associated with each row
 * @param tbl_predict    Prediction result
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_predict (
    tbl_model       TEXT,
    tbl_new_source  TEXT,
    col_id          TEXT,
    tbl_predict     TEXT
) RETURNS VOID AS $$
PythonFunction(elastic_net, elastic_net, elastic_net_predict_all)
$$ LANGUAGE plpythonu;

------------------------------------------------------------------------

/**
 * @brief Prediction using learned coefficients for a given example
 *
 * @param regress_family Model family
 * @param coefficients   The fitted coefficients
 * @param intercept      The fitted intercept
 * @param ind_var        Features (independent variables)
 *
 * Returns a double value. When regress_family is 'binomial' or 'logistic',
 * this function returns 1 for True and 0 for False.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_predict (
    regress_family  TEXT,
    coefficients    DOUBLE PRECISION[],
    intercept       DOUBLE PRECISION,
    ind_var         DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS $$
DECLARE
    family_name     TEXT;
    binomial_result BOOLEAN;
BEGIN
    family_name := lower(regress_family);

    IF family_name = 'gaussian' OR family_name = 'linear' THEN
        RETURN MADLIB_SCHEMA.elastic_net_gaussian_predict(coefficients, intercept, ind_var);
    END IF;

    IF family_name = 'binomial' OR family_name = 'logistic' THEN
        binomial_result := MADLIB_SCHEMA.elastic_net_binomial_predict(coefficients, intercept, ind_var);
        IF binomial_result THEN
            RETURN 1;
        ELSE
            RETURN 0;
        END IF;
    END IF;

    RAISE EXCEPTION 'This regression family is not supported!';
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

------------------------------------------------------------------------

/**
 * @brief Prediction for linear models using learned coefficients for a given example
 *
 * @param coefficients Linear fitting coefficients
 * @param intercept    Linear fitting intercept
 * @param
ind_var      Features (independent variables)
 *
 * Returns a double value.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_gaussian_predict (
    coefficients DOUBLE PRECISION[],
    intercept    DOUBLE PRECISION,
    ind_var      DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__elastic_net_gaussian_predict'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------
/**
 * @brief Prediction for logistic models using learned coefficients for a given example
 *
 * @param coefficients Logistic fitting coefficients
 * @param intercept    Logistic fitting intercept
 * @param ind_var      Features (independent variables)
 *
 * Returns a boolean value.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_binomial_predict (
    coefficients DOUBLE PRECISION[],
    intercept    DOUBLE PRECISION,
    ind_var      DOUBLE PRECISION[]
) RETURNS BOOLEAN AS
'MODULE_PATHNAME', '__elastic_net_binomial_predict'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------
/**
 * @brief Compute the probability of belonging to the True class for a given observation
 *
 * @param coefficients Logistic fitting coefficients
 * @param intercept    Logistic fitting intercept
 * @param ind_var      Features (independent variables)
 *
 * Returns a double value: the probability of this data point belonging to the True class.
 */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.elastic_net_binomial_prob (
    coefficients DOUBLE PRECISION[],
    intercept    DOUBLE PRECISION,
    ind_var      DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__elastic_net_binomial_prob'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------
/* Compute
the log-likelihood for one data point */
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__elastic_net_binomial_loglikelihood (
    coefficients DOUBLE PRECISION[],
    intercept    DOUBLE PRECISION,
    dep_var      BOOLEAN,
    ind_var      DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__elastic_net_binomial_loglikelihood'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------
-- Compute the solution for just one step ------------------------------
------------------------------------------------------------------------

CREATE TYPE MADLIB_SCHEMA.__elastic_net_result AS (
    intercept    DOUBLE PRECISION,
    coefficients DOUBLE PRECISION[],
    lambda_value DOUBLE PRECISION
);

------------------------------------------------------------------------

/* IGD */

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_transition (
    state      DOUBLE PRECISION[],
    ind_var    DOUBLE PRECISION[],
    dep_var    DOUBLE PRECISION,
    pre_state  DOUBLE PRECISION[],
    lambda     DOUBLE PRECISION,
    alpha      DOUBLE PRECISION,
    dimension  INTEGER,
    stepsize   DOUBLE PRECISION,
    total_rows INTEGER,
    xmean      DOUBLE PRECISION[],
    ymean      DOUBLE PRECISION,
    step_decay DOUBLE PRECISION
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'gaussian_igd_transition'
LANGUAGE C IMMUTABLE;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_merge (
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_igd_merge'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_igd_final (
    state DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_igd_final'
LANGUAGE C IMMUTABLE STRICT;

/*
 * Perform one iteration step of IGD for linear models
 */
CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_igd_step(
    /* ind_var */    DOUBLE PRECISION[],
    /* dep_var */    DOUBLE PRECISION,
    /* pre_state */  DOUBLE PRECISION[],
    /* lambda */     DOUBLE PRECISION,
    /* alpha */      DOUBLE PRECISION,
    /* dimension */  INTEGER,
    /* stepsize */   DOUBLE PRECISION,
    /* total_rows */ INTEGER,
    /* xmeans */     DOUBLE PRECISION[],
    /* ymean */      DOUBLE PRECISION,
    /* step_decay */ DOUBLE PRECISION
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__gaussian_igd_transition,
    m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_igd_merge,')
    FinalFunc = MADLIB_SCHEMA.__gaussian_igd_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_igd_step_single_seg (
    /* ind_var */    DOUBLE PRECISION[],
    /* dep_var */    DOUBLE PRECISION,
    /* pre_state */  DOUBLE PRECISION[],
    /* lambda */     DOUBLE PRECISION,
    /* alpha */      DOUBLE PRECISION,
    /* dimension */  INTEGER,
    /* stepsize */   DOUBLE PRECISION,
    /* total_rows */ INTEGER,
    /* xmeans */     DOUBLE PRECISION[],
    /* ymean */      DOUBLE PRECISION,
    /* step_decay */ DOUBLE PRECISION
) (
    SType = DOUBLE PRECISION[],
    SFunc = MADLIB_SCHEMA.__gaussian_igd_transition,
    -- m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_igd_merge,')
    FinalFunc = MADLIB_SCHEMA.__gaussian_igd_final,
    InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_igd_state_diff (
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME', '__gaussian_igd_state_diff'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_igd_result (
    in_state
               DOUBLE PRECISION[],
    feature_sq DOUBLE PRECISION[],
    threshold  DOUBLE PRECISION,
    tolerance  DOUBLE PRECISION
) RETURNS MADLIB_SCHEMA.__elastic_net_result AS
'MODULE_PATHNAME', '__gaussian_igd_result'
LANGUAGE C IMMUTABLE STRICT;

------------------------------------------------------------------------

/* FISTA */

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_transition (
    state           DOUBLE PRECISION[],
    ind_var         DOUBLE PRECISION[],
    dep_var         DOUBLE PRECISION,
    pre_state       DOUBLE PRECISION[],
    lambda          DOUBLE PRECISION,
    alpha           DOUBLE PRECISION,
    dimension       INTEGER,
    total_rows      INTEGER,
    max_stepsize    DOUBLE PRECISION,
    eta             DOUBLE PRECISION,
    use_active_set  INTEGER,
    is_active       INTEGER,
    random_stepsize INTEGER
) RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME', 'gaussian_fista_transition'
LANGUAGE C IMMUTABLE;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_merge (
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_fista_merge'
LANGUAGE C IMMUTABLE STRICT;

--

CREATE FUNCTION MADLIB_SCHEMA.__gaussian_fista_final (
    state DOUBLE PRECISION[]
) RETURNS DOUBLE PRECISION[] AS
'MODULE_PATHNAME', 'gaussian_fista_final'
LANGUAGE C IMMUTABLE STRICT;

/*
    Perform one iteration step of FISTA for linear models
*/
CREATE AGGREGATE MADLIB_SCHEMA.__gaussian_fista_step(
    /* ind_var */        DOUBLE PRECISION[],
    /* dep_var */        DOUBLE PRECISION,
    /* pre_state */      DOUBLE PRECISION[],
    /* lambda */         DOUBLE PRECISION,
    /* alpha */          DOUBLE PRECISION,
    /* dimension */      INTEGER,
    /* total_rows */     INTEGER,
    /* max_stepsize */   DOUBLE PRECISION,
    /* eta */            DOUBLE PRECISION,
    /* use_active_set */ INTEGER,
    /* is_active */
INTEGER, 00751 /* random_stepsize */ INTEGER 00752 ) ( 00753 SType = DOUBLE PRECISION[], 00754 SFunc = MADLIB_SCHEMA.__gaussian_fista_transition, 00755 m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__gaussian_fista_merge,') 00756 FinalFunc = MADLIB_SCHEMA.__gaussian_fista_final, 00757 InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}' 00758 ); 00759 00760 -- 00761 00762 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_fista_state_diff ( 00763 state1 DOUBLE PRECISION[], 00764 state2 DOUBLE PRECISION[] 00765 ) RETURNS DOUBLE PRECISION AS 00766 'MODULE_PATHNAME', '__gaussian_fista_state_diff' 00767 LANGUAGE C IMMUTABLE STRICT; 00768 00769 -- 00770 00771 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__gaussian_fista_result ( 00772 in_state DOUBLE PRECISION[] 00773 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS 00774 'MODULE_PATHNAME', '__gaussian_fista_result' 00775 LANGUAGE C IMMUTABLE STRICT; 00776 00777 ------------------------------------------------------------------------ 00778 ------------------------------------------------------------------------ 00779 ------------------------------------------------------------------------ 00780 00781 /* Binomial IGD */ 00782 00783 CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_transition ( 00784 state DOUBLE PRECISION[], 00785 ind_var DOUBLE PRECISION[], 00786 dep_var BOOLEAN, 00787 pre_state DOUBLE PRECISION[], 00788 lambda DOUBLE PRECISION, 00789 alpha DOUBLE PRECISION, 00790 dimension INTEGER, 00791 stepsize DOUBLE PRECISION, 00792 total_rows INTEGER, 00793 xmean DOUBLE PRECISION[], 00794 ymean DOUBLE PRECISION, 00795 step_decay DOUBLE PRECISION 00796 ) RETURNS DOUBLE PRECISION[] 00797 AS 'MODULE_PATHNAME', 'binomial_igd_transition' 00798 LANGUAGE C IMMUTABLE; 00799 00800 -- 00801 00802 CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_merge ( 00803 state1 DOUBLE PRECISION[], 00804 state2 DOUBLE PRECISION[] 00805 ) RETURNS DOUBLE PRECISION[] AS 00806 'MODULE_PATHNAME', 'binomial_igd_merge' 00807 LANGUAGE C 
IMMUTABLE STRICT; 00808 00809 -- 00810 00811 CREATE FUNCTION MADLIB_SCHEMA.__binomial_igd_final ( 00812 state DOUBLE PRECISION[] 00813 ) RETURNS DOUBLE PRECISION[] AS 00814 'MODULE_PATHNAME', 'binomial_igd_final' 00815 LANGUAGE C IMMUTABLE STRICT; 00816 00817 /* 00818 * Perform one iteration step of IGD for linear models 00819 */ 00820 CREATE AGGREGATE MADLIB_SCHEMA.__binomial_igd_step( 00821 /* ind_var */ DOUBLE PRECISION[], 00822 /* dep_var */ BOOLEAN, 00823 /* pre_state */ DOUBLE PRECISION[], 00824 /* lambda */ DOUBLE PRECISION, 00825 /* alpha */ DOUBLE PRECISION, 00826 /* dimension */ INTEGER, 00827 /* stepsize */ DOUBLE PRECISION, 00828 /* total_rows */ INTEGER, 00829 /* xmeans */ DOUBLE PRECISION[], 00830 /* ymean */ DOUBLE PRECISION, 00831 /* step_decay */ DOUBLE PRECISION 00832 ) ( 00833 SType = DOUBLE PRECISION[], 00834 SFunc = MADLIB_SCHEMA.__binomial_igd_transition, 00835 m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_igd_merge,') 00836 FinalFunc = MADLIB_SCHEMA.__binomial_igd_final, 00837 InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}' 00838 ); 00839 00840 CREATE AGGREGATE MADLIB_SCHEMA.__binomial_igd_step_single_seg ( 00841 /* ind_var */ DOUBLE PRECISION[], 00842 /* dep_var */ BOOLEAN, 00843 /* pre_state */ DOUBLE PRECISION[], 00844 /* lambda */ DOUBLE PRECISION, 00845 /* alpha */ DOUBLE PRECISION, 00846 /* dimension */ INTEGER, 00847 /* stepsize */ DOUBLE PRECISION, 00848 /* total_rows */ INTEGER, 00849 /* xmeans */ DOUBLE PRECISION[], 00850 /* ymean */ DOUBLE PRECISION, 00851 /* step_decay */ DOUBLE PRECISION 00852 ) ( 00853 SType = DOUBLE PRECISION[], 00854 SFunc = MADLIB_SCHEMA.__binomial_igd_transition, 00855 -- m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_igd_merge,') 00856 FinalFunc = MADLIB_SCHEMA.__binomial_igd_final, 00857 InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}' 00858 ); 00859 00860 -- 00861 00862 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_igd_state_diff ( 00863 state1 DOUBLE PRECISION[], 00864 state2 
DOUBLE PRECISION[] 00865 ) RETURNS DOUBLE PRECISION AS 00866 'MODULE_PATHNAME', '__binomial_igd_state_diff' 00867 LANGUAGE C IMMUTABLE STRICT; 00868 00869 -- 00870 00871 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_igd_result ( 00872 in_state DOUBLE PRECISION[], 00873 feature_sq DOUBLE PRECISION[], 00874 threshold DOUBLE PRECISION, 00875 tolerance DOUBLE PRECISION 00876 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS 00877 'MODULE_PATHNAME', '__binomial_igd_result' 00878 LANGUAGE C IMMUTABLE STRICT; 00879 00880 ------------------------------------------------------------------------ 00881 00882 /* Binomial FISTA */ 00883 00884 CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_transition ( 00885 state DOUBLE PRECISION[], 00886 ind_var DOUBLE PRECISION[], 00887 dep_var BOOLEAN, 00888 pre_state DOUBLE PRECISION[], 00889 lambda DOUBLE PRECISION, 00890 alpha DOUBLE PRECISION, 00891 dimension INTEGER, 00892 total_rows INTEGER, 00893 max_stepsize DOUBLE PRECISION, 00894 eta DOUBLE PRECISION, 00895 use_active_set INTEGER, 00896 is_active INTEGER, 00897 random_stepsize INTEGER 00898 ) RETURNS DOUBLE PRECISION[] 00899 AS 'MODULE_PATHNAME', 'binomial_fista_transition' 00900 LANGUAGE C IMMUTABLE; 00901 00902 -- 00903 00904 CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_merge ( 00905 state1 DOUBLE PRECISION[], 00906 state2 DOUBLE PRECISION[] 00907 ) RETURNS DOUBLE PRECISION[] AS 00908 'MODULE_PATHNAME', 'binomial_fista_merge' 00909 LANGUAGE C IMMUTABLE STRICT; 00910 00911 -- 00912 00913 CREATE FUNCTION MADLIB_SCHEMA.__binomial_fista_final ( 00914 state DOUBLE PRECISION[] 00915 ) RETURNS DOUBLE PRECISION[] AS 00916 'MODULE_PATHNAME', 'binomial_fista_final' 00917 LANGUAGE C IMMUTABLE STRICT; 00918 00919 /* 00920 Perform one iteration step of FISTA for linear models 00921 */ 00922 CREATE AGGREGATE MADLIB_SCHEMA.__binomial_fista_step( 00923 /* ind_var */ DOUBLE PRECISION[], 00924 /* dep_var */ BOOLEAN, 00925 /* pre_state */ DOUBLE PRECISION[], 00926 /* lambda */ DOUBLE 
PRECISION, 00927 /* alpha */ DOUBLE PRECISION, 00928 /* dimension */ INTEGER, 00929 /* total_rows */ INTEGER, 00930 /* max_stepsize */ DOUBLE PRECISION, 00931 /* eta */ DOUBLE PRECISION, 00932 /* use_active_set */ INTEGER, 00933 /* is_active */ INTEGER, 00934 /* random_stepsize */ INTEGER 00935 ) ( 00936 SType = DOUBLE PRECISION[], 00937 SFunc = MADLIB_SCHEMA.__binomial_fista_transition, 00938 m4_ifdef(`GREENPLUM', `prefunc = MADLIB_SCHEMA.__binomial_fista_merge,') 00939 FinalFunc = MADLIB_SCHEMA.__binomial_fista_final, 00940 InitCond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}' 00941 ); 00942 00943 -- 00944 00945 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_fista_state_diff ( 00946 state1 DOUBLE PRECISION[], 00947 state2 DOUBLE PRECISION[] 00948 ) RETURNS DOUBLE PRECISION AS 00949 'MODULE_PATHNAME', '__binomial_fista_state_diff' 00950 LANGUAGE C IMMUTABLE STRICT; 00951 00952 -- 00953 00954 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__binomial_fista_result ( 00955 in_state DOUBLE PRECISION[] 00956 ) RETURNS MADLIB_SCHEMA.__elastic_net_result AS 00957 'MODULE_PATHNAME', '__binomial_fista_result' 00958 LANGUAGE C IMMUTABLE STRICT; 00959 00960
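/*
 * The step aggregates above are internal building blocks: an outer driver
 * loop calls one of them once per iteration, feeding the state array
 * produced by the previous iteration back in as pre_state, until
 * __*_state_diff falls below the tolerance. The sketch below is
 * illustrative only and not part of this file: the table my_data, its
 * columns x and y, and every literal argument value are hypothetical
 * assumptions, and in MADlib this loop is actually driven from the
 * user-facing elastic net training routine rather than hand-written SQL.
 *
 *   -- One Gaussian FISTA iteration over a hypothetical table
 *   -- my_data(x DOUBLE PRECISION[], y DOUBLE PRECISION), with the
 *   -- previous state held in prev_state(state DOUBLE PRECISION[]):
 *   SELECT MADLIB_SCHEMA.__gaussian_fista_step(
 *              d.x,        -- ind_var
 *              d.y,        -- dep_var
 *              p.state,    -- pre_state: state from the previous iteration
 *              0.1,        -- lambda
 *              0.5,        -- alpha (0 = ridge, 1 = LASSO)
 *              4,          -- dimension: number of features
 *              1000,       -- total_rows
 *              4.0,        -- max_stepsize
 *              2.0,        -- eta: backtracking shrink factor (> 1)
 *              0,          -- use_active_set (0 = off)
 *              0,          -- is_active
 *              0           -- random_stepsize (0 = off)
 *          ) AS state
 *   FROM my_data d, prev_state p;
 *
 * Convergence would then be checked between successive states, e.g.
 *   SELECT MADLIB_SCHEMA.__gaussian_fista_state_diff(old_state, new_state) < 1e-4;
 * and the final coefficients extracted with
 *   SELECT (MADLIB_SCHEMA.__gaussian_fista_result(final_state)).*;
 */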