User Documentation for Apache MADlib 2.1.0

Elastic Net Regularization

This module implements elastic net regularization [1] for linear and logistic regression. Regularization is a technique often used to prevent overfitting.

Training Function
The training function has the following syntax:
elastic_net_train( tbl_source,
                   tbl_result,
                   col_dep_var,
                   col_ind_var,
                   regress_family,
                   alpha,
                   lambda_value,
                   standardize,
                   grouping_col,
                   optimizer,
                   optimizer_params,
                   excluded,
                   max_iter,
                   tolerance
                 )

Arguments

tbl_source

TEXT. The name of the table containing the training data.

tbl_result

TEXT. Name of the output table containing the output model. The output table produced by the elastic_net_train() function has the following columns:

regress_family The regression type: 'gaussian' or 'binomial'.
features Array of features (independent variables) passed to the algorithm.
features_selected Array of features selected by the algorithm.
coef_nonzero Coefficients of the selected features.
coef_all Coefficients of all features, both selected and unselected.
intercept Intercept for the model.
log_likelihood Log of the likelihood value produced by the algorithm.
standardize BOOLEAN. Set to TRUE if the data was normalized before fitting.
iteration_run The number of iterations executed.

col_dep_var

TEXT. An expression for the dependent variable.

Note
Both col_dep_var and col_ind_var can be valid PostgreSQL expressions. For example, col_dep_var = 'log(y+1)' and col_ind_var = 'array[exp(x[1]), x[2], 1/(1+x[3])]'. In the binomial case, you can use a Boolean expression, for example, col_dep_var = 'y < 0'. A sketch combining both kinds of expressions appears after the col_ind_var description below.
col_ind_var

TEXT. An expression for the independent variables. Use '*' to specify all columns of tbl_source except those listed in the excluded string described below. If col_dep_var is a column name, it is automatically excluded from the independent variables. However, if col_dep_var is a valid PostgreSQL expression, any column names used within the expression are only excluded if they are explicitly listed in the excluded argument. Therefore, it is a good idea to add all column names involved in the dependent variable expression to the excluded string.
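
For illustration, here is a hedged sketch that combines an expression-based dependent variable with an array expression of independent variables. The table 'patients' and its columns y, x1, x2, and x3 are hypothetical, and the short call form assumes the documented defaults for the optional arguments:

    SELECT madlib.elastic_net_train( 'patients',                       -- Source table (hypothetical)
                                     'patients_en',                    -- Result table
                                     'y < 0',                          -- Boolean dependent expression
                                     'array[exp(x1), x2, 1/(1+x3)]',   -- Independent variable expression
                                     'binomial',                       -- Regression family
                                     0.5,                              -- Alpha value
                                     0.1                               -- Lambda value
                                   );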

regress_family

TEXT. For regression type, specify either 'gaussian' ('linear') or 'binomial' ('logistic').

alpha

FLOAT8. Elastic net control parameter with a value in the range [0, 1]. A value of 1 means L1 regularization, and a value of 0 means L2 regularization.

lambda_value

FLOAT8. Regularization parameter (must be positive).

standardize (optional)

BOOLEAN, default: TRUE. Whether to normalize the data or not. Setting to TRUE usually yields better results and faster convergence.

grouping_col (optional)

TEXT, default: NULL. A single column or a list of comma-separated columns that divides the input data into discrete groups, resulting in one regression per group. When this value is NULL, no grouping is used and a single model is generated for all data.

Note
Expressions are not currently supported for 'grouping_col'.
optimizer (optional)

TEXT, default: 'fista'. Name of optimizer, either 'fista' or 'igd'. FISTA [2] is an algorithm with a fast global rate of convergence for solving linear inverse problems. Incremental gradient descent (IGD) is a stochastic approach to minimizing an objective function [4].

optimizer_params (optional)

TEXT, default: NULL. Optimizer parameters, delimited with commas. These parameters differ depending on the value of the optimizer parameter. See the descriptions below for details.

excluded (optional)

TEXT, default: NULL. If the col_ind_var input is '*' then excluded can be provided as a comma-delimited list of column names that are to be excluded from the features. For example, 'col1, col2'. If the col_ind_var is an array, excluded must be a list of the integer array positions to exclude, for example '1,2'. If this argument is NULL or an empty string, no columns are excluded.
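
As a hedged sketch using the houses table from the examples below (the result table name houses_en_all is illustrative), the following call trains on all columns except id and zipcode; price, being a bare dependent column name, is excluded automatically:

    SELECT madlib.elastic_net_train( 'houses',        -- Source table
                                     'houses_en_all', -- Result table (illustrative name)
                                     'price',         -- Dependent variable
                                     '*',             -- All remaining columns as features
                                     'gaussian',      -- Regression family
                                     0.5,             -- Alpha value
                                     0.1,             -- Lambda value
                                     TRUE,            -- Standardize
                                     NULL,            -- Grouping column(s)
                                     'fista',         -- Optimizer
                                     '',              -- Optimizer parameters
                                     'id, zipcode',   -- Excluded columns
                                     100,             -- Maximum iterations
                                     1e-6             -- Tolerance value
                                   );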

max_iter (optional)

INTEGER, default: 1000. The maximum number of iterations allowed.

tolerance (optional)
FLOAT8, default: 1e-6. This is the criterion to stop iterating. Both the 'fista' and 'igd' optimizers compute the difference between the log likelihood of two consecutive iterations, and when the difference is smaller than tolerance or the iteration number is larger than max_iter, the computation stops.

Other Parameters

For optimizer_params, several parameters can be supplied in a string containing a comma-delimited list of name-value pairs. All of these named parameters are optional and use the format "<param_name> = <value>".

The parameters described below are organized by category: warmup, cross validation and optimization.

Warmup parameters

  $$
    warmup = <value>,
    warmup_lambdas = <value>,
    warmup_lambda_no = <value>,
    warmup_tolerance = <value>
  $$
warmup

Default: FALSE. If warmup is TRUE, a series of strictly descending lambda values are used, which end with the lambda value that the user wants to calculate. A larger lambda gives a sparser solution, and the sparse solution is then used as the initial guess for the next lambda's solution, which can speed up the computation for the next lambda. For larger data sets, this can sometimes accelerate the whole computation and may in fact be faster than computation with only a single lambda value.

warmup_lambdas

Default: NULL. Set of lambda values to use when warmup is TRUE. The default is NULL, which means that lambda values will be automatically generated.

warmup_lambda_no

Default: 15. Number of lambda values used in warm-up. If warmup_lambdas is not NULL, this value is overridden by the number of provided lambda values.

warmup_tolerance
The value of tolerance used during warmup. The default value is the same as the tolerance argument described above.
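
Putting these together, a hedged sketch of a warmup run on the houses table from the examples below (the result table name is illustrative, and the parameter values are chosen only for demonstration):

    SELECT madlib.elastic_net_train( 'houses', 'houses_en_warm', 'price',
                                     'array[tax, bath, size]',
                                     'gaussian', 0.5, 0.1, TRUE, NULL, 'fista',
                                     $$ warmup = TRUE,
                                        warmup_lambda_no = 20,
                                        warmup_tolerance = 1e-4
                                     $$,
                                     NULL, 100, 1e-6
                                   );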

Cross validation parameters

Note
Please note that for performance reasons, warmup is disabled whenever cross validation is used. Also, cross validation is not supported if grouping is used.
  $$
    n_folds = <value>,
    validation_result = <value>,
    lambda_value = <value>,
    n_lambdas = <value>,
    alpha = <value>
  $$

Hyperparameter optimization can be carried out using the built-in cross validation mechanism, which is activated by assigning a value greater than 1 to the parameter n_folds.

The cross validation scores are the mean and standard deviation of the prediction accuracy on the validation folds, averaged over all folds and all rows. For classification, the accuracy metric is the ratio of correct classifications. For regression, the accuracy metric is the negative of the mean squared error (negated so that a larger score always indicates a better fit, and model selection can simply take the maximum). Cross validation scores are written out to a separate table with the user-specified name given in the 'validation_result' parameter.

The values of a parameter to cross validate should be provided in a list. For example, to regularize with the L1 norm and use a lambda value from the set {0.3, 0.4, 0.5}, include 'lambda_value={0.3, 0.4, 0.5}'. Note that both '{}' and '[]' are valid delimiters here.

n_folds

Default: 0. Number of folds (k). Must be at least 2 to activate cross validation. When cross validation is activated, each fold is used as the validation set once, while the remaining k - 1 folds form the training set.

validation_result

Default: NULL. Name of the table to store the cross validation results, including the values of parameters and their averaged error values. The table is only created if the name is not NULL.

lambda_value

Default: NULL. Set of regularization values to be used for cross validation. The default is NULL, which means that lambda values will be automatically generated.

n_lambdas

Default: 15. Number of lambdas to cross validate over. If a list of lambda values is not provided in the lambda_value set above, this parameter can be used to autogenerate the set of lambdas. If the lambda_value set is not NULL, this value is overridden by the number of provided lambda values.

Note
If you want to cross validate over alpha only and not lambda, then set lambda_value to NULL and n_lambdas to 0. In this case, cross validation will be done on the set of alpha values specified in the next parameter. The lambda value used will be the one specified in the main function call at the top of this page.
alpha
Elastic net control parameter. This is a list of values to apply cross validation on. (Note that alpha values are not autogenerated.) If not specified, the alpha value used will be the one specified in the main function call at the top of this page.
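
As a hedged sketch of the alpha-only case described in the note above, using the houses table from the examples below (the result and score table names are illustrative; lambda_value is simply omitted, which leaves it NULL):

    SELECT madlib.elastic_net_train( 'houses', 'houses_en_cv', 'price',
                                     'array[tax, bath, size]',
                                     'gaussian', 0.5, 0.1, TRUE, NULL, 'fista',
                                     $$ n_folds = 3,
                                        validation_result = houses_en_cv_scores,
                                        n_lambdas = 0,
                                        alpha = {0, 0.5, 1}
                                     $$,
                                     NULL, 200, 1e-6
                                   );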

Optimizer parameters

FISTA Parameters

  $$
    max_stepsize = <value>,
    eta = <value>,
    use_active_set = <value>,
    activeset_tolerance = <value>,
    random_stepsize = <value>
  $$
max_stepsize

Default: 4.0. Initial backtracking step size. At each iteration, the algorithm first tries stepsize = max_stepsize; if that does not work, it tries a smaller step size, stepsize = stepsize/eta, where eta must be larger than 1. At first glance this seems to perform repeated iterations for a single step, but starting from a larger step size actually increases the computation speed and reduces the total number of iterations. A careful choice of max_stepsize can decrease the computation time by a factor of 10 or more.

eta

Default: 2.0. If the current step size does not work, stepsize/eta is tried next. Must be greater than 1.

use_active_set

Default: FALSE. If use_active_set is TRUE, an active-set method is used to speed up the computation. Considerable speedup is obtained by organizing the iterations around the active set of features—those with nonzero coefficients. After a complete cycle through all the variables, we iterate only on the active set until convergence. If another complete cycle does not change the active set, we are done. Otherwise, the process is repeated.

activeset_tolerance

The value of tolerance used during active set calculation. The default value is the same as the tolerance argument described above.

random_stepsize
Default: FALSE. Whether to add some randomness to the step size. Sometimes, this can speed up the calculation.
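
Combining these, a hedged sketch of a FISTA run with the active-set method on the houses table from the examples below (the result table name and parameter values are illustrative):

    SELECT madlib.elastic_net_train( 'houses', 'houses_en_fista', 'price',
                                     'array[tax, bath, size]',
                                     'gaussian', 0.5, 0.1, TRUE, NULL, 'fista',
                                     $$ max_stepsize = 8.0,
                                        eta = 2.0,
                                        use_active_set = TRUE,
                                        activeset_tolerance = 1e-4
                                     $$,
                                     NULL, 100, 1e-6
                                   );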

IGD parameters

  $$
      stepsize = <value>,
      step_decay = <value>,
      threshold = <value>,
      parallel = <value>
  $$
stepsize

Default: 0.01.

step_decay

Default: 0, which means that a constant step size is used in IGD. The actual step size used for the current step is (previous step size) / exp(step_decay).

threshold

Default: 1e-10. When a coefficient is very small, it is set to zero, as described below.

Due to the stochastic nature of SGD, the fitted coefficients are almost never exactly zero; only very small values are obtained. Therefore, threshold is applied at the end of the computation to screen out tiny values and hard-set them to zero. This is accomplished as follows: (1) multiply each coefficient by the standard deviation of the corresponding feature; (2) compute the average of the absolute values of the rescaled coefficients; (3) divide each rescaled coefficient by that average, and if the resulting absolute value is smaller than threshold, set the original coefficient to zero.
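
In symbols, with \(\sigma_i\) the standard deviation of feature \(i\) and \(N\) the number of features, this screening rule can be sketched as:

\[ \tilde{w}_i = w_i \sigma_i\ , \qquad \bar{w} = \frac{1}{N}\sum_{j=1}^N |\tilde{w}_j|\ , \qquad w_i \leftarrow 0 \quad \text{if } \frac{|\tilde{w}_i|}{\bar{w}} < \text{threshold} \]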

parallel

Default: TRUE. Whether to run the computation on multiple segments.

SGD is inherently a sequential algorithm. When run in a distributed manner, each data segment computes its own SGD model, and the models are then averaged after each iteration. This averaging can slow down convergence, but it makes it possible to process large data sets on a cluster of machines. The algorithm therefore provides the parallel option so you can choose whether to run the computation in parallel.
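
A hedged sketch of an IGD run with these parameters on the houses table from the examples below (the result table name and parameter values are illustrative):

    SELECT madlib.elastic_net_train( 'houses', 'houses_en_igd', 'price',
                                     'array[tax, bath, size]',
                                     'gaussian', 0.5, 0.1, TRUE, NULL, 'igd',
                                     $$ stepsize = 0.01,
                                        step_decay = 0.01,
                                        threshold = 1e-10,
                                        parallel = FALSE
                                     $$,
                                     NULL, 1000, 1e-6
                                   );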

Prediction Function

Per-Tuple Prediction

The prediction function returns a double value for the Gaussian family and a Boolean value for the Binomial family.

The predict function has the following syntax (elastic_net_gaussian_predict() and elastic_net_binomial_predict()):

elastic_net_<family>_predict(
                     coefficients,
                     intercept,
                     ind_var
                   )

Arguments

coefficients
DOUBLE PRECISION[]. Fitting coefficients, usually coef_all or coef_nonzero.
intercept
DOUBLE PRECISION. Intercept for the model.
ind_var
DOUBLE PRECISION[]. Independent variables that correspond to the coefficients. Use the features listed in the features column of tbl_result with coef_all, and those in features_selected with coef_nonzero. See the examples for this case below.
Note
Unexpected results or errors may be returned if the ind_var argument is not specified properly.

For the binomial family, there is a function (elastic_net_binomial_prob()) that outputs the probability of the instance being TRUE:

elastic_net_binomial_prob(
                     coefficients,
                     intercept,
                     ind_var
                   )
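
For example, here is a hedged sketch of a binomial workflow using the houses table from the examples below; the dependent expression 'price > 100000' and the result table name houses_bn are illustrative, and the short training call assumes the documented defaults for the optional arguments:

    DROP TABLE IF EXISTS houses_bn, houses_bn_summary;
    SELECT madlib.elastic_net_train( 'houses', 'houses_bn',
                                     'price > 100000',           -- Boolean dependent expression
                                     'array[tax, bath, size]',
                                     'binomial', 0.5, 0.1 );
    SELECT id,
           madlib.elastic_net_binomial_predict(
               m.coef_all, m.intercept, ARRAY[tax, bath, size] ) AS predicted,
           madlib.elastic_net_binomial_prob(
               m.coef_all, m.intercept, ARRAY[tax, bath, size] ) AS prob_true
    FROM houses, houses_bn m
    ORDER BY id;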

Per-Table Prediction

Alternatively, you can use another prediction function that stores the prediction result in a table (elastic_net_predict()). This is useful if you want to use elastic net together with the general cross validation function.

elastic_net_predict( tbl_model,
                     tbl_new_sourcedata,
                     col_id,
                     tbl_predict
                   )

Arguments

tbl_model
TEXT. Name of the table containing the output from the training function.
tbl_new_sourcedata
TEXT. Name of the table containing the new source data.
col_id
TEXT. Unique ID associated with each row.
tbl_predict
TEXT. Name of table to store the prediction result.

You do not need to specify whether the model is "linear" or "logistic" because this information is already included in the tbl_model table.

Examples
  1. Display online help for the elastic_net_train() function:
    SELECT madlib.elastic_net_train();
    
  2. Create an input data set of house prices and features:
    DROP TABLE IF EXISTS houses;
    CREATE TABLE houses ( id INT,
                          tax INT,
                          bedroom INT,
                          bath FLOAT,
                          price INT,
                          size INT,
                          lot INT,
                          zipcode INT);
    INSERT INTO houses (id, tax, bedroom, bath, price, size, lot, zipcode) VALUES
    (1  ,  590 ,       2 ,    1 ,  50000 ,  770 , 22100  , 94301),
    (2  , 1050 ,       3 ,    2 ,  85000 , 1410 , 12000  , 94301),
    (3  ,   20 ,       3 ,    1 ,  22500 , 1060 ,  3500  , 94301),
    (4  ,  870 ,       2 ,    2 ,  90000 , 1300 , 17500  , 94301),
    (5  , 1320 ,       3 ,    2 , 133000 , 1500 , 30000  , 94301),
    (6  , 1350 ,       2 ,    1 ,  90500 ,  820 , 25700  , 94301),
    (7  , 2790 ,       3 ,  2.5 , 260000 , 2130 , 25000  , 94301),
    (8  ,  680 ,       2 ,    1 , 142500 , 1170 , 22000  , 94301),
    (9  , 1840 ,       3 ,    2 , 160000 , 1500 , 19000  , 94301),
    (10 , 3680 ,       4 ,    2 , 240000 , 2790 , 20000  , 94301),
    (11 , 1660 ,       3 ,    1 ,  87000 , 1030 , 17500  , 94301),
    (12 , 1620 ,       3 ,    2 , 118600 , 1250 , 20000  , 94301),
    (13 , 3100 ,       3 ,    2 , 140000 , 1760 , 38000  , 94301),
    (14 , 2070 ,       2 ,    3 , 148000 , 1550 , 14000  , 94301),
    (15 ,  650 ,       3 ,  1.5 ,  65000 , 1450 , 12000  , 94301),
    (16 ,  770 ,       2 ,    2 ,  91000 , 1300 , 17500  , 76010),
    (17 , 1220 ,       3 ,    2 , 132300 , 1500 , 30000  , 76010),
    (18 , 1150 ,       2 ,    1 ,  91100 ,  820 , 25700  , 76010),
    (19 , 2690 ,       3 ,  2.5 , 260011 , 2130 , 25000  , 76010),
    (20 ,  780 ,       2 ,    1 , 141800 , 1170 , 22000  , 76010),
    (21 , 1910 ,       3 ,    2 , 160900 , 1500 , 19000  , 76010),
    (22 , 3600 ,       4 ,    2 , 239000 , 2790 , 20000  , 76010),
    (23 , 1600 ,       3 ,    1 ,  81010 , 1030 , 17500  , 76010),
    (24 , 1590 ,       3 ,    2 , 117910 , 1250 , 20000  , 76010),
    (25 , 3200 ,       3 ,    2 , 141100 , 1760 , 38000  , 76010),
    (26 , 2270 ,       2 ,    3 , 148011 , 1550 , 14000  , 76010),
    (27 ,  750 ,       3 ,  1.5 ,  66000 , 1450 , 12000  , 76010);
    
  3. Train the model:
    DROP TABLE IF EXISTS houses_en, houses_en_summary;
    SELECT madlib.elastic_net_train( 'houses',                  -- Source table
                                     'houses_en',               -- Result table
                                     'price',                   -- Dependent variable
                                     'array[tax, bath, size]',  -- Independent variable
                                     'gaussian',                -- Regression family
                                     0.5,                       -- Alpha value
                                     0.1,                       -- Lambda value
                                     TRUE,                      -- Standardize
                                     NULL,                      -- Grouping column(s)
                                     'fista',                   -- Optimizer
                                     '',                        -- Optimizer parameters
                                     NULL,                      -- Excluded columns
                                     100,                       -- Maximum iterations
                                     1e-6                       -- Tolerance value
                                   );
    
  4. View the resulting model:
    -- Turn on expanded display to make it easier to read results.
    \x on
    SELECT * FROM houses_en;
    
    Result:
    -[ RECORD 1 ]-----+-------------------------------------------
    family            | gaussian
    features          | {tax,bath,size}
    features_selected | {tax,bath,size}
    coef_nonzero      | {22.7898419099,10708.6395642,54.7864827154}
    coef_all          | {22.7898419099,10708.6395642,54.7864827154}
    intercept         | -7793.63839228
    log_likelihood    | -512248647.34
    standardize       | t
    iteration_run     | 100
    
  5. Use the prediction function to evaluate residuals:
    \x off
    SELECT id, price, predict, price - predict AS residual
    FROM (
        SELECT
            houses.*,
            madlib.elastic_net_gaussian_predict(
                m.coef_all,             -- Coefficients
                m.intercept,            -- Intercept
                ARRAY[tax,bath,size]    -- Features (corresponding to coefficients)
                ) AS predict
        FROM houses, houses_en m) s
    ORDER BY id;
    
    Result:
     id | price  |     predict      |     residual
    ----+--------+------------------+-------------------
      1 |  50000 |  58546.599589619 |   -8546.599589619
      2 |  85000 | 114801.915370229 |  -29801.915370229
      3 |  22500 |  61444.469688442 |  -38944.469688442
      4 |  90000 | 104673.230727753 |  -14673.230727753
      5 | 133000 | 125885.956130288 |  7114.04386971201
      6 |  90500 |  78606.203576913 |   11893.796423087
      7 | 260000 | 199256.827630643 |   60743.172369357
      8 | 142500 |   82512.27844767 |    59987.72155233
      9 | 160000 | 137736.673923436 |   22263.326076564
     10 | 240000 | 250344.545740518 |  -10344.545740518
     11 |  87000 |  97176.215939216 |  -10176.215939216
     12 | 118600 | 119026.288024408 | -426.288024408001
     13 | 140000 | 180696.360235914 |  -40696.360235914
     14 | 148000 | 156426.301262683 |   -8426.301262683
     15 |  65000 | 102523.118132785 |  -37523.118132785
     16 |  91000 | 102394.246536763 |  -11394.246536763
     17 | 132300 | 123606.971939298 |    8693.028060702
     18 |  91100 |  74048.235194933 |   17051.764805067
     19 | 260011 | 196977.843439653 |   63033.156560347
     20 | 141800 |   84791.26263866 |    57008.73736134
     21 | 160900 | 139331.962857129 |   21568.037142871
     22 | 239000 | 248521.358387726 | -9521.35838772598
     23 |  81010 |  95808.825424622 |  -14798.825424622
     24 | 117910 | 118342.592767111 | -432.592767110997
     25 | 141100 | 182975.344426904 |  -41875.344426904
     26 | 148011 | 160984.269644663 |  -12973.269644663
     27 |  66000 | 104802.102323775 |  -38802.102323775
    

Example with Grouping

  1. Reuse the houses table above and train the model by grouping on zip code:
    DROP TABLE IF EXISTS houses_en1, houses_en1_summary;
    SELECT madlib.elastic_net_train( 'houses',                  -- Source table
                                     'houses_en1',              -- Result table
                                     'price',                   -- Dependent variable
                                     'array[tax, bath, size]',  -- Independent variable
                                     'gaussian',                -- Regression family
                                     0.5,                       -- Alpha value
                                     0.1,                       -- Lambda value
                                     TRUE,                      -- Standardize
                                     'zipcode',                 -- Grouping column(s)
                                     'fista',                   -- Optimizer
                                     '',                        -- Optimizer parameters
                                     NULL,                      -- Excluded columns
                                     100,                       -- Maximum iterations
                                     1e-6                       -- Tolerance value
                                   );
    
  2. View the resulting model with a separate model for each group:
    -- Turn on expanded display to make it easier to read results.
    \x on
    SELECT * FROM houses_en1;
    
    Result:
    -[ RECORD 1 ]-----+--------------------------------------------
    zipcode           | 94301
    family            | gaussian
    features          | {tax,bath,size}
    features_selected | {tax,bath,size}
    coef_nonzero      | {27.6936321338,11508.8932488,49.0964826846}
    coef_all          | {27.6936321338,11508.8932488,49.0964826846}
    intercept         | -11146.6219839
    log_likelihood    | -520356660.297
    standardize       | t
    iteration_run     | 100
    -[ RECORD 2 ]-----+--------------------------------------------
    zipcode           | 76010
    family            | gaussian
    features          | {tax,bath,size}
    features_selected | {tax,bath,size}
    coef_nonzero      | {14.9934758192,9134.7157512,62.796927836}
    coef_all          | {14.9934758192,9134.7157512,62.796927836}
    intercept         | 27.0655065032
    log_likelihood    | -525632815.441
    standardize       | t
    iteration_run     | 100
    
  3. Use the prediction function to evaluate residuals:
    \x off
    SELECT madlib.elastic_net_predict(
                    'houses_en1',             -- Model table
                    'houses',                 -- New source data table
                    'id',                     -- Unique ID associated with each row
                    'houses_en1_prediction'   -- Table to store prediction result
                  );
    SELECT  houses.id,
            houses.price,
            houses_en1_prediction.prediction,
            houses.price - houses_en1_prediction.prediction AS residual
    FROM houses_en1_prediction, houses
    WHERE houses.id = houses_en1_prediction.id ORDER BY id;
    
    Result:
     id | price  |    prediction    |     residual
    ----+--------+------------------+-------------------
      1 |  50000 |  54505.805890984 | -4505.80589098399
      2 |  85000 | 110175.518839476 |  -25175.518839476
      3 |  22500 |  52958.415553252 |  -30458.415553252
      4 |  90000 |  99790.051960086 | -9790.05196008601
      5 | 133000 | 122071.482957216 |   10928.517042784
      6 |  90500 |  78007.790446902 |   12492.209553098
      7 | 260000 |   199466.3529096 |     60533.6470904
      8 | 142500 |  76636.825856866 |   65863.174143134
      9 | 160000 | 136472.171666792 |   23527.828333208
     10 | 240000 | 250762.917456118 |  -10762.917456118
     11 |  87000 |  96903.077772146 |   -9903.077772146
     12 | 118600 | 118105.451926206 |  494.548073793994
     13 | 140000 | 184131.233653376 |  -44131.233653376
     14 | 148000 | 156805.424440596 | -8805.42444059599
     15 |  65000 |   95307.47866894 |   -30307.47866894
     16 |  91000 | 111477.479576487 | -20477.4795764872
     17 | 132300 | 130783.929262327 |   1516.0707376728
     18 |  91100 | 77897.7592753032 |  13202.2407246968
     19 | 260011 | 196953.761128831 |  63057.2388711688
     20 | 141800 | 94329.0979647992 |  47470.9020352008
     21 | 160900 | 141129.427577575 |  19770.5724224248
     22 | 239000 | 247476.438620463 | -8476.43862046322
     23 |  81010 | 97832.1782395032 | -16822.1782395032
     24 | 117910 | 120632.283356431 |  -2722.2833564312
     25 | 141100 | 176798.212621703 | -35698.2126217032
     26 | 148011 | 158801.641015487 | -10790.6410154872
     27 |  66000 | 116029.791359903 | -50029.7913599032
    

Example where coef_nonzero is different from coef_all

  1. Reuse the houses table above and train the model with alpha=1 (L1) and a large lambda value (30000).
    DROP TABLE IF EXISTS houses_en2, houses_en2_summary;
    SELECT madlib.elastic_net_train( 'houses',                  -- Source table
                                     'houses_en2',              -- Result table
                                     'price',                   -- Dependent variable
                                     'array[tax, bath, size]',  -- Independent variable
                                     'gaussian',                -- Regression family
                                     1,                         -- Alpha value
                                     30000,                     -- Lambda value
                                     TRUE,                      -- Standardize
                                     NULL,                      -- Grouping column(s)
                                     'fista',                   -- Optimizer
                                     '',                        -- Optimizer parameters
                                     NULL,                      -- Excluded columns
                                     1000,                     -- Maximum iterations
                                     1e-6                       -- Tolerance value
                                   );
    
  2. View the resulting model and see coef_nonzero is different from coef_all:
    -- Turn on expanded display to make it easier to read results.
    \x on
    SELECT * FROM houses_en2;
    
    Result:
    -[ RECORD 1 ]-----+--------------------------------
    family            | gaussian
    features          | {tax,bath,size}
    features_selected | {tax,size}
    coef_nonzero      | {6.94502439324,29.7183899065}
    coef_all          | {6.94502439324,0,29.7183899065}
    intercept         | 74442.858956
    log_likelihood    | -1635348583.9
    standardize       | t
    iteration_run     | 297
    
  3. We can still use the prediction function with coef_all to evaluate residuals:
    \x off
    SELECT id, price, predict, price - predict AS residual
    FROM (
        SELECT
            houses.*,
            madlib.elastic_net_gaussian_predict(
                m.coef_all,                   -- All coefficients
                m.intercept,                  -- Intercept
                ARRAY[tax,bath,size]          -- All features
                ) AS predict
        FROM houses, houses_en2 m) s
    ORDER BY id;
    
  4. We can speed up the prediction function with coef_nonzero to evaluate residuals. This requires the user to examine the features_selected column in the result table to construct the correct set of independent variables to provide to the prediction function:
    \x off
    SELECT id, price, predict, price - predict AS residual
    FROM (
        SELECT
            houses.*,
            madlib.elastic_net_gaussian_predict(
                m.coef_nonzero,               -- Non-zero coefficients
                m.intercept,                  -- Intercept
                ARRAY[tax,size]               -- Features corresponding to non-zero coefficients
                ) AS predict
        FROM houses, houses_en2 m) s
    ORDER BY id;
    
    The two queries above will result in the same residuals:
     id | price  |     predict      |     residual
    ----+--------+------------------+-------------------
      1 |  50000 | 101423.583576017 | -51423.5835760166
      2 |  85000 | 123638.064337067 |  -38638.064337067
      3 |  22500 | 106083.252744755 | -83583.2527447548
      4 |  90000 | 119118.937056569 | -29118.9370565688
      5 | 133000 | 128187.876014827 |  4812.12398517321
      6 |  90500 | 108187.721610204 |  -17687.721610204
      7 | 260000 | 157119.647513985 |  102880.352486015
      8 | 142500 | 113935.991734008 |  28564.0082659918
      9 | 160000 | 131799.288699312 |  28200.7113006884
     10 | 240000 | 182914.856562258 |  57085.1434377418
     11 |  87000 | 116581.541052473 | -29581.5410524734
     12 | 118600 | 122841.785856174 |  -4241.7858561738
     13 | 140000 | 148276.800810484 | -8276.80081048398
     14 | 148000 | 134882.563805082 |  13117.4361949182
     15 |  65000 | 122048.790176031 |  -57048.790176031
     16 |  91000 | 118424.434617245 | -27424.4346172448
     17 | 132300 | 127493.373575503 |   4806.6264244972
     18 |  91100 | 106798.716731556 |  -15698.716731556
     19 | 260011 | 156425.145074661 |  103585.854925339
     20 | 141800 | 114630.494173332 |  27169.5058266678
     21 | 160900 | 132285.440406838 |  28614.5595931616
     22 | 239000 | 182359.254610799 |   56640.745389201
     23 |  81010 | 116164.839588879 |  -35154.839588879
     24 | 117910 | 122633.435124377 |  -4723.4351243766
     25 | 141100 | 148971.303249808 |   -7871.303249808
     26 | 148011 |  136271.56868373 |  11739.4313162702
     27 |  66000 | 122743.292615355 |  -56743.292615355
    (27 rows)
    

Example with Cross Validation

  1. Reuse the houses table above. Here we use 3-fold cross validation with 3 automatically generated lambda values and 3 specified alpha values. (This can take some time to run since elastic net is effectively being called 27 times for these combinations, then a 28th time for the whole dataset.)
    DROP TABLE IF EXISTS houses_en3, houses_en3_summary, houses_en3_cv;
    SELECT madlib.elastic_net_train( 'houses',                  -- Source table
                                     'houses_en3',              -- Result table
                                     'price',                   -- Dependent variable
                                     'array[tax, bath, size]',  -- Independent variable
                                     'gaussian',                -- Regression family
                                     0.5,                       -- Alpha value
                                     0.1,                       -- Lambda value
                                     TRUE,                      -- Standardize
                                     NULL,                      -- Grouping column(s)
                                     'fista',                   -- Optimizer
                                     $$ n_folds = 3,            -- Cross validation parameters
                                        validation_result=houses_en3_cv,
                                        n_lambdas = 3,
                                        alpha = {0, 0.1, 1}
                                     $$,
                                     NULL,                      -- Excluded columns
                                     200,                       -- Maximum iterations
                                     1e-6                       -- Tolerance value
                                   );
    \x on
    SELECT * FROM houses_en3;
    
    -[ RECORD 1 ]-----+--------------------------------------------
    family            | gaussian
    features          | {tax,bath,size}
    features_selected | {tax,bath,size}
    coef_nonzero      | {22.4584875792,11657.0840746,52.1622709646}
    coef_all          | {22.4584875792,11657.0840746,52.1622709646}
    intercept         | -5067.2628524
    log_likelihood    | -543193170.15
    standardize       | t
    iteration_run     | 200
    
  2. Details of the cross validation:
    \x off
    SELECT * FROM houses_en3_cv ORDER BY mean_score DESC;
    
     alpha | lambda_value |     mean_score     |   std_dev_score
    -------+--------------+--------------------+--------------------
       0.0 |          0.1 | -36964.1349749     | 7006.24916542
       0.1 |          0.1 | -37033.6458432     | 7094.45992609
       1.0 |        100.0 | -38060.4537864     | 7891.42815774
       1.0 |          0.1 | -38097.4491274     | 7957.27212821
       0.1 |        100.0 | -58955.454086      | 12733.6947097
       0.0 |        100.0 | -59062.3214246     | 12731.3011318
       1.0 |     100000.0 | -60055.6624133     | 12708.5131797
       0.1 |     100000.0 | {large neg number} | {large pos number}
       0.0 |     100000.0 | {large neg number} | {large pos number}
    (9 rows)
    

Note
It is strongly recommended that you run elastic_net_train() on a subset of the data with a limited max_iter before applying it to the full data set with a large max_iter. In the pre-run, you can adjust the parameters to get the best performance and then apply the best set of parameters to the whole data set.

Technical Background

Elastic net regularization seeks to find a weight vector that, for any given training example set, minimizes:

\[\min_{w \in R^N} L(w) + \lambda \left(\frac{(1-\alpha)}{2} \|w\|_2^2 + \alpha \|w\|_1 \right)\]

where \(L\) is the metric function that the user wants to minimize. Here \( \alpha \in [0,1] \) and \( \lambda \geq 0 \). If \( \alpha = 0 \), we have ridge regularization (also known as Tikhonov regularization), and if \( \alpha = 1 \), we have LASSO regularization.

For the Gaussian response family (or linear model), we have

\[ L(\vec{w}) = \frac{1}{2M} \sum_{m=1}^M \left(\vec{w} \cdot \vec{x}_m + w_0 - y_m\right)^2 \]

For the Binomial response family (or logistic model), we have

\[ L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 + \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 + \vec{w}\cdot\vec{x}_m}\right)\right]\ , \]

where \(y_m \in \{0,1\}\).

To get better convergence, one can rescale the value of each element of x

\[ x' \leftarrow \frac{x - \bar{x}}{\sigma_x} \]

and for the Gaussian case we also let

\[y' \leftarrow y - \bar{y} \]

and then minimize with the regularization terms. At the end of the calculation, the original scales are restored, and an intercept term is obtained at the same time as a by-product.
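
As a sketch of this restoration for the Gaussian case, if \(\vec{w}'\) denotes the coefficients fitted on the scaled data, the coefficients and intercept on the original scale can be recovered as:

\[ w_i = \frac{w_i'}{\sigma_{x_i}}\ , \qquad w_0 = \bar{y} - \sum_i w_i \bar{x}_i \]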

Note that fitting after scaling is not equivalent to fitting the original data directly.

Literature

[1] Elastic net regularization, http://en.wikipedia.org/wiki/Elastic_net_regularization

[2] Beck, A. and M. Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sciences 2(1), 183-202.

[3] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for L1 Regularized Loss Minimization. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

[4] Stochastic gradient descent, https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Related Topics

File elastic_net.sql_in documenting the SQL functions.