Elastic Net Regularization
About:

This module implements the elastic net regularization for regression problems.

This method seeks to find a weight vector that, for any given training example set, minimizes:

\[\min_{w \in R^N} L(w) + \lambda \left(\frac{(1-\alpha)}{2} \|w\|_2^2 + \alpha \|w\|_1 \right)\]

where \(L\) is the loss (metric) function that the user wants to minimize. Here \( \alpha \in [0,1] \) and \( \lambda \geq 0 \). If \(\alpha = 0\), we have ridge regularization (also known as Tikhonov regularization), and if \(\alpha = 1\), we have LASSO regularization.
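
For example, at the two endpoints of \(\alpha\) the penalty term reduces to

\[ \lambda\left(\frac{1-\alpha}{2}\|w\|_2^2 + \alpha\|w\|_1\right) = \begin{cases} \frac{\lambda}{2}\|w\|_2^2, & \alpha = 0 \quad \text{(ridge)} \\ \lambda\|w\|_1, & \alpha = 1 \quad \text{(LASSO)} \end{cases} \]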

For the Gaussian response family (or linear model), we have

\[L(\vec{w}) = \frac{1}{2}\left[\frac{1}{M} \sum_{m=1}^M \left(\vec{w} \cdot \vec{x}_m + w_0 - y_m\right)^2 \right] \]

For the Binomial response family (or logistic model), we have

\[ L(\vec{w}) = \sum_{m=1}^M\left[y_m \log\left(1 + e^{-(w_0 + \vec{w}\cdot\vec{x}_m)}\right) + (1-y_m) \log\left(1 + e^{w_0 + \vec{w}\cdot\vec{x}_m}\right)\right]\ , \]

where \(y_m \in \{0,1\}\).

To get better convergence, one can rescale the value of each element of x

\[ x' \leftarrow \frac{x - \bar{x}}{\sigma_x} \]

and for Gaussian case we also let

\[y' \leftarrow y - \bar{y} \]

and then minimize the objective with the regularization terms. At the end of the calculation, the original scales are restored, and an intercept term is obtained at the same time as a by-product.
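
As a sketch of this restoration step in the Gaussian case: if \(w'_j\) denotes the coefficient of \(x'_j\) fitted on the scaled data, then the coefficients and the intercept on the original scale can be recovered as

\[ w_j = \frac{w'_j}{\sigma_{x_j}}\ , \qquad w_0 = \bar{y} - \sum_j w_j \bar{x}_j\ . \]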

Note that fitting after scaling is not equivalent to fitting directly on the unscaled data.

Right now, two optimizers are supported. The default is FISTA; the other is IGD. Each has its own parameters, which can be specified in optimizer_params as a comma-delimited string, for example 'max_stepsize = 0.1, warmup = t, warmup_lambdas = [0.4, 0.3, 0.2]'.

(1) FISTA

Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [2] has the following optimizer-specific parameters:

    - max_stepsize        - default is 4.0
    - eta                 - default is 2. If a step size does not work,
                            stepsize/eta will be tried
    - warmup              - default is False
    - warmup_lambdas      - default is NULL, which means that lambda
                            values will be automatically generated
    - warmup_lambda_no    - default is 15. How many lambdas are used in
                            warm-up; will be overridden if warmup_lambdas
                            is not NULL
    - warmup_tolerance    - default is the same as tolerance. The value
                            of tolerance used during warmup
    - use_active_set      - default is False. Whether to use the active-set
                            method to speed up the computation
    - activeset_tolerance - default is the same as tolerance. The value
                            of tolerance used during the active-set
                            calculation
    - random_stepsize     - default is False. Whether to add some randomness
                            to the step size. Sometimes this can speed
                            up the calculation

Here, backtracking for the step size is used. At each iteration, we first try stepsize = max_stepsize, and if that does not work out, we try a smaller step size, stepsize = stepsize / eta, where eta must be larger than 1. At first sight, this seems to repeat work within a single step, but using a larger step size actually greatly increases the computation speed and minimizes the total number of iterations. A careful choice of max_stepsize can decrease the computation time by more than a factor of 10.
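
In other words, within a single iteration the step sizes tried form the decreasing sequence

\[ \texttt{max\_stepsize},\ \frac{\texttt{max\_stepsize}}{\texttt{eta}},\ \frac{\texttt{max\_stepsize}}{\texttt{eta}^2},\ \dots \]

until one of them is accepted.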

If warmup is True, a series of lambda values, which is strictly decreasing and ends at the lambda value that the user wants to calculate, will be used. A larger lambda gives a very sparse solution, and that sparse solution is in turn used as the initial guess for the next lambda's solution, which speeds up the computation for the next lambda. For larger data sets, this can sometimes accelerate the whole computation and may be faster than computing with only one lambda value.

If use_active_set is True, the active-set method will be used to speed up the computation. Considerable speedup is obtained by organizing the iterations around the active set of features, i.e., those with nonzero coefficients. After a complete cycle through all the variables, we iterate only on the active set until convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated.
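
For illustration only, a FISTA call that enables warmup and the active-set method might look like the following (the table and column names are placeholders):

SELECT madlib.elastic_net_train(
    'tbl_source', 'tbl_result',
    'y', 'x',                  -- dependent and independent variables
    'gaussian', 0.5, 0.1,      -- family, alpha, lambda_value
    True, NULL, 'fista',
    'max_stepsize = 2.0, eta = 2, warmup = t, warmup_lambda_no = 10, use_active_set = t',
    NULL, 10000, 1e-6
);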

(2) IGD

Incremental Gradient Descent (IGD) or Stochastic Gradient Descent (SGD) [3] has the following optimizer-specific parameters:

    - stepsize         - default is 0.01
    - threshold        - default is 1e-10. When a coefficient is really
                         small, set it to 0
    - warmup           - default is False
    - warmup_lambdas   - default is NULL
    - warmup_lambda_no - default is 15. How many lambdas are used in
                         warm-up; will be overridden if warmup_lambdas
                         is not NULL
    - warmup_tolerance - default is the same as tolerance. The value
                         of tolerance used during warmup
    - parallel         - default is True. Whether to run the computation
                         on multiple segments

Due to the stochastic nature of SGD, coefficients that should be exactly zero are only driven to very small values, never to zero itself. Therefore, threshold is needed at the end of the computation to screen out those tiny values and hard-set them to zero. This is done as follows: (1) multiply each coefficient by the standard deviation of the corresponding feature; (2) compute the average of the absolute values of the rescaled coefficients; (3) divide each rescaled coefficient by that average, and if the resulting absolute value is smaller than threshold, set the original coefficient to zero.
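
In formulas, with \(\sigma_j\) the standard deviation of feature \(j\) and \(N\) the number of features, the rule above sets

\[ w_j \leftarrow 0 \qquad \text{if} \qquad \frac{|w_j \sigma_j|}{\frac{1}{N}\sum_{k=1}^N |w_k \sigma_k|} < \texttt{threshold}\ . \]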

SGD is by nature a sequential algorithm. When it runs in a distributed way, each data segment runs its own SGD model, and the models are averaged to obtain a model for each iteration. This averaging may slow down convergence, although it provides the ability to process large data sets on multiple machines. This algorithm therefore provides the option parallel to let the user choose whether to do the computation in parallel.
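
For example, on a small data set one might turn off the model averaging (again with placeholder table and column names):

SELECT madlib.elastic_net_train(
    'tbl_source', 'tbl_result', 'y', 'x',
    'gaussian', 0.5, 0.1, True, NULL, 'igd',
    'stepsize = 0.01, threshold = 1e-10, parallel = f',
    NULL, 10000, 1e-6
);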

Stopping Criteria:

Both optimizers compute the average difference between the coefficients of two consecutive iterations, and if the difference is smaller than tolerance or the iteration number is larger than max_iter, the computation stops.
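
One way to write this criterion: with \(w^{(k)}\) the coefficient vector after iteration \(k\) and \(N\) the number of features, the computation stops when

\[ \frac{1}{N}\sum_{j=1}^{N}\left|w_j^{(k+1)} - w_j^{(k)}\right| < \texttt{tolerance} \qquad \text{or} \qquad k+1 > \texttt{max\_iter}\ . \]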

Online Help:

The user can read short help messages by using any one of the following:

SELECT madlib.elastic_net_train();
SELECT madlib.elastic_net_train('usage');
SELECT madlib.elastic_net_train('predict');
SELECT madlib.elastic_net_train('gaussian');
SELECT madlib.elastic_net_train('binomial');
SELECT madlib.elastic_net_train('linear');
SELECT madlib.elastic_net_train('fista');
SELECT madlib.elastic_net_train('igd');
Input:

The training data is expected to be of the following form:

{TABLE|VIEW} input_table (
    ...
    independentVariables DOUBLE PRECISION[],
    dependentVariable DOUBLE PRECISION,
    ...
)

Null values are not expected.

Usage:

Pre-run: Usually one gets better results and faster convergence using standardize = True. It is highly recommended to run the elastic_net_train function on a subset of the data with a limited max_iter before applying it to the full data set with a large max_iter. In the pre-run, the user can tweak the parameters to get the best performance and then apply the best set of parameters to the whole data set.
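
A minimal sketch of such a pre-run, assuming a source table tbl_source with columns y and x (names are placeholders):

CREATE TABLE tbl_sample AS
    SELECT * FROM tbl_source ORDER BY random() LIMIT 1000;

SELECT madlib.elastic_net_train(
    'tbl_sample', 'tbl_pre_result', 'y', 'x',
    'gaussian', 0.5, 0.1, True, NULL, 'fista',
    NULL,            -- default optimizer parameters
    NULL,            -- no excluded columns
    100,             -- limited max_iter for the pre-run
    1e-4             -- looser tolerance
);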

SELECT {schema_madlib}.elastic_net_train (
    'tbl_source',       -- Data table
    'tbl_result',       -- Result table
    'col_dep_var',      -- Dependent variable, can be an expression
    'col_ind_var',      -- Independent variable, can be an expression or '*'
    'regress_family',   -- 'gaussian' (or 'linear'). 'binomial'
                        --   (or 'logistic') will be supported
    alpha,              -- Elastic net control parameter, value in [0, 1]
    lambda_value,       -- Regularization parameter, positive
    standardize,        -- Whether to normalize the data. Default: True
    'grouping_col',     -- Group by which columns. Default: NULL
    'optimizer',        -- Name of optimizer. Default: 'fista'
    'optimizer_params', -- Optimizer parameters, delimited by comma. Default: NULL
    'excluded',         -- Column names excluded from '*'. Default: NULL
    max_iter,           -- Maximum iteration number. Default: 10000
    tolerance           -- Stopping criteria. Default: 1e-6
);

If col_ind_var is '*', then all columns of tbl_source will be used as features except those listed in the excluded string. If the dependent variable is a column name, it is then automatically excluded from the features. However, if the dependent variable is a valid Postgres expression, then the column names inside this expression are not excluded unless explicitly put into the excluded list. So it is a good idea to put all column names involved in the dependent variable expression into the excluded string.

The excluded string is a list of column names excluded from features delimited by comma. For example, 'col1, col2'. If it is NULL or an empty string '', no column is excluded.

If col_ind_var is a single column name of array type, one can still use excluded. For example, if x is a column holding an array of size 1000, and the user wants to exclude the 100th, 200th, and 301st elements of the array, he can set excluded to '100, 200, 301'.

Both col_dep_var and col_ind_var can be valid Postgres expressions. For example, col_dep_var = 'log(y+1)' and col_ind_var = 'array[exp(x[1]), x[2], 1/(1+x[3])]', etc. In the binomial case, one can set col_dep_var = 'y < 0', etc.
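
For instance, assuming a table with columns id, x1, x2, x3 and y, one could write:

SELECT madlib.elastic_net_train(
    'tbl_source', 'tbl_result',
    'log(y+1)',       -- dependent variable as an expression
    '*',              -- use all columns as features ...
    'gaussian', 0.5, 0.1, True, NULL, 'fista', NULL,
    'id, y',          -- ... except id and y
    10000, 1e-6
);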

Output:

 family | features | features_selected | coef_nonzero | coef_all | intercept | log_likelihood | standardize | iteration_run
--------+----------+-------------------+--------------+----------+-----------+----------------+-------------+---------------
...

where log_likelihood is just the negative value of the first equation above (up to a constant depending on the data set).

Prediction:

The fitted model can then be used for prediction in any of the following ways:

(1)

SELECT madlib.elastic_net_gaussian_predict (
coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source LIMIT 10;

(2)

SELECT madlib.elastic_net_binomial_predict (
coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source LIMIT 10;

This returns 10 BOOLEAN values.

(3)

SELECT madlib.elastic_net_binomial_prob (
coefficients, intercept, ind_var
) FROM tbl_result, tbl_new_source LIMIT 10;

This returns 10 probability values for the True class.

Alternatively, the user can use another prediction function, which stores the prediction result in a table. This is useful if the user wants to use elastic net together with the general cross-validation function.

SELECT madlib.elastic_net_predict(
    'tbl_train_result',
    'tbl_data',
    'col_id',       -- ID associated with each row
    'tbl_predict'   -- Prediction result
);
Examples:
  1. Prepare an input table/view:
    CREATE TABLE en_data (
    ind_var DOUBLE PRECISION[],
    dep_var DOUBLE PRECISION
    );
  2. Populate the input table with some data, which should be well-conditioned, e.g.:
    INSERT INTO en_data VALUES ('{1, 1}', 0.89);
    INSERT INTO en_data VALUES ('{0.67, -0.06}', 0.3);
    ...
    INSERT INTO en_data VALUES ('{0.15, -1.3}', -1.3);
  3. Learn coefficients and predict, e.g.:
    SELECT madlib.elastic_net_train('en_data', 'en_model', 'dep_var', 'ind_var',
    'linear', 0.5, 0.1, True, NULL, 'igd',
    'stepsize = 0.1, warmup = t, warmup_lambda_no = 3,
    warmup_lambdas = [0.4, 0.3, 0.2, 0.1], parallel = t',
    NULL, 10000, 1e-6);
    SELECT madlib.elastic_net_predict(family, coef_all, intercept, ind_var)
    FROM en_data, en_model;
Literature:

[1] Elastic net regularization. http://en.wikipedia.org/wiki/Elastic_net_regularization

[2] Beck, A. and M. Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sciences 2(1), 183-202.

[3] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1 Regularized Loss Minimization. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

See Also
File elastic_net.sql_in documenting the SQL functions.