Cross Validation
About:

Cross-validation, sometimes called rotation estimation, is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

In k-fold cross-validation, the original sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
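As a concrete illustration, one fold of such a split can be expressed directly in SQL. The sketch below is hypothetical (the CV function performs the equivalent work internally) and assumes a table my_data with a randomly assigned integer ID column id; it builds the validation and training parts of fold 3 of a 10-fold split:

-- Fold 3 of a 10-fold split: rows with id % 10 = 3 form the validation set,
-- all remaining rows form the training set.
CREATE TEMP TABLE fold_3_valid AS
    SELECT * FROM my_data WHERE id % 10 = 3;
CREATE TEMP TABLE fold_3_train AS
    SELECT * FROM my_data WHERE id % 10 <> 3;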

Input:

The flexible interface.

The input includes the data set, a training function, a prediction function and an error metric function.

The training function takes in a data set containing both independent and dependent variables and produces a model, which is stored in an output table.

The prediction function takes in the model generated by the training function and a different data set containing independent variables, and it produces a prediction of the dependent variables based on the model. The prediction is stored in an output table. The prediction function should take the name of a unique ID column of the data table as one of its inputs; otherwise the prediction results cannot be compared with the validation values.

The error metric function takes in the prediction made by the prediction function and compares it with the known values of the dependent variables in the data set that was fed into the prediction function. It computes the error metric defined by the function and stores the result in a table.
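For illustration, here is a minimal sketch of an error metric function with the expected shape. The function name my_mse_error is hypothetical (the example at the end uses MADlib's madlib.mse_error), and it assumes the prediction table has a column named prediction and shares the unique ID column name with the data table:

CREATE OR REPLACE FUNCTION my_mse_error(
    pred_tbl  varchar,   -- table written by the prediction function
    data_tbl  varchar,   -- table that was fed into the prediction function
    id_col    varchar,   -- unique ID column shared by both tables
    dep_col   varchar,   -- column holding the known dependent values
    error_tbl varchar    -- output table for the error value
) RETURNS VOID AS $$
BEGIN
    -- Join predictions to known values on the unique ID and store the
    -- mean squared error in the output table.
    EXECUTE 'CREATE TABLE ' || error_tbl || ' AS
             SELECT avg((p.prediction - d.' || dep_col || ')^2) AS mse
             FROM ' || pred_tbl || ' p
             JOIN ' || data_tbl || ' d ON p.' || id_col || ' = d.' || id_col;
END;
$$ LANGUAGE plpgsql;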

Other inputs include the output table name, the k value for the k-fold cross-validation, and how many folds the user wants to run (for example, the user can choose to run a simple validation instead of a full cross-validation).

Usage:

The flexible interface.

In order to choose the optimum value for a parameter of the model, the user needs to provide the training function, the prediction function, the error metric function, the parameter to be explored together with its candidate values, and the data set.

It is better if the data set has a unique ID for each row, so that it is easier to split the data set into the training part and the validation part. The user also needs to tell the cross-validation (CV) function whether this ID value is randomly assigned to each row. If it is not randomly assigned, the CV function will automatically generate a random ID for each row.

If the data set has no unique ID for each row, the CV function will copy the data set and create a randomly assigned ID column for the newly created temp table. The new table is dropped after the computation finishes. To minimize the copying workload, the user needs to provide the names of the data columns (for the independent and dependent variables) that are going to be used in the calculation, and only these columns will be copied.
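The following sketch illustrates the behavior just described. The temp-table name is hypothetical (the actual name is an implementation detail), and val and dep stand in for whatever columns the user lists in data_cols:

-- Copy only the needed columns and attach a randomly assigned unique ID.
CREATE TEMP TABLE __cv_data_copy__ AS
    SELECT row_number() OVER (ORDER BY random()) AS id,  -- random unique ID
           val, dep                                      -- only the columns in data_cols
    FROM my_data;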

SELECT cross_validation_general(
    modelling_func,        -- Name of the function that trains the model
    modelling_params,      -- Array of parameters for the modelling function
    modelling_params_type, -- Types of the parameters for the modelling function
    --
    param_explored,        -- Name of the parameter to be explored for its
                           -- optimum value; the same name must also appear in
                           -- the array of modelling_params
    explore_values,        -- Values of this parameter that will be studied
    --
    predict_func,          -- Name of the function for prediction
    predict_params,        -- Array of parameters for the prediction function
    predict_params_type,   -- Types of the parameters for the prediction function
    --
    metric_func,           -- Name of the function for measuring errors
    metric_params,         -- Array of parameters for the error metric function
    metric_params_type,    -- Types of the parameters for the metric function
    --
    data_tbl,              -- Data table, which will be split into training and
                           -- validation parts
    data_id,               -- Name of the unique ID column associated with each
                           -- row; provide NULL if there is no such column in
                           -- the data table
    id_is_random,          -- Whether the provided ID is randomly assigned to
                           -- each row
    --
    validation_result,     -- Name of the table that stores the output of the
                           -- CV function (see Output for the format); it is
                           -- created automatically by the CV function
    --
    data_cols,             -- Names of the data columns to be used; only needed
                           -- when data_id is NULL, otherwise ignored
    fold_num               -- Value of k, i.e. the number of folds; each
                           -- validation uses a 1/fold_num fraction of the
                           -- data. Default value: 10
);

Special keywords in parameter arrays of modelling, prediction and metric functions:

%data% : The argument position for training/validation data

%model% : The argument position for the output/input of modelling/prediction function

%id% : The argument position of the unique ID column (provided by the user or generated by the CV function, as mentioned above)

%prediction% : The argument position for the output/input of prediction/metric function

%error% : The argument position for the output of metric function

Note: If the parameter explore_values is NULL or has zero length, then the cross-validation function will only run the data folding itself, without exploring any parameter.
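To make the substitution concrete, here is a hypothetical expansion of a prediction call for one fold. The internal table names are made up for illustration; only the substitution pattern itself is documented behavior:

-- With predict_params = '{%model%, %data%, %id%, %prediction%}', the CV
-- function issues, for each fold, a call equivalent to:
SELECT madlib.elastic_net_predict(
    '__cv_model_tbl__',    -- %model%      : model table written by the training run
    '__cv_valid_tbl__',    -- %data%       : validation part of the current fold
    'id',                  -- %id%         : unique ID column
    '__cv_predict_tbl__'   -- %prediction% : output table for the predictions
);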

Output:

 param_explored | average error | standard deviation of error
----------------+---------------+------------------------------
            ... |           ... |                          ...

Note:

max_locks_per_transaction, which usually has a default value of 64, limits the number of tables that can be dropped inside a single transaction (here, the CV function). Thus the number of different values of param_explored (i.e., the length of the array explore_values) cannot be too large. For 10-fold cross-validation, the limit on length(explore_values) is around 40. If this number is too large, the user might see an "out of shared memory" error, because max_locks_per_transaction has been used up.
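The current value of this setting can be checked with a standard PostgreSQL command (raising it in postgresql.conf requires a server restart):

SHOW max_locks_per_transaction;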

One way to work around this limitation is to run the CV function multiple times, with each run covering a different region of the parameter's values.

In the future, MADlib will implement cross-validation functions for each individual applicable module, where the calculation can be optimized to avoid dropping tables and thus escape the max_locks_per_transaction limitation. However, such cross-validation functions need to know the implementation details of the modules to do the optimization, and thus cannot be as flexible as the cross-validation function provided here.

The cross-validation function provided here is very flexible and can work with any algorithm that the user wants to cross-validate, including algorithms written by the user. The price for this flexibility is that the algorithms' details cannot be utilized to optimize the calculation, so the max_locks_per_transaction limitation cannot be avoided.

Examples:

Cross-validation is used with elastic net regression to find the best value of the regularization parameter.

(1) Populate the table 'cvtest' with the 101-dimensional independent variable 'val' and the dependent variable 'dep'.

(2) Run the general CV function

SELECT madlib.cross_validation_general(
    'madlib.elastic_net_train',
    '{%data%, %model%, dep, val, gaussian, 1, lambda, True, Null, fista, "{eta = 2, max_stepsize = 2, use_active_set = t}", Null, 2000, 1e-6}'::varchar[],
    '{varchar, varchar, varchar, varchar, varchar, double precision, double precision, boolean, varchar, varchar, varchar[], varchar, integer, double precision}'::varchar[],
    --
    'lambda',
    '{0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34, 0.36}'::varchar[],
    --
    'madlib.elastic_net_predict',
    '{%model%, %data%, %id%, %prediction%}'::varchar[],
    '{text, text, text, text}'::varchar[],
    --
    'madlib.mse_error', 
    '{%prediction%, %data%, %id%, dep, %error%}'::varchar[],
    '{varchar, varchar, varchar, varchar, varchar}'::varchar[],
    --
    'cvtest',
    NULL::varchar,
    False,
    --
    'valid_rst_tbl',
    '{val, dep}'::varchar[],
    10
);
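After the run finishes, the result table can be inspected to pick the lambda with the smallest average error (the exact column names follow the Output format above):

SELECT * FROM valid_rst_tbl;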
See Also:
File cross_validation.sql_in documenting the SQL functions.