2.1.0
User Documentation for Apache MADlib
Train-Test Split

Train-test split is a utility to create training and testing sets from a single data set.

Train-Test Split
train_test_split(   source_table,
                    output_table,
                    train_proportion,
                    test_proportion,
                    grouping_cols,
                    target_cols,
                    with_replacement,
                    separate_output_tables
                )

Arguments

source_table

TEXT. Name of the table containing the input data.

output_table

Name of output table. A new INTEGER column on the right called 'split' will identify 1 for train set and 0 for test set, unless the 'separate_output_tables' parameter below is TRUE, in which case two output tables will be created using the 'output_table' name with the suffixes '_train' and '_test'. The output table contains all the columns present in the source table unless otherwise specified in the 'target_cols' parameter below.

train_proportion

FLOAT8 in the range (0,1). Proportion of the dataset to include in the train split. If the 'grouping_col' parameter is specified below, each group will be sampled independently using the train proportion, i.e., in a stratified fashion.

test_proportion (optional)

FLOAT8 in the range (0,1). Proportion of the dataset to include in the test split. Default is the complement to the train proportion (1-'train_proportion'). If the 'grouping_col' parameter is specified below, each group will be sampled independently using the train proportion, i.e., in a stratified fashion.

grouping_cols (optional)

TEXT, default: NULL. A single column or a list of comma-separated columns that defines how to stratify. When this parameter is NULL, the train-test split is not stratified.

target_cols (optional)

TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'. If NULL or '*', all columns from the 'source_table' will appear in the 'output_table'.

Note
Do not include 'grouping_cols' in the parameter 'target_cols', because they are always included in the 'output_table'.
with_replacement (optional)

BOOLEAN, default FALSE. Determines whether to sample with replacement or without replacement (default). With replacement means that it is possible that the same row may appear in the sample set more than once. Without replacement means a given row can be selected only once.

separate_output_tables (optional)
BOOLEAN, default FALSE. If TRUE, two output tables will be created using the 'output_table' name with the suffixes '_train' and '_test'.

Examples

Please note that due to the random nature of sampling, your results may look different from those below.

  1. Create an input table:
    DROP TABLE IF EXISTS test;
    CREATE TABLE test(
        id1 INTEGER,
        id2 INTEGER,
        gr1 INTEGER,
        gr2 INTEGER
    );
    INSERT INTO test VALUES
    (1,0,1,1),
    (2,0,1,1),
    (3,0,1,1),
    (4,0,1,1),
    (5,0,1,1),
    (6,0,1,1),
    (7,0,1,1),
    (8,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (0,1,1,2),
    (0,2,1,2),
    (0,3,1,2),
    (0,4,1,2),
    (0,5,1,2),
    (0,6,1,2),
    (10,10,2,2),
    (20,20,2,2),
    (30,30,2,2),
    (40,40,2,2),
    (50,50,2,2),
    (60,60,2,2),
    (70,70,2,2);
    
  2. Sample without replacement:
    DROP TABLE IF EXISTS out;
    SELECT madlib.train_test_split(
                                    'test',    -- Source table
                                    'out',     -- Output table
                                    0.5,       -- Sample proportion
                                    0.5,       -- Sample proportion
                                    'gr1,gr2', -- Strata definition
                                    'id1,id2', -- Columns to output
                                    FALSE,     -- Sample without replacement
                                    FALSE);    -- Do not separate output tables
    SELECT * FROM out ORDER BY split,gr1,gr2,id1,id2;
    
     gr1 | gr2 | id1 | id2 | split
    -----+-----+-----+-----+-------
       1 |   1 |   1 |   0 |     0
       1 |   1 |   4 |   0 |     0
       1 |   1 |   6 |   0 |     0
       1 |   1 |   9 |   0 |     0
       1 |   1 |   9 |   0 |     0
       1 |   1 |   9 |   0 |     0
       1 |   2 |   0 |   3 |     0
       1 |   2 |   0 |   4 |     0
       1 |   2 |   0 |   5 |     0
       2 |   2 |  10 |  10 |     0
       2 |   2 |  30 |  30 |     0
       2 |   2 |  40 |  40 |     0
       2 |   2 |  60 |  60 |     0
       1 |   1 |   2 |   0 |     1
       1 |   1 |   3 |   0 |     1
       1 |   1 |   5 |   0 |     1
       1 |   1 |   7 |   0 |     1
       1 |   1 |   8 |   0 |     1
       1 |   1 |   9 |   0 |     1
       1 |   2 |   0 |   1 |     1
       1 |   2 |   0 |   2 |     1
       1 |   2 |   0 |   6 |     1
       2 |   2 |  20 |  20 |     1
       2 |   2 |  50 |  50 |     1
       2 |   2 |  70 |  70 |     1
    (25 rows)
    
  3. Sample with replacement and create separate train and test tables:
    DROP TABLE IF EXISTS out, out_train, out_test;
    SELECT madlib.train_test_split(
                                    'test',    -- Source table
                                    'out',     -- Output table
                                    0.5,       -- train_proportion
                                    NULL,      -- Default = 1 - train_proportion = 0.5
                                    'gr1,gr2', -- Strata definition
                                    'id1,id2', -- Columns to output
                                    TRUE,      -- Sample with replacement
                                    TRUE);     -- Separate output tables
    SELECT * FROM out_train ORDER BY gr1,gr2,id1,id2;
    
     gr1 | gr2 | id1 | id2
    -----+-----+-----+-----
       1 |   1 |   1 |   0
       1 |   1 |   2 |   0
       1 |   1 |   4 |   0
       1 |   1 |   7 |   0
       1 |   1 |   8 |   0
       1 |   1 |   9 |   0
       1 |   2 |   0 |   4
       1 |   2 |   0 |   5
       1 |   2 |   0 |   6
       2 |   2 |  40 |  40
       2 |   2 |  50 |  50
       2 |   2 |  50 |  50
    (12 rows)
    
    SELECT * FROM out_test ORDER BY gr1,gr2,id1,id2;
    
     gr1 | gr2 | id1 | id2
    ----—+----—+----—+---—
       1 |   1 |   1 |   0
       1 |   1 |   1 |   0
       1 |   1 |   3 |   0
       1 |   1 |   4 |   0
       1 |   1 |   5 |   0
       1 |   1 |   9 |   0
       1 |   2 |   0 |   1
       1 |   2 |   0 |   5
       1 |   2 |   0 |   6
       2 |   2 |  20 |  20
       2 |   2 |  20 |  20
       2 |   2 |  20 |  20
       2 |   2 |  70 |  70
    (13 rows)