2.1.0
User Documentation for Apache MADlib

This preprocessor prepares training data for deep learning.

It packs multiple training examples into the same row for frameworks like Keras and TensorFlow that support mini-batching as an optimization option. The advantage of using mini-batching is that it can perform better than stochastic gradient descent because it uses more than one training example at a time, typically resulting in faster and smoother convergence [1].
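As a rough illustration of what packing means, the sketch below shuffles a list of rows and groups them into buffers of a fixed size in plain Python. It is not MADlib's implementation: the function name pack_rows, the seed argument, and the placeholder row values are all hypothetical.

```python
import random


def pack_rows(rows, buffer_size, seed=None):
    """Shuffle rows, then group them into lists of at most buffer_size rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + buffer_size]
            for i in range(0, len(shuffled), buffer_size)]


# 52 placeholder (image, label) rows, one training example per row.
rows = [("image_%d" % i, "label_%d" % i) for i in range(52)]
buffers = pack_rows(rows, buffer_size=18, seed=0)
print([len(b) for b in buffers])  # [18, 18, 16]
```

Each inner list plays the role of one row of the packed output table, so a framework can read a whole group of examples at once.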

In the case of image processing, images can be represented as an array of numbers where each element represents grayscale, RGB or other channel values for each pixel in the image. It is standard practice to normalize the image data before training. The normalizing constant in this module is parameterized, so it can be set depending on the format of image data used.
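The normalization step can be pictured as an element-wise division. The following is a minimal sketch in plain Python, assuming image data held as nested lists; the helper name normalize is hypothetical and only mirrors the role of the normalizing constant described above.

```python
def normalize(pixels, normalizing_const=1.0):
    """Divide every number in a (possibly nested) list by normalizing_const."""
    if isinstance(pixels, list):
        return [normalize(p, normalizing_const) for p in pixels]
    return pixels / normalizing_const


# One 2x2 RGB image with channel values in the range 0-255.
image = [[[124, 198, 44], [91, 47, 130]],
         [[24, 175, 69], [196, 189, 166]]]

# Dividing by 255 maps every channel value into the range 0.0-1.0.
scaled = normalize(image, 255.0)
```

With the default constant of 1.0 the values are unchanged apart from being converted to floats.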

This preprocessor also sets the distribution rules for the training data. For example, you may only want to train models on segments that reside on hosts that are GPU enabled.

There are two versions of the preprocessor: training_preprocessor_dl() preprocesses input data to be used for training a deep learning model, while validation_preprocessor_dl() preprocesses validation data used for model evaluation.

Preprocessor for Training Data
training_preprocessor_dl(source_table,
                         output_table,
                         dependent_varname,
                         independent_varname,
                         buffer_size,
                         normalizing_const,
                         num_classes,
                         distribution_rules
                        )

Arguments

source_table

TEXT. Name of the table containing training dataset. Can also be a view.

output_table

TEXT. Name of the output table from the training preprocessor which will be used as input to algorithms that support mini-batching. Note that the arrays packed into the output table are shuffled, and normalized by dividing each element in the independent variable array by the optional 'normalizing_const' parameter. For performance reasons, packed arrays are converted to PostgreSQL bytea format, which is a variable-length binary string.

In the case a validation data set is used (see later on this page), this output table is also used as an input to the validation preprocessor so that the validation and training data are both preprocessed in an identical manner.

dependent_varname
TEXT. Name of the dependent variable column. In the case that there are multiple dependent variable columns, representing a multi-output neural network, put the columns as a comma separated list, e.g., 'dep_var1, dep_var2, dep_var3'.
Note
The mini-batch preprocessor automatically 1-hot encodes dependent variables of all types. The exception is numeric array types (integer and float), where we assume these are already 1-hot encoded, so these will just be passed through as is.
independent_varname

TEXT. Name of the independent variable column. The column must be a numeric array type. In the case that there are multiple independent variable columns, representing a multi-input neural network, put the columns as a comma separated list, e.g., 'indep_var1, indep_var2, indep_var3'.

buffer_size (optional)
INTEGER, default: computed. Buffer size is the number of rows from the source table that are packed into one row of the preprocessor output table. In the case of images, the source table will have one image per row, and the output table will have multiple images per row. The default value is computed considering the sizes of the source table and images and the number of segments in the database cluster.
Note
Using the default for 'buffer_size' will produce buffers that are relatively large, which generally results in the fastest fit() runtime with Keras. Setting a smaller buffer size may cause the preprocessor to run faster (although this is not guaranteed, since it depends on database cluster size, data set, and other factors). But since preprocessing is usually a one-time operation and fit() is called many times, by default buffer sizes are optimized to make fit() as fast as possible. Note that specifying a 'buffer_size' does not guarantee that exact value will be used. Actual buffer size is adjusted to avoid data skew, which adversely impacts fit() runtime.
normalizing_const (optional)

REAL, default: 1.0. The normalizing constant to divide each value in the 'independent_varname' array by. For example, you would use 255 for this value if the image data is in the form 0-255.

num_classes (optional)

INTEGER[], default: NULL. Number of class labels of each dependent variable for 1-hot encoding. If NULL, the 1-hot encoded array length will be equal to the number of distinct class values found in the input table.

distribution_rules (optional)

TEXT, default: 'all_segments'. Specifies how to distribute the 'output_table'. This is important for how the fit function will use resources on the cluster. The default 'all_segments' means the 'output_table' will be distributed to all segments in the database cluster.

If you specify 'gpu_segments' then the 'output_table' will be distributed to all segments that are on hosts that have GPUs attached. This will make maximum use of GPU resources when training a deep learning model.

You can also specify the name of a resources table containing the segments to be used for training. This table must contain a column called 'dbid' that specifies the segment id from the 'gp_segment_configuration' table [2]. Refer to the utility function GPU Configuration for more information on how to identify segments attached to hosts that are GPU enabled.
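The 1-hot encoding described under 'dependent_varname' and 'num_classes' can be sketched in plain Python as follows. This is an illustration only, not MADlib's code: the function name one_hot_encode is hypothetical, the sorted class ordering is an assumption inferred from the summary tables in the examples on this page, and raising an error when 'num_classes' is too small is a choice made for this sketch.

```python
def one_hot_encode(labels, num_classes=None):
    """Return (class_values, encoded_rows) for a list of class labels."""
    # Assumption: class values are kept in sorted order.
    class_values = sorted(set(labels))
    if num_classes is not None:
        if num_classes < len(class_values):
            # Sketch-only behavior for a num_classes that is too small.
            raise ValueError("num_classes is smaller than the number of "
                             "distinct class values")
        # Pad with None (NULL in SQL) up to the requested length.
        class_values += [None] * (num_classes - len(class_values))
    index = {v: i for i, v in enumerate(class_values) if v is not None}
    encoded = []
    for label in labels:
        row = [0] * len(class_values)
        row[index[label]] = 1
        encoded.append(row)
    return class_values, encoded


class_values, encoded = one_hot_encode(["dog", "cat", "bird", "dog"])
print(class_values)  # ['bird', 'cat', 'dog']
print(encoded)       # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

padded_values, padded = one_hot_encode(["dog", "cat", "bird"], num_classes=5)
print(padded_values)  # ['bird', 'cat', 'dog', None, None]
```

Padding to a larger 'num_classes' only lengthens each encoded vector; the positions of the existing classes do not move.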

Preprocessor for Validation Data
validation_preprocessor_dl(source_table,
                           output_table,
                           dependent_varname,
                           independent_varname,
                           training_preprocessor_table,
                           buffer_size,
                           distribution_rules
                          )

Arguments

source_table

TEXT. Name of the table containing validation dataset. Can also be a view.

output_table

TEXT. Name of the output table from the validation preprocessor which will be used as input to algorithms that support mini-batching. The arrays packed into the output table are normalized using the same normalizing constant from the training preprocessor as specified in the 'training_preprocessor_table' parameter described below. Validation data is not shuffled. For performance reasons, packed arrays are converted to PostgreSQL bytea format, which is a variable-length binary string.

dependent_varname
TEXT. Name of the dependent variable column.
Note
The mini-batch preprocessor automatically 1-hot encodes dependent variables of all types. The exception is numeric array types (integer and float), where we assume these are already 1-hot encoded, so these will just be passed through as is.
independent_varname

TEXT. Name of the independent variable column. The column must be a numeric array type.

training_preprocessor_table

TEXT. The output table obtained by running training_preprocessor_dl(). Validation data is preprocessed in the same way as training data, i.e., with the same normalizing constant and dependent variable class values. Note that even if the validation dataset is missing some of the class values completely, this parameter ensures that the ordering and labels still match the training dataset.

buffer_size (optional)
INTEGER, default: computed. Buffer size is the number of rows from the source table that are packed into one row of the preprocessor output table. In the case of images, the source table will have one image per row, and the output table will have multiple images per row. The default value is computed considering the sizes of the source table and images and the number of segments in the database cluster.
Note
Using the default for 'buffer_size' will produce buffers that are relatively large, which generally results in the fastest fit() runtime with Keras. Setting a smaller buffer size may cause the preprocessor to run faster (although this is not guaranteed, since it depends on database cluster size, data set, and other factors). But since preprocessing is usually a one-time operation and fit() is called many times, by default buffer sizes are optimized to make fit() as fast as possible. Note that specifying a 'buffer_size' does not guarantee that exact value will be used. Actual buffer size is adjusted to avoid data skew, which adversely impacts fit() runtime.
distribution_rules (optional)

TEXT, default: 'all_segments'. Specifies how to distribute the 'output_table'. This is important for how the fit function will use resources on the cluster. The default 'all_segments' means the 'output_table' will be distributed to all segments in the database cluster.

If you specify 'gpu_segments' then the 'output_table' will be distributed to all segments that are on hosts that have GPUs attached. This will make maximum use of GPU resources when training a deep learning model.

You can also specify the name of a resources table containing the segments to be used for training. This table must contain a column called 'dbid' that specifies the segment id from the 'gp_segment_configuration' table [2]. Refer to the utility function GPU Configuration for more information on how to identify segments attached to hosts that are GPU enabled.
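To see why reusing the training class values matters, consider the hedged sketch below in plain Python. It encodes validation labels against a class list saved from training, so a class that is absent from the validation data still keeps its position. The function name encode_with_training_classes is hypothetical, and rejecting labels that never appeared in training is a choice made for this sketch rather than documented MADlib behavior.

```python
def encode_with_training_classes(labels, training_class_values):
    """1-hot encode labels using the class ordering saved from training."""
    index = {value: i for i, value in enumerate(training_class_values)
             if value is not None}
    encoded = []
    for label in labels:
        if label not in index:
            # Sketch-only behavior: reject labels never seen in training.
            raise ValueError("label %r not in training class values" % (label,))
        row = [0] * len(training_class_values)
        row[index[label]] = 1
        encoded.append(row)
    return encoded


training_class_values = ["bird", "cat", "dog"]
validation_labels = ["dog", "dog", "bird"]  # no 'cat' in validation data
encoded = encode_with_training_classes(validation_labels, training_class_values)
print(encoded)  # [[0, 0, 1], [0, 0, 1], [1, 0, 0]]
```

Had the class list been derived from the validation labels alone, 'dog' would land in position 1 instead of position 2 and the encodings would no longer line up with the trained model.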

Output Tables

The output tables produced by both training_preprocessor_dl() and validation_preprocessor_dl() contain the following columns:
<independent_varname> BYTEA. Packed array of independent variables in PostgreSQL bytea format. Arrays of independent variables packed into the output table are normalized by dividing each element in the independent variable array by the optional 'normalizing_const' parameter. Training data is shuffled, but validation data is not.
<dependent_varname> BYTEA. Packed array of dependent variables in PostgreSQL bytea format. The dependent variable is always stored as a one-hot encoded integer array, because these preprocessors currently assume they will be used only for classification problems using deep learning. A dependent variable that is already a numeric array is assumed to be one-hot encoded and is just cast to an integer array.
<independent_varname>_shape INTEGER[]. Shape of the independent variable array after preprocessing. The first element is the number of images packed per row, and subsequent elements will depend on how the image is described (e.g., channels first or last).
<dependent_varname>_shape INTEGER[]. Shape of the dependent variable array after preprocessing. The first element is the number of images packed per row, and the second element is the number of class values.
buffer_id INTEGER. Unique id for each row in the packed table.
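The two shape columns can be read as the dimensions of a nested array whose first axis is the number of examples in the buffer. A minimal sketch in plain Python, assuming rectangular nested lists (the helper name shape is hypothetical):

```python
def shape(nested):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return dims


# One 2x2 RGB image and its 1-hot encoded label over 3 classes.
image = [[[124, 198, 44], [91, 47, 130]],
         [[24, 175, 69], [196, 189, 166]]]
label = [0, 0, 1]

# A buffer holding 18 such images and labels.
packed_images = [image] * 18
packed_labels = [label] * 18

print(shape(packed_images))  # [18, 2, 2, 3]
print(shape(packed_labels))  # [18, 3]
```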

A summary table named <output_table>_summary is also created, which has the following columns (the columns are the same for both validation_preprocessor_dl() and training_preprocessor_dl()):

source_table Name of the source table.
output_table Name of output table generated by preprocessor.
dependent_varname Dependent variable from the source table.
independent_varname Independent variable from the source table.
dependent_vartype Type of the dependent variable from the source table.
<dependent_varname>_class_values The dependent level values that one-hot encoding maps to for the dependent variable.
buffer_size Buffer size used in preprocessing step.
normalizing_const The value used to normalize the input image data.
num_classes Number of dependent levels the one-hot encoding is created for. NULLs are padded at the end if the number of distinct class levels found in the input data is less than the 'num_classes' parameter specified in training_preprocessor_dl().
distribution_rules This is the list of segment IDs in the form of 'dbid' describing how the 'output_table' is distributed, as per the 'distribution_rules' input parameter. If the 'distribution_rules' parameter is set to 'all_segments', then this will also be set to 'all_segments'.
__internal_gpu_config__ For internal use. (Note: this is the list of segment IDs where data is distributed in the form of 'content' id, which is different from 'dbid' [2].)

Examples
  1. Create an artificial 2x2 resolution color image data set with 3 possible classifications. The RGB values are per-pixel arrays:
    DROP TABLE IF EXISTS image_data;
    CREATE TABLE image_data AS (
        SELECT ARRAY[
            ARRAY[
                ARRAY[floor(random() * 256)::integer, -- pixel (1,1)
                    floor(random() * 256)::integer,
                    floor(random() * 256)::integer],
                ARRAY[floor(random() * 256)::integer, -- pixel (2,1)
                    floor(random() * 256)::integer,
                    floor(random() * 256)::integer]
            ],
            ARRAY[
                ARRAY[floor(random() * 256)::integer, -- pixel (1,2)
                    floor(random() * 256)::integer,
                    floor(random() * 256)::integer],
                ARRAY[floor(random() * 256)::integer, -- pixel (2,2)
                    floor(random() * 256)::integer,
                    floor(random() * 256)::integer]
            ]
        ] as rgb, ('{cat,dog,bird}'::text[])[ceil(random()*3)] as species
        FROM generate_series(1, 52)
    );
    SELECT * FROM image_data;
    
                                 rgb                              | species
    --------------------------------------------------------------+---------
     {{{124,198,44},{91,47,130}},{{24,175,69},{196,189,166}}}     | dog
     {{{111,202,129},{198,249,254}},{{141,37,88},{187,167,113}}}  | dog
     {{{235,53,39},{145,167,209}},{{197,147,222},{55,218,53}}}    | dog
     {{{231,48,125},{248,233,151}},{{63,125,230},{33,24,70}}}     | dog
     {{{92,146,121},{163,241,110}},{{75,88,72},{218,90,12}}}      | bird
     {{{88,114,59},{202,211,152}},{{92,76,58},{77,186,134}}}      | dog
     {{{2,96,255},{14,48,19}},{{240,55,115},{137,255,245}}}       | dog
     {{{165,122,98},{16,115,240}},{{4,106,116},{108,242,210}}}    | dog
     {{{155,207,101},{214,167,24}},{{118,240,228},{199,230,21}}}  | dog
     {{{94,212,15},{48,66,170}},{{255,167,128},{166,191,246}}}    | dog
     {{{169,69,131},{16,98,225}},{{228,113,17},{38,27,17}}}       | bird
     {{{156,183,139},{146,77,46}},{{80,202,230},{146,84,239}}}    | dog
     {{{190,210,147},{227,31,66}},{{229,251,84},{51,118,240}}}    | bird
     {{{253,175,200},{237,151,107}},{{207,56,162},{133,39,35}}}   | cat
     {{{146,185,108},{14,10,105}},{{188,210,86},{83,61,36}}}      | dog
     {{{223,169,177},{3,200,250}},{{112,91,16},{193,32,151}}}     | cat
     {{{249,145,240},{144,153,58}},{{131,156,230},{56,50,75}}}    | dog
     {{{212,186,229},{52,251,197}},{{230,121,201},{35,215,119}}}  | cat
     {{{234,94,23},{114,196,94}},{{242,249,90},{223,24,109}}}     | bird
     {{{111,36,145},{77,135,123}},{{171,158,237},{111,252,222}}}  | dog
     {{{90,74,240},{231,133,95}},{{11,21,173},{146,144,88}}}      | cat
     {{{170,52,237},{13,114,71}},{{87,99,46},{220,194,56}}}       | bird
     {{{8,17,92},{64,2,203}},{{10,131,145},{4,129,30}}}           | cat
     {{{217,218,207},{74,68,186}},{{127,107,76},{38,60,16}}}      | bird
     {{{193,34,83},{203,99,58}},{{251,224,50},{228,118,113}}}     | dog
     {{{146,218,155},{32,159,243}},{{146,218,189},{101,114,25}}}  | bird
     {{{179,160,74},{204,81,246}},{{50,189,39},{60,42,185}}}      | cat
     {{{13,82,174},{198,151,84}},{{65,249,100},{179,234,104}}}    | cat
     {{{162,190,124},{184,66,138}},{{10,240,80},{161,68,145}}}    | dog
     {{{164,144,199},{53,42,111}},{{122,174,128},{220,143,100}}}  | cat
     {{{160,138,104},{177,86,3}},{{104,226,149},{181,16,229}}}    | dog
     {{{246,119,211},{229,249,119}},{{117,192,172},{159,47,38}}}  | cat
     {{{175,1,220},{18,78,124}},{{156,181,45},{242,185,148}}}     | bird
     {{{50,113,246},{101,213,180}},{{56,103,151},{87,169,124}}}   | cat
     {{{73,109,147},{22,81,197}},{{135,71,42},{91,251,98}}}       | bird
     {{{206,61,255},{25,151,211}},{{211,124,7},{206,64,237}}}     | cat
     {{{201,71,34},{182,142,43}},{{198,172,171},{230,1,23}}}      | bird
     {{{142,158,2},{223,45,205}},{{118,177,223},{232,178,141}}}   | cat
     {{{86,190,128},{195,172,14}},{{97,173,237},{142,123,99}}}    | cat
     {{{26,72,148},{79,226,156}},{{96,62,220},{99,9,230}}}        | bird
     {{{154,234,103},{184,18,65}},{{146,225,139},{214,156,10}}}   | cat
     {{{244,169,103},{218,143,2}},{{196,246,186},{214,55,76}}}    | bird
     {{{20,226,7},{96,153,200}},{{130,236,147},{229,38,142}}}     | bird
     {{{172,102,107},{50,11,109}},{{145,9,123},{193,28,107}}}     | bird
     {{{143,243,247},{132,104,137}},{{94,3,169},{253,246,59}}}    | bird
     {{{78,74,228},{51,200,218}},{{170,155,190},{164,18,51}}}     | dog
     {{{163,226,161},{56,182,239}},{{129,154,35},{73,116,205}}}   | bird
     {{{74,243,3},{172,182,149}},{{101,34,163},{111,138,95}}}     | cat
     {{{224,178,126},{4,61,93}},{{174,238,96},{118,232,208}}}     | bird
     {{{55,236,249},{7,189,242}},{{151,173,130},{49,232,5}}}      | bird
     {{{9,16,30},{128,32,85}},{{108,25,91},{41,11,243}}}          | bird
     {{{141,35,191},{146,240,141}},{{207,239,166},{102,194,121}}} | bird
    (52 rows)
    
  2. Run the preprocessor for training image data:
    DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
    SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                            'image_data_packed',  -- Output table
                                            'species',            -- Dependent variable
                                            'rgb',                -- Independent variable
                                            NULL,                 -- Buffer size
                                            255                   -- Normalizing constant
                                            );
    
    For small datasets like in this example, buffer size is mainly determined by the number of segments in the database. For a Greenplum database with 2 segments, there will be 2 rows with a buffer size of 26. For PostgreSQL, there would be only one row with a buffer size of 52 since it is a single-node database. For larger data sets, other factors besides the number of segments go into computing the buffer size. Here is the packed output table of training data for our simple example:
    SELECT rgb_shape, species_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
    
     rgb_shape  | species_shape | buffer_id
    ------------+---------------+-----------
     {18,2,2,3} | {18,3}        |         0
     {18,2,2,3} | {18,3}        |         1
     {16,2,2,3} | {16,3}        |         2
    (3 rows)
    
    Review the output summary table:
    \x on
    SELECT * FROM image_data_packed_summary;
    
    -[ RECORD 1 ]-----------+------------------
    source_table            | image_data
    output_table            | image_data_packed
    dependent_varname       | {species}
    independent_varname     | {rgb}
    dependent_vartype       | {text}
    species_class_values    | {bird,cat,dog}
    buffer_size             | 18
    normalizing_const       | 255
    num_classes             | {3}
    distribution_rules      | all_segments
    __internal_gpu_config__ | all_segments
    
  3. Run the preprocessor for the validation dataset. In this example, we use the same images for validation to demonstrate, but normally validation data is different from the training data:
    DROP TABLE IF EXISTS val_image_data_packed, val_image_data_packed_summary;
    SELECT madlib.validation_preprocessor_dl(
          'image_data',             -- Source table
          'val_image_data_packed',  -- Output table
          'species',                -- Dependent variable
          'rgb',                    -- Independent variable
          'image_data_packed',      -- From training preprocessor step
          NULL                      -- Buffer size
          );
    
    We could choose to use a different buffer size compared to the training_preprocessor_dl() run (but generally don't need to). Other parameters such as num_classes and normalizing_const that were passed to training_preprocessor_dl() are automatically inferred from the image_data_packed table that is passed as 'training_preprocessor_table'. Here is the packed output table of validation data for our simple example:
    SELECT rgb_shape, species_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;
    
     rgb_shape  | species_shape | buffer_id
    ------------+---------------+-----------
     {18,2,2,3} | {18,3}        |         0
     {18,2,2,3} | {18,3}        |         1
     {16,2,2,3} | {16,3}        |         2
    (3 rows)
    
    Review the output summary table:
    \x on
    SELECT * FROM val_image_data_packed_summary;
    
    -[ RECORD 1 ]-----------+----------------------
    source_table            | image_data
    output_table            | val_image_data_packed
    dependent_varname       | {species}
    independent_varname     | {rgb}
    dependent_vartype       | {text}
    species_class_values    | {bird,cat,dog}
    buffer_size             | 18
    normalizing_const       | 255
    num_classes             | {3}
    distribution_rules      | all_segments
    __internal_gpu_config__ | all_segments
    
  4. Load data in another format. Create an artificial 2x2 resolution color image data set with 3 possible classifications. The RGB values are unrolled into a flat array:
    DROP TABLE IF EXISTS image_data;
    CREATE TABLE image_data AS (
    SELECT ARRAY[
            floor(random() * 256)::integer, -- R values
            floor(random() * 256)::integer,
            floor(random() * 256)::integer,
            floor(random() * 256)::integer,
            floor(random() * 256)::integer, -- G values
            floor(random() * 256)::integer,
            floor(random() * 256)::integer,
            floor(random() * 256)::integer,
            floor(random() * 256)::integer, -- B values
            floor(random() * 256)::integer,
            floor(random() * 256)::integer,
            floor(random() * 256)::integer
        ] as rgb, ('{cat,dog,bird}'::text[])[ceil(random()*3)] as species
    FROM generate_series(1, 52)
    );
    SELECT * FROM image_data;
    
                           rgb                        | species
    --------------------------------------------------+---------
     {26,150,191,113,235,57,145,143,44,145,85,25}     | dog
     {240,43,225,15,220,136,186,209,49,130,55,111}    | bird
     {25,191,37,77,193,62,249,228,97,33,81,7}         | cat
     {141,223,46,195,201,19,207,78,160,130,157,89}    | cat
     {39,249,168,164,223,193,99,4,14,37,66,7}         | cat
     {159,250,127,44,151,254,11,211,247,137,79,233}   | cat
     {19,230,76,253,42,175,230,143,184,133,27,215}    | cat
     {199,224,144,5,64,19,200,186,109,218,108,70}     | bird
     {148,136,4,41,185,104,203,253,113,151,166,76}    | bird
     {230,132,114,213,210,139,91,199,240,142,203,75}  | bird
     {166,188,96,217,135,70,93,249,27,47,132,118}     | bird
     {118,120,222,236,110,83,240,47,19,206,222,51}    | bird
     {230,3,26,47,93,144,167,59,123,21,142,107}       | cat
     {250,224,62,136,112,142,88,187,24,1,168,216}     | bird
     {52,144,231,12,76,1,162,11,114,141,69,3}         | cat
     {166,172,246,169,200,102,62,57,239,75,165,88}    | dog
     {151,50,112,227,199,97,47,4,43,123,116,133}      | bird
     {39,185,96,127,80,248,177,191,218,120,32,9}      | dog
     {25,172,34,34,40,109,166,23,60,216,246,54}       | bird
     {163,39,89,170,95,230,137,141,169,82,159,121}    | dog
     {131,143,183,138,151,90,177,240,4,16,214,141}    | dog
     {99,233,100,9,159,140,30,202,29,169,120,62}      | bird
     {99,162,69,10,204,169,219,20,106,170,111,16}     | bird
     {16,246,27,32,187,226,0,75,231,64,94,175}        | bird
     {25,135,244,101,50,4,91,77,36,22,47,37}          | dog
     {22,101,191,197,96,138,78,198,155,138,193,51}    | bird
     {236,22,110,30,181,20,218,21,236,97,91,73}       | dog
     {160,57,34,212,239,197,233,174,164,97,88,153}    | cat
     {226,170,192,123,242,224,190,51,163,192,91,105}  | bird
     {149,174,12,72,112,1,37,153,118,201,79,121}      | bird
     {34,250,232,222,218,221,234,201,138,66,186,58}   | bird
     {162,55,85,159,247,234,77,3,50,189,4,87}         | dog
     {122,32,164,243,0,198,237,232,164,199,197,142}   | dog
     {80,209,75,138,169,236,193,254,140,184,232,217}  | bird
     {112,148,114,137,13,107,105,75,243,218,218,75}   | dog
     {241,76,61,202,76,112,90,51,125,166,52,30}       | bird
     {75,132,239,207,49,224,250,19,238,214,154,169}   | dog
     {203,43,222,58,231,5,243,71,131,67,63,52}        | cat
     {229,12,133,142,179,80,185,145,138,160,149,125}  | bird
     {64,251,61,153,13,100,145,181,8,112,118,107}     | dog
     {128,223,60,248,126,124,243,188,20,0,31,166}     | bird
     {39,22,43,146,138,174,33,65,56,184,155,234}      | dog
     {177,247,133,154,159,37,148,30,81,43,29,92}      | bird
     {56,127,199,118,105,120,109,239,18,12,20,166}    | cat
     {101,209,72,193,207,91,166,27,88,209,203,62}     | dog
     {131,195,122,90,18,178,217,217,40,66,81,149}     | cat
     {203,137,103,17,60,251,152,64,36,81,168,239}     | cat
     {239,97,10,20,194,32,121,129,228,217,11,50}      | dog
     {117,4,193,192,223,176,33,232,196,226,8,61}      | dog
     {162,21,190,223,120,170,245,230,200,170,250,163} | bird
     {32,67,65,195,2,39,198,28,86,35,172,254}         | dog
     {39,19,236,146,87,140,203,121,96,187,62,73}      | dog
    (52 rows)
    
  5. Run the preprocessor for training image data:
    DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
    SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                            'image_data_packed',  -- Output table
                                            'species',            -- Dependent variable
                                            'rgb',                -- Independent variable
                                            NULL,                 -- Buffer size
                                            255                   -- Normalizing constant
                                            );
    
    Here is a sample of the packed output table:
    SELECT rgb_shape, species_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
    
     rgb_shape | species_shape | buffer_id
    -----------+---------------+-----------
     {18,12}   | {18,3}        |         0
     {18,12}   | {18,3}        |         1
     {16,12}   | {16,3}        |         2
    (3 rows)
    
  6. Run the preprocessor for the validation dataset. In this example, we use the same images for validation to demonstrate, but normally validation data is different from the training data:
    DROP TABLE IF EXISTS val_image_data_packed, val_image_data_packed_summary;
    SELECT madlib.validation_preprocessor_dl(
        'image_data',             -- Source table
        'val_image_data_packed',  -- Output table
        'species',                -- Dependent variable
        'rgb',                    -- Independent variable
        'image_data_packed',      -- From training preprocessor step
        NULL                      -- Buffer size
        );
    
    Here is a sample of the packed output table:
    SELECT rgb_shape, species_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;
    
     rgb_shape | species_shape | buffer_id
    -----------+---------------+-----------
     {18,12}   | {18,3}        |         0
     {18,12}   | {18,3}        |         1
     {16,12}   | {16,3}        |         2
    (3 rows)
    
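  7. Generally the default buffer size will work well, but if you have occasion to change it, pass the desired value as 'buffer_size'. Note that the requested size of 10 is adjusted to 9 in the output: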
    DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
    SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                           'image_data_packed',  -- Output table
                                           'species',            -- Dependent variable
                                           'rgb',                -- Independent variable
                                            10,                   -- Buffer size
                                            255                   -- Normalizing constant
                                            );
    SELECT rgb_shape, species_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
    
     rgb_shape | species_shape | buffer_id
    -----------+---------------+-----------
     {9,12}    | {9,3}         |         0
     {9,12}    | {9,3}         |         1
     {9,12}    | {9,3}         |         2
     {9,12}    | {9,3}         |         3
     {9,12}    | {9,3}         |         4
     {7,12}    | {7,3}         |         5
    (6 rows)
    
    Review the output summary table:
    \x on
    SELECT * FROM image_data_packed_summary;
    
    -[ RECORD 1 ]-----------+------------------
    source_table            | image_data
    output_table            | image_data_packed
    dependent_varname       | {species}
    independent_varname     | {rgb}
    dependent_vartype       | {text}
    species_class_values    | {bird,cat,dog}
    buffer_size             | 9
    normalizing_const       | 255
    num_classes             | {3}
    distribution_rules      | all_segments
    __internal_gpu_config__ | all_segments
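The adjustment from a requested buffer size of 10 to an actual size of 9 is consistent with spreading the rows evenly over the smallest number of buffers that the requested size allows. The sketch below in plain Python reproduces the row counts shown above under that assumption. It is inferred from the example output only and is not taken from MADlib's source; the function name adjusted_buffer_sizes is hypothetical.

```python
import math


def adjusted_buffer_sizes(num_rows, requested_buffer_size):
    """Spread num_rows evenly over ceil(num_rows / requested) buffers."""
    num_buffers = math.ceil(num_rows / requested_buffer_size)
    buffer_size = math.ceil(num_rows / num_buffers)
    sizes = [buffer_size] * (num_rows // buffer_size)
    if num_rows % buffer_size:
        sizes.append(num_rows % buffer_size)  # last, partially filled buffer
    return sizes


print(adjusted_buffer_sizes(52, 10))  # [9, 9, 9, 9, 9, 7]
```

Evening out the buffers in this way avoids one very small trailing buffer, which is the kind of skew the 'buffer_size' note warns about.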
    
  8. Run the preprocessor for image data with num_classes set to 5, which is greater than the 3 distinct class values found in the table:
    DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
    SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                            'image_data_packed',  -- Output table
                                            'species',            -- Dependent variable
                                            'rgb',                -- Independent variable
                                            NULL,                 -- Buffer size
                                            255,                  -- Normalizing constant
                                            ARRAY[5]              -- Number of desired class values
                                            );
    
    Here is a sample of the packed output table with the padded 1-hot vector:
    SELECT rgb_shape, species_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;
    
     rgb_shape | species_shape | buffer_id
    -----------+---------------+-----------
     {18,12}   | {18,5}        |         0
     {18,12}   | {18,5}        |         1
     {16,12}   | {16,5}        |         2
    (3 rows)
    
    Review the output summary table:
    \x on
    SELECT * FROM image_data_packed_summary;
    
    -[ RECORD 1 ]-----------+-------------------------
    source_table            | image_data
    output_table            | image_data_packed
    dependent_varname       | {species}
    independent_varname     | {rgb}
    dependent_vartype       | {text}
    species_class_values    | {bird,cat,dog,NULL,NULL}
    buffer_size             | 18
    normalizing_const       | 255
    num_classes             | {5}
    distribution_rules      | all_segments
    __internal_gpu_config__ | all_segments
    
  9. Use distribution rules to specify how to distribute the 'output_table'. This is important for how the fit function will use resources on the cluster. To distribute to all segments on hosts with GPUs attached:
    DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
    SELECT madlib.training_preprocessor_dl('image_data',          -- Source table
                                            'image_data_packed',  -- Output table
                                            'species',            -- Dependent variable
                                            'rgb',                -- Independent variable
                                            NULL,                 -- Buffer size
                                            255,                  -- Normalizing constant
                                            NULL,                 -- Number of classes
                                            'gpu_segments'        -- Distribution rules
                                            );
    \x on
    SELECT * FROM image_data_packed_summary;
    
    -[ RECORD 1 ]-----------+------------------
    source_table            | image_data
    output_table            | image_data_packed
    dependent_varname       | {species}
    independent_varname     | {rgb}
    dependent_vartype       | {text}
    species_class_values    | {bird,cat,dog}
    buffer_size             | 26
    normalizing_const       | 255
    num_classes             | {3}
    distribution_rules      | {2,3,4,5}
    __internal_gpu_config__ | {0,1,2,3}
    
    To distribute to only specified segments, create a distribution table with a column called 'dbid' that lists the segments you want:
    DROP TABLE IF EXISTS segments_to_use;
    CREATE TABLE segments_to_use(
        dbid INTEGER,
        hostname TEXT
    );
    INSERT INTO segments_to_use VALUES
    (2, 'hostname-01'),
    (3, 'hostname-01');
    DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
    SELECT madlib.training_preprocessor_dl('image_data',          -- Source table
                                            'image_data_packed',  -- Output table
                                            'species',            -- Dependent variable
                                            'rgb',                -- Independent variable
                                            NULL,                 -- Buffer size
                                            255,                  -- Normalizing constant
                                            NULL,                 -- Number of classes
                                            'segments_to_use'     -- Distribution rules
                                            );
    \x on
    SELECT * FROM image_data_packed_summary;
    
    -[ RECORD 1 ]-----------+------------------
    source_table            | image_data
    output_table            | image_data_packed
    dependent_varname       | {species}
    independent_varname     | {rgb}
    dependent_vartype       | {text}
    species_class_values    | {bird,cat,dog}
    buffer_size             | 26
    normalizing_const       | 255
    num_classes             | {3}
    distribution_rules      | {2,3}
    __internal_gpu_config__ | {0,1}
    

References

[1] "Neural Networks for Machine Learning", Lectures 6a and 6b on mini-batch gradient descent, Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

[2] Greenplum 'gp_segment_configuration' table https://gpdb.docs.pivotal.io/latest/ref_guide/system_catalogs/gp_segment_configuration.html

Related Topics

training_preprocessor_dl()

validation_preprocessor_dl()

gpu_configuration()