Contents

About: Converts a feature array in a single column into multiple columns. This process can be used to reverse the function cols2vec.

Given a table with a column of type array, this function will create an output table that splits this array into multiple columns, one per array element. It includes the option to name the new feature columns, and to include columns from the original table in the output.

Usage

vec2cols(
    source_table,
    output_table,
    vector_col,
    feature_names,
    cols_to_output
)

Arguments

source_table

TEXT. Name of the table containing the source data..

output_table

TEXT. Name of the generated table containing the output. If a table with the same name already exists, an error will be returned.

vector_col

TEXT. Name of the column containing the feature array. Must be a one-dimensional array.

feature_names (optional)

TEXT[]. Array of names associated with the feature array. Note that this array exists in the summary table created by the function 'cols2vec'. If the 'feature_names' array is not specified, column names will be automatically generated of the form 'f1, f2, ...fn'.

Note: If you specify the 'feature_names' parameter, you will get exactly that number of feature columns in the 'output_table'. It means feature arrays from the 'vector_col' may be padded or truncated, if a particular feature array size does not match the target number of feature columns.

If you do not specify the 'feature names' parameter, the number of feature columns generated in the 'output_table' will be the maximum array size from 'vector_col'. Feature arrays that are less than this maximum will be padded.

cols_to_output (optional)

TEXT, default NULL. Comma-separated string of column names from the source table to keep in the output table, in addition to the feature columns. To keep all columns from the source table, use '*'. The total number of columns in a table cannot exceed the PostgreSQL limits.

Output table
The output table produced by the vec2cols function contains the following columns:

<...>	Columns from source table, depending on which ones are kept (if any).
feature columns	Columns for each of the features in 'vector_col'. Column type will depend on the feature array type in the source table. Column naming will depend on whether the parameter 'feature_names' is used.

Examples

Load sample data:

DROP TABLE IF EXISTS golf CASCADE;
CREATE TABLE golf (
    id integer NOT NULL,
    "OUTLOOK" text,
    temperature double precision,
    humidity double precision,
    "Temp_Humidity" double precision[],
    clouds_airquality text[],
    windy boolean,
    class text,
    observation_weight double precision
);
INSERT INTO golf VALUES
(1,'sunny', 85, 85, ARRAY[85, 85],ARRAY['none', 'unhealthy'], 'false','Don''t Play', 5.0),
(2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['none', 'moderate'], 'true', 'Don''t Play', 5.0),
(3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['low', 'moderate'], 'false', 'Play', 1.5),
(4, 'rain', 70, 96, ARRAY[70, 96], ARRAY['low', 'moderate'], 'false', 'Play', 1.0),
(5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0),
(6, 'rain', 65, 70, ARRAY[65, 70], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0),
(7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5),
(8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['high', 'unhealthy'], 'false', 'Don''t Play', 5.0),
(9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['high', 'good'], 'false', 'Play', 5.0),
(10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0),
(11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['none', 'good'], 'true', 'Play', 5.0),
(12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5),
(13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['medium', 'moderate'], 'false', 'Play', 1.5),
(14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0);

Split the column "clouds_airquality" into new columns called "clouds" and "air_quality". Also keep columns id and "OUTLOOK" from the source table

DROP TABLE IF EXISTS vec2cols_result;
SELECT madlib.vec2cols(
    'golf',                           -- source table
    'vec2cols_result',                -- output table
    'clouds_airquality',              -- column with array entries to split
    ARRAY['clouds', 'air_quality'],   -- feature names
    'id, "OUTLOOK"'                   -- columns to keep from source table
);
SELECT * FROM vec2cols_result ORDER BY id;

 id | OUTLOOK  | clouds | air_quality
----+----------+--------+-------------
  1 | sunny    | none   | unhealthy
  2 | sunny    | none   | moderate
  3 | overcast | low    | moderate
  4 | rain     | low    | moderate
  5 | rain     | medium | good
  6 | rain     | low    | unhealthy
  7 | overcast | medium | moderate
  8 | sunny    | high   | unhealthy
  9 | sunny    | high   | good
 10 | rain     | medium | good
 11 | sunny    | none   | good
 12 | overcast | medium | moderate
 13 | overcast | medium | moderate
 14 | rain     | low    | unhealthy
(14 rows)

Similar to the previous example, except now we keep all columns from source table and do not specify the feature names, so that default names are created.

DROP TABLE IF EXISTS vec2cols_result;
SELECT madlib.vec2cols(
    'golf',                       -- source table
    'vec2cols_result',            -- output table
    'clouds_airquality',          -- column with array entries to split
    NULL,                         -- feature names
    '*'                           -- columns to keep from source table
);
SELECT * FROM vec2cols_result ORDER BY id;

 id | OUTLOOK  | temperature | humidity | Temp_Humidity | clouds_airquality | windy |   class    | observation_weight |   f1   |    f2
----+----------+-------------+----------+---------------+-------------------+-------+------------+--------------------+--------+-----------
  1 | sunny    |          85 |       85 | {85,85}       | {none,unhealthy}  | f     | Don't Play |                  5 | none   | unhealthy
  2 | sunny    |          80 |       90 | {80,90}       | {none,moderate}   | t     | Don't Play |                  5 | none   | moderate
  3 | overcast |          83 |       78 | {83,78}       | {low,moderate}    | f     | Play       |                1.5 | low    | moderate
  4 | rain     |          70 |       96 | {70,96}       | {low,moderate}    | f     | Play       |                  1 | low    | moderate
  5 | rain     |          68 |       80 | {68,80}       | {medium,good}     | f     | Play       |                  1 | medium | good
  6 | rain     |          65 |       70 | {65,70}       | {low,unhealthy}   | t     | Don't Play |                  1 | low    | unhealthy
  7 | overcast |          64 |       65 | {64,65}       | {medium,moderate} | t     | Play       |                1.5 | medium | moderate
  8 | sunny    |          72 |       95 | {72,95}       | {high,unhealthy}  | f     | Don't Play |                  5 | high   | unhealthy
  9 | sunny    |          69 |       70 | {69,70}       | {high,good}       | f     | Play       |                  5 | high   | good
 10 | rain     |          75 |       80 | {75,80}       | {medium,good}     | f     | Play       |                  1 | medium | good
 11 | sunny    |          75 |       70 | {75,70}       | {none,good}       | t     | Play       |                  5 | none   | good
 12 | overcast |          72 |       90 | {72,90}       | {medium,moderate} | t     | Play       |                1.5 | medium | moderate
 13 | overcast |          81 |       75 | {81,75}       | {medium,moderate} | f     | Play       |                1.5 | medium | moderate
 14 | rain     |          71 |       80 | {71,80}       | {low,unhealthy}   | t     | Don't Play |                  1 | low    | unhealthy
(14 rows)

Now let's run cols2vec then reverse it using vec2cols. In this case we will get feature names from the cols2vec summary table. First run cols2vec:

DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary;
SELECT madlib.cols2vec(
    'golf',
    'cols2vec_result',
    'temperature, humidity',
    NULL,
    'id, temperature, humidity'
);
SELECT * FROM cols2vec_result ORDER BY id;

 id | temperature | humidity | feature_vector
----+-------------+----------+----------------
  1 |          85 |       85 | {85,85}
  2 |          80 |       90 | {80,90}
  3 |          83 |       78 | {83,78}
  4 |          70 |       96 | {70,96}
  5 |          68 |       80 | {68,80}
  6 |          65 |       70 | {65,70}
  7 |          64 |       65 | {64,65}
  8 |          72 |       95 | {72,95}
  9 |          69 |       70 | {69,70}
 10 |          75 |       80 | {75,80}
 11 |          75 |       70 | {75,70}
 12 |          72 |       90 | {72,90}
 13 |          81 |       75 | {81,75}
 14 |          71 |       80 | {71,80}
(14 rows)

View the summary table with the feature_names dictionary:

\x on
SELECT * FROM cols2vec_result_summary;
\x off

-[ RECORD 1 ]---------------+-----------------------
source_table                | golf
list_of_features            | temperature, humidity
list_of_features_to_exclude | None
feature_names               | {temperature,humidity}

Now use feature_names from the summary table above to name the columns of the split array:

DROP TABLE IF EXISTS vec2cols_result;
SELECT madlib.vec2cols(
    'cols2vec_result',          -- source table containing the feature vector
    'vec2cols_result',          -- output table
    'feature_vector',           -- column with array entries to split
    (SELECT feature_names from cols2vec_result_summary),    -- feature_names from summary table of cols2vec
    'id'                        -- columns to keep from source table
);
SELECT * FROM vec2cols_result ORDER BY id;

 id | temperature | humidity
----+-------------+----------
  1 |          85 |       85
  2 |          80 |       90
  3 |          83 |       78
  4 |          70 |       96
  5 |          68 |       80
  6 |          65 |       70
  7 |          64 |       65
  8 |          72 |       95
  9 |          69 |       70
 10 |          75 |       80
 11 |          75 |       70
 12 |          72 |       90
 13 |          81 |       75
 14 |          71 |       80
(14 rows)

This is the same as the format of the original 'golf' dataset that we started with.