Given a table with a column of type array, this function will create an output table that splits this array into multiple columns, one per array element. It includes the option to name the new feature columns, and to include columns from the original table in the output.
vec2cols( source_table, output_table, vector_col, feature_names, cols_to_output )
Arguments
TEXT. Name of the table containing the source data..
TEXT. Name of the generated table containing the output. If a table with the same name already exists, an error will be returned.
TEXT. Name of the column containing the feature array. Must be a one-dimensional array.
Output table
The output table produced by the vec2cols function contains the following columns:
<...> | Columns from source table, depending on which ones are kept (if any). |
---|---|
feature columns | Columns for each of the features in 'vector_col'. Column type will depend on the feature array type in the source table. Column naming will depend on whether the parameter 'feature_names' is used. |
DROP TABLE IF EXISTS golf CASCADE; CREATE TABLE golf ( id integer NOT NULL, "OUTLOOK" text, temperature double precision, humidity double precision, "Temp_Humidity" double precision[], clouds_airquality text[], windy boolean, class text, observation_weight double precision ); INSERT INTO golf VALUES (1,'sunny', 85, 85, ARRAY[85, 85],ARRAY['none', 'unhealthy'], 'false','Don''t Play', 5.0), (2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['none', 'moderate'], 'true', 'Don''t Play', 5.0), (3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['low', 'moderate'], 'false', 'Play', 1.5), (4, 'rain', 70, 96, ARRAY[70, 96], ARRAY['low', 'moderate'], 'false', 'Play', 1.0), (5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0), (6, 'rain', 65, 70, ARRAY[65, 70], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0), (7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5), (8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['high', 'unhealthy'], 'false', 'Don''t Play', 5.0), (9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['high', 'good'], 'false', 'Play', 5.0), (10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0), (11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['none', 'good'], 'true', 'Play', 5.0), (12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5), (13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['medium', 'moderate'], 'false', 'Play', 1.5), (14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0);
DROP TABLE IF EXISTS vec2cols_result; SELECT madlib.vec2cols( 'golf', -- source table 'vec2cols_result', -- output table 'clouds_airquality', -- column with array entries to split ARRAY['clouds', 'air_quality'], -- feature names 'id, "OUTLOOK"' -- columns to keep from source table ); SELECT * FROM vec2cols_result ORDER BY id;
id | OUTLOOK | clouds | air_quality ----+----------+--------+------------- 1 | sunny | none | unhealthy 2 | sunny | none | moderate 3 | overcast | low | moderate 4 | rain | low | moderate 5 | rain | medium | good 6 | rain | low | unhealthy 7 | overcast | medium | moderate 8 | sunny | high | unhealthy 9 | sunny | high | good 10 | rain | medium | good 11 | sunny | none | good 12 | overcast | medium | moderate 13 | overcast | medium | moderate 14 | rain | low | unhealthy (14 rows)
DROP TABLE IF EXISTS vec2cols_result; SELECT madlib.vec2cols( 'golf', -- source table 'vec2cols_result', -- output table 'clouds_airquality', -- column with array entries to split NULL, -- feature names '*' -- columns to keep from source table ); SELECT * FROM vec2cols_result ORDER BY id;
id | OUTLOOK | temperature | humidity | Temp_Humidity | clouds_airquality | windy | class | observation_weight | f1 | f2 ----+----------+-------------+----------+---------------+-------------------+-------+------------+--------------------+--------+----------- 1 | sunny | 85 | 85 | {85,85} | {none,unhealthy} | f | Don't Play | 5 | none | unhealthy 2 | sunny | 80 | 90 | {80,90} | {none,moderate} | t | Don't Play | 5 | none | moderate 3 | overcast | 83 | 78 | {83,78} | {low,moderate} | f | Play | 1.5 | low | moderate 4 | rain | 70 | 96 | {70,96} | {low,moderate} | f | Play | 1 | low | moderate 5 | rain | 68 | 80 | {68,80} | {medium,good} | f | Play | 1 | medium | good 6 | rain | 65 | 70 | {65,70} | {low,unhealthy} | t | Don't Play | 1 | low | unhealthy 7 | overcast | 64 | 65 | {64,65} | {medium,moderate} | t | Play | 1.5 | medium | moderate 8 | sunny | 72 | 95 | {72,95} | {high,unhealthy} | f | Don't Play | 5 | high | unhealthy 9 | sunny | 69 | 70 | {69,70} | {high,good} | f | Play | 5 | high | good 10 | rain | 75 | 80 | {75,80} | {medium,good} | f | Play | 1 | medium | good 11 | sunny | 75 | 70 | {75,70} | {none,good} | t | Play | 5 | none | good 12 | overcast | 72 | 90 | {72,90} | {medium,moderate} | t | Play | 1.5 | medium | moderate 13 | overcast | 81 | 75 | {81,75} | {medium,moderate} | f | Play | 1.5 | medium | moderate 14 | rain | 71 | 80 | {71,80} | {low,unhealthy} | t | Don't Play | 1 | low | unhealthy (14 rows)
DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary; SELECT madlib.cols2vec( 'golf', 'cols2vec_result', 'temperature, humidity', NULL, 'id, temperature, humidity' ); SELECT * FROM cols2vec_result ORDER BY id;
id | temperature | humidity | feature_vector ----+-------------+----------+---------------- 1 | 85 | 85 | {85,85} 2 | 80 | 90 | {80,90} 3 | 83 | 78 | {83,78} 4 | 70 | 96 | {70,96} 5 | 68 | 80 | {68,80} 6 | 65 | 70 | {65,70} 7 | 64 | 65 | {64,65} 8 | 72 | 95 | {72,95} 9 | 69 | 70 | {69,70} 10 | 75 | 80 | {75,80} 11 | 75 | 70 | {75,70} 12 | 72 | 90 | {72,90} 13 | 81 | 75 | {81,75} 14 | 71 | 80 | {71,80} (14 rows)View the summary table with the feature_names dictionary:
\x on SELECT * FROM cols2vec_result_summary; \x off
-[ RECORD 1 ]---------------+----------------------- source_table | golf list_of_features | temperature, humidity list_of_features_to_exclude | None feature_names | {temperature,humidity}Now use feature_names from the summary table above to name the columns of the split array:
DROP TABLE IF EXISTS vec2cols_result; SELECT madlib.vec2cols( 'cols2vec_result', -- source table containing the feature vector 'vec2cols_result', -- output table 'feature_vector', -- column with array entries to split (SELECT feature_names from cols2vec_result_summary), -- feature_names from summary table of cols2vec 'id' -- columns to keep from source table ); SELECT * FROM vec2cols_result ORDER BY id;
id | temperature | humidity ----+-------------+---------- 1 | 85 | 85 2 | 80 | 90 3 | 83 | 78 4 | 70 | 96 5 | 68 | 80 6 | 65 | 70 7 | 64 | 65 8 | 72 | 95 9 | 69 | 70 10 | 75 | 80 11 | 75 | 70 12 | 72 | 90 13 | 81 | 75 14 | 71 | 80 (14 rows)This is the same as the format of the original 'golf' dataset that we started with.