2.1.0
User Documentation for Apache MADlib
Create Indicator Variables
Warning
This version of encoding categorical variables has been deprecated. The new module with more capability can be found here Encoding Categorical Variables

Coding systems for categorical variables
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. For example, if you have a variable called race that is coded 1 = Hispanic, 2 = Asian, 3 = Black, 4 = White, then entering race in your regression will look at the linear effect of race, which is probably not what you intended. Instead, categorical variables like this need to be recoded into a series of indicator variables which can then be entered into the regression model. There are a variety of coding systems (also called as contrasts) that can be used when coding categorical variables. including dummy, effects, orthogonal, and helmert coding.

We currently only support the dummy coding technique. Dummy coding is used when a researcher wants to compare other groups of the predictor variable with one specific group of the predictor variable. Often, the specific group to compare with is called the reference group.

create_indicator_variables(
    source_table,
    output_table,
    categorical_cols,
    keep_null,
    distributed_by
    )

Arguments

source_table
VARCHAR. Name of the source table, containing data for categorical variables.
output_table
VARCHAR. Name of result table. The output table has the same columns as the original table, adding new indicator variable columns for each categorical column. The column name for the indicator variable is 'categorical column name'_'categorical value'.
categorical_cols
VARCHAR. Comma-separated string of column names of categorical variables that need to be dummy-coded.
keep_null (optional)
BOOLEAN. default: FALSE. Whether 'NULL' should be treated as one of the categories of the categorical variable. If True, then an indicator variable is created corresponding to the NULL value. If False, then all indicator variables for that record will be set to NULL.
distributed_by (optional)
VARCHAR. default: NULL. Columns to use for the distribution policy of the output table. When NULL, the distribution policy of 'source_table' will be used. This argument is not available for POSTGRESQL platforms.

Examples
  1. Use a subset of the abalone dataset.
    DROP TABLE IF EXISTS abalone;
    CREATE TABLE abalone (
        sex character varying,
        length double precision,
        diameter double precision,
        height double precision
    );
    COPY abalone (sex, length, diameter, height) FROM stdin WITH DELIMITER '|' NULL as '@';
    M| 0.455 |   0.365 | 0.095
    F| 0.53  |   0.42  | 0.135
    M| 0.35  |   0.265 | 0.09
    F| 0.53  |   0.415 | 0.15
    M| 0.44  |   0.365 | 0.125
    F| 0.545 |   0.425 | 0.125
    I| 0.33  |   0.255 | 0.08
    F| 0.55  |   0.44  | 0.15
    I| 0.425 |   0.30  | 0.095
    F| 0.525 |   0.38  | 0.140
    M| 0.475 |   0.37  | 0.125
    F| 0.535 |   0.405 | 0.145
    M| 0.43  |   0.358 | 0.11
    F| 0.47  |   0.355 | 0.100
    M| 0.49  |   0.38  | 0.135
    F| 0.44  |   0.340 | 0.100
    M| 0.5   |   0.400 | 0.13
    F| 0.565 |   0.44  | 0.155
    I| 0.355 |   0.280 | 0.085
    F| 0.550 |   0.415 | 0.135
    | 0.475 |   0.37  | 0.125
    \.
    
  2. Create new table with dummy-coded indicator variables
    drop table if exists abalone_out;
    select madlib.create_indicator_variables ('abalone', 'abalone_out', 'sex');
    select * from abalone_out;
    
     sex  | length | diameter | height | sex_F  | sex_I  | sex_M
      -----+--------+----------+--------+--------+--------+-------
     F    |   0.53 |     0.42 |  0.135 |      1 |      0 |     0
     F    |   0.53 |    0.415 |   0.15 |      1 |      0 |     0
     F    |  0.545 |    0.425 |  0.125 |      1 |      0 |     0
     F    |   0.55 |     0.44 |   0.15 |      1 |      0 |     0
     F    |  0.525 |     0.38 |   0.14 |      1 |      0 |     0
     F    |  0.535 |    0.405 |  0.145 |      1 |      0 |     0
     F    |   0.47 |    0.355 |    0.1 |      1 |      0 |     0
     F    |   0.44 |     0.34 |    0.1 |      1 |      0 |     0
     F    |  0.565 |     0.44 |  0.155 |      1 |      0 |     0
     F    |   0.55 |    0.415 |  0.135 |      1 |      0 |     0
     M    |  0.455 |    0.365 |  0.095 |      0 |      0 |     1
     M    |   0.35 |    0.265 |   0.09 |      0 |      0 |     0
     M    |   0.44 |    0.365 |  0.125 |      0 |      0 |     0
     I    |   0.33 |    0.255 |   0.08 |      0 |      1 |     0
     I    |  0.425 |      0.3 |  0.095 |      0 |      1 |     0
     M    |  0.475 |     0.37 |  0.125 |      0 |      0 |     0
     M    |   0.43 |    0.358 |   0.11 |      0 |      0 |     0
     M    |   0.49 |     0.38 |  0.135 |      0 |      0 |     0
     M    |    0.5 |      0.4 |   0.13 |      0 |      0 |     0
     I    |  0.355 |     0.28 |  0.085 |      0 |      1 |     0
     NULL |   0.55 |    0.415 |  0.135 |   NULL |   NULL |  NULL
    
  3. Create indicator variable for 'NULL' value (note the additional column '"sex_NULL"')
    drop table if exists abalone_out;
    select madlib.create_indicator_variables'abalone', 'abalone_out', 'sex', True);
    select * from abalone_out;
    
     sex  | length | diameter | height | sex_F  | sex_I  | sex_M | sex_NULL
      ---—+-----—+-------—+-----—+-----—+-----—+----—+----—
     F    |   0.53 |     0.42 |  0.135 |      1 |      0 |     0 |     0
     F    |   0.53 |    0.415 |   0.15 |      1 |      0 |     0 |     0
     F    |  0.545 |    0.425 |  0.125 |      1 |      0 |     0 |     0
     F    |   0.55 |     0.44 |   0.15 |      1 |      0 |     0 |     0
     F    |  0.525 |     0.38 |   0.14 |      1 |      0 |     0 |     0
     F    |  0.535 |    0.405 |  0.145 |      1 |      0 |     0 |     0
     F    |   0.47 |    0.355 |    0.1 |      1 |      0 |     0 |     0
     F    |   0.44 |     0.34 |    0.1 |      1 |      0 |     0 |     0
     F    |  0.565 |     0.44 |  0.155 |      1 |      0 |     0 |     0
     F    |   0.55 |    0.415 |  0.135 |      1 |      0 |     0 |     0
     M    |  0.455 |    0.365 |  0.095 |      0 |      0 |     1 |     0
     M    |   0.35 |    0.265 |   0.09 |      0 |      0 |     0 |     0
     M    |   0.44 |    0.365 |  0.125 |      0 |      0 |     0 |     0
     I    |   0.33 |    0.255 |   0.08 |      0 |      1 |     0 |     0
     I    |  0.425 |      0.3 |  0.095 |      0 |      1 |     0 |     0
     M    |  0.475 |     0.37 |  0.125 |      0 |      0 |     0 |     0
     M    |   0.43 |    0.358 |   0.11 |      0 |      0 |     0 |     0
     M    |   0.49 |     0.38 |  0.135 |      0 |      0 |     0 |     0
     M    |    0.5 |      0.4 |   0.13 |      0 |      0 |     0 |     0
     I    |  0.355 |     0.28 |  0.085 |      0 |      1 |     0 |     0
     NULL |   0.55 |    0.415 |  0.135 |      0 |      0 |     0 |     1