1.15.1
User Documentation for Apache MADlib

A decision tree is a supervised learning method that can be used for classification and regression. It consists of a structure in which internal nodes represent tests on attributes, and the branches from nodes represent the result of those tests. Each leaf node is a class label and the paths from root to leaf nodes define the set of classification or regression rules.

Training Function
We implement the decision tree using the CART algorithm introduced by Breiman et al. [1]. The training function has the following syntax:
tree_train(
    training_table_name,
    output_table_name,
    id_col_name,
    dependent_variable,
    list_of_features,
    list_of_features_to_exclude,
    split_criterion,
    grouping_cols,
    weights,
    max_depth,
    min_split,
    min_bucket,
    num_splits,
    pruning_params,
    null_handling_params,
    verbosity
    )
Arguments
training_table_name

TEXT. Name of the table containing the training data.

output_table_name

TEXT. Name of the generated table containing the model. If a table with the same name already exists, an error will be returned. A summary table named <output_table_name>_summary is also created. A cross-validation table <output_table_name>_cv may also be created. These are described later on this page.

id_col_name

TEXT. Name of the column containing id information in the training data. This is a mandatory argument and is used for prediction and cross-validation. The values are expected to be unique for each row.

dependent_variable

TEXT. Name of the column that contains the output (response) for training. Boolean, integer and text types are considered to be classification outputs, while double precision values are considered to be regression outputs. The response variable for a classification tree can be multinomial, but the time and space complexity of the training function increases linearly as the number of response classes increases.

list_of_features

TEXT. Comma-separated string of column names or expressions to use as predictors. Can also be a '*' implying all columns are to be used as predictors (except for the ones included in the next argument that lists exclusions). The types of the features can be mixed: boolean, integer, and text columns are considered categorical and double precision columns are considered continuous. Categorical variables are not encoded and used as is in the training.

Array columns can also be included in the list, where the array is expanded to treat each element of the array as a feature.

Note that not every combination of the levels of a categorical variable is checked when evaluating a split. The levels of the non-integer categorical variable are ordered by the entropy of the variable in predicting the response. The split at each node is evaluated between these ordered levels. Integer categorical variables, however, are simply ordered by their value.

list_of_features_to_exclude

TEXT. Comma-separated string of column names to exclude from the predictors list. If the dependent_variable is an expression (including cast of a column name), then this list should include the columns present in the dependent_variable expression, otherwise those columns will be included in the features. The names in this parameter should be identical to the names used in the table and quoted appropriately.

split_criterion

TEXT, default = 'gini' for classification, 'mse' for regression. Impurity function to compute the feature to use to split a node. Supported criteria are 'gini', 'entropy', 'misclassification' for classification trees. For regression trees, split_criterion of 'mse' (mean-squared error) is always used, irrespective of the input for this argument. Refer to reference [1] for more information on impurity measures.

grouping_cols (optional)

TEXT, default: NULL. Comma-separated list of column names to group the data by. This will produce multiple decision trees, one for each group.

weights (optional)

TEXT. Column name containing numerical weights for each observation. Can be any value greater than 0 (does not need to be an integer). This can be used to handle the case of unbalanced data sets. The weights are used to compute a weighted average in the output leaf node. For classification, the contribution of a row towards the vote of its corresponding level is multiplied by the weight (weighted mode). For regression, the output value of the row is multiplied by the weight (weighted mean).

max_depth (optional)

INTEGER, default: 7. Maximum depth of any node of the final tree, with the root node counted as depth 0. A deeper tree can lead to better prediction but will also result in longer processing time and higher memory usage.

min_split (optional)

INTEGER, default: 20. Minimum number of observations that must exist in a node for a split to be attempted. The best value for this parameter depends on the number of tuples in the dataset.

min_bucket (optional)

INTEGER, default: min_split/3. Minimum number of observations in any terminal node. If only one of min_bucket or min_split is specified, min_split is set to min_bucket*3 or min_bucket to min_split/3, as appropriate.

num_splits (optional)

INTEGER, default: 20. Continuous-valued features are binned into discrete quantiles to compute split boundaries. This global parameter is used to compute the resolution of splits for continuous features. Higher number of bins will lead to better prediction, but will also result in longer processing time and higher memory usage.

pruning_params (optional)

TEXT. Comma-separated string of key-value pairs giving the parameters for pruning the tree.

cp

Default: 0. Complexity parameter. A split on a node is attempted only if it decreases the overall lack of fit by a factor of 'cp', otherwise the split is pruned away. This value is used to create an initial tree before running cross-validation (see below).

n_folds

Default: 0 (i.e. no cross-validation). Number of cross-validation folds to use to compute the best value of cp. To perform cross-validation, a positive value of n_folds (2 or more) should be specified. An additional output table <model_table>_cv is created containing the values of evaluated cp and the cross-validation error statistics. The tree returned in the output table corresponds to the cp with the lowest cross-validation error (we pick the maximum cp if multiple values have same error).

The list of cp values is automatically computed by parsing through the tree initially trained on the complete dataset. The tree output is a subset of this initial tree corresponding to the best computed cp.

null_handling_params (optional)

TEXT. Comma-separated string of key-value pairs controlling the behavior of various features handling missing values. One of the following can be used if desired (not both):

max_surrogates Default: 0. Number of surrogates to store for each node. One approach to handling NULLs is to use surrogate splits for each node. A surrogate variable enables you to make better use of the data by using another predictor variable that is associated (correlated) with the primary split variable. The surrogate variable comes into use when the primary predictior value is NULL. Surrogate rules implemented here are based on reference [1].
null_as_category

Default: FALSE. Whether to treat NULL as a valid level for categorical features. FALSE means that NULL is not a valid level, which is probably the most common sitation.

If set to TRUE, NULL values are considered a categorical value and placed at the end of the ordering of categorical levels. Placing at the end ensures that NULL is never used as a value to split a node on. One reason to make NULL a category is that it allows you to predict on categorical levels that were not in the training data by lumping them into an "other bucket."

This parameter is ignored for continuous-valued features.

verbosity (optional)
BOOLEAN, default: FALSE. Provides verbose output of the training result.

Output

The model table produced by the training function contains the following columns:

<...> Grouping columns, if provided as input, in the same types as the training table. This could be multiple columns depending on the grouping_cols input.
tree BYTEA8. Trained decision tree model stored in binary format (not human readable).
cat_levels_in_text TEXT[]. Ordered levels (values) of categorical variables corresponding to the categorical features in the 'list_of_features' argument above. Used to help interpret the trained decision tree. For example, if the categorical features specified are weather_outlook and windy in that order, then 'cat_levels_in_text' might be [overcast, rain, sunny, False, True].
cat_n_levels INTEGER[]. Number of levels for each categorical variable. Used to help interpret the trained decision tree. In the example from above, 'cat_n_levels' would be [3, 2] since there are 3 levels for weather_outlook and 2 levels windy.
impurity_var_importance

DOUBLE PRECISION[]. Impurity importance of each variable. The order of the variables is the same as that of the 'independent_varnames' column in the summary table (see below).

The impurity importance of any feature is the decrease in impurity by a node containing the feature as a primary split, summed over the whole tree. If surrogates are used, then the importance value includes the impurity decrease scaled by the adjusted surrogate agreement. Importance values are displayed as raw values as per the 'split_criterion' parameter. To see importance values normalized to sum to 100 across all variables, use the importance display helper function described later on this page. Please refer to [1] for more information on variable importance.

tree_depth

INTEGER. The maximum depth the tree obtained after training (root has depth 0).

pruning_cp

DOUBLE PRECISION. The cost complexity parameter used for pruning the trained tree(s). This could be different than the cp value input using the pruning_params if cross-validation is used.

A summary table named <output_table_name>_summary is also created at the same time, which has the following columns:

method

TEXT. 'tree_train'

is_classification

BOOLEAN. TRUE if the decision trees are for classification, FALSE if for regression.

source_table

TEXT. The data source table name.

model_table

TEXT. The model table name.

id_col_name

TEXT. The ID column name.

list_of_features

TEXT. The list_of_features inputed to the 'tree_train' procedure.

list_of_features_to_exclude

TEXT. The list_of_features_to_exclude inputed to the 'tree_train' procedure.

dependent_varname

TEXT. The dependent variable.

independent_varnames

TEXT. The independent variables. These are the features used in the training of the decision tree.

cat_features

TEXT. The list of categorical feature names as a comma-separated string.

con_features

TEXT. The list of continuous feature names as a comma-separated string.

grouping_cols

TEXT. Names of grouping columns.

num_all_groups

INTEGER. Number of groups in decision tree training.

num_failed_groups

INTEGER. Number of failed groups in decision tree training.

total_rows_processed

BIGINT. Total numbers of rows processed in all groups.

total_rows_skipped

BIGINT. Total numbers of rows skipped in all groups due to missing values or failures.

dependent_var_levels

TEXT. For classification, the distinct levels of the dependent variable.

dependent_var_type

TEXT. The type of dependent variable.

input_cp

DOUBLE PRECISION. The complexity parameter (cp) used for pruning the trained tree(s) before cross-validation is run. This is same as the cp value input using the pruning_params.

independent_var_types

TEXT. A comma separated string for the types of independent variables.

n_folds

BIGINT. Number of cross-validation folds used.

null_proxy TEXT. Describes how NULLs are handled. If NULL is not treated as a separate categorical variable, this will be NULL. If NULL is treated as a separate categorical value, this will be set to "__NULL__"

A cross-validation table called <output_table_name>_cv is created if 'n_folds' is set in the 'pruning_params'. It has the following columns:

cp

DOUBLE PRECISION. Complexity parameter.

cv_error_avg

DOUBLE PRECISION. Average error resulting from cp value.

cv_error_stdev DOUBLE PRECISION. Standard deviation resulting from cp value.
Note
  • Many of the parameters are designed to be similar to the popular R package 'rpart'. An important distinction between rpart and the MADlib function is that for both response and feature variables, MADlib considers integer values as categorical values, while rpart considers them as continuous. To use integers as continuous, cast them to double precision.
  • Integer values are ordered by value for computing the split boundaries. Cast to TEXT if the entropy-based ordering method is desired.
  • When cross-validation is not used (n_folds=0), each tree output is pruned by the input cost complexity (cp). With cross-validation, the input cp is the minimum value of all the explored values of 'cp'. During cross-validation, we train an initial tree using the provided cp and explore all possible sub-trees (up to a single-node tree) to compute the optimal sub-tree. The optimal sub-tree and the 'cp' corresponding to this optimal sub-tree is placed in the output_table, with the columns named as tree and pruning_cp respectively.
  • The main parameters that affect memory usage are: depth of tree (‘max_depth’), number of features, number of values per categorical feature, and number of bins for continuous features (‘num_splits’). If you are hitting memory limits, consider reducing one or more of these parameters.

Prediction Function
The prediction function estimates the conditional mean given a new predictor. It has the following syntax:
tree_predict(tree_model,
             new_data_table,
             output_table,
             type)

Arguments

tree_model

TEXT. Name of the table containing the decision tree model. This should be the output table returned from tree_train.

new_data_table

TEXT. Name of the table containing prediction data. This table is expected to contain the same features that were used during training. The table should also contain id_col_name used for identifying each row.

output_table

TEXT. Name of the table to output prediction results. If this table already exists, an error is returned. The table contains the id_col_name column giving the 'id' for each prediction and the prediction columns for the dependent variable.

If type = 'response', then the table has a single additional column with the prediction value of the response. The type of this column depends on the type of the response variable used during training.

If type = 'prob', then the table has multiple additional columns, one for each possible value of the response variable. The columns are labeled as 'estimated_prob_dep_value', where dep_value represents each value of the response variable.

type
TEXT, optional, default: 'response'. For regression trees, the output is always the predicted value of the dependent variable. For classification trees, the type variable can be 'response', giving the classification prediction as output, or 'prob', giving the class probabilities as output. For each value of the dependent variable, a column with the probabilities is added to the output table.

Tree Display
The display function outputs a graph representation of the decision tree. The output can either be in the popular 'dot' format that can be visualized using various programs including those in the GraphViz package, or in a simple text format. The details of the text format are output with the tree.
tree_display(tree_model, dot_format, verbosity)

An additional display function is provided to output the surrogate splits chosen for each internal node:

tree_surr_display(tree_model)

This output contains the list of surrogate splits for each internal node. The nodes are sorted in ascending order by id. This is equivalent to viewing the tree in a breadth-first manner. For each surrogate, we output the surrogate split (variable and threshold) and also give the number of rows that were common between the primary split and the surrogate split. Finally, the number of rows present in the majority branch of the primary split is also shown. Only surrogates that perform better than this majority branch are included in the surrogate list. When the primary variable has a NULL value the surrogate variables are used in order to compute the split for that node. If all surrogates variables are NULL, then the majority branch is used to compute the split for a tuple.

Arguments

tree_model
TEXT. Name of the table containing the decision tree model.
dot_format
BOOLEAN, default = TRUE. Output can either be in a dot format or a text format. If TRUE, the result is in the dot format, else output is in text format.
verbosity
BOOLEAN, default = FALSE. If set to TRUE, the dot format output will contain additional information (impurity, sample size, number of weighted rows for each response variable, classification or prediction if the tree was pruned at this level)

The output is always returned as a 'TEXT'. For the dot format, the output can be redirected to a file on the client side and then rendered using visualization programs.

To export the dot format result to an external file, use the method below. Please note that you should use unaligned table output mode for psql with '-A' flag, or else you may get an error when you try to convert the dot file to another format for viewing (e.g., PDF). And inside the psql client, both '\t' and '\o' should be used:

> # under bash
> psql -A my_database
# -- in psql now
# \t
# \o test.dot -- export to a file
# select madlib.tree_display('tree_out');
# \o
# \t

After the dot file has been generated, use third-party plotting software to plot the trees in a nice format:

> # under bash, convert the dot file into a PDF file
> dot -Tpdf test.dot > test.pdf
> xpdf test.pdf&

Please see the examples below for more details on the contents of the tree output formats.

An additional display function is provided to output the surrogate splits chosen for each internal node:

tree_surr_display(tree_model)

Importance Display

This is a helper function that creates a table to more easily view impurity variable importance values for a given model table. This function rescales the importance values to represent them as percentages i.e. importance values are scaled to sum to 100.

get_var_importance(model_table, output_table)

Arguments

model_table
TEXT. Name of the table containing the decision tree model.
output_table
TEXT. Name of the table to create for importance values.

The summary table generated by the tree_train function is necessary for this function to work.

Examples

Decision Tree Classification Examples

  1. Load input data set related to whether to play golf or not:
    DROP TABLE IF EXISTS dt_golf CASCADE;
    CREATE TABLE dt_golf (
        id integer NOT NULL,
        "OUTLOOK" text,
        temperature double precision,
        humidity double precision,
        "Temp_Humidity" double precision[],
        clouds_airquality text[],
        windy boolean,
        class text,
        observation_weight double precision
    );
    INSERT INTO dt_golf VALUES
    (1,'sunny', 85, 85, ARRAY[85, 85],ARRAY['none', 'unhealthy'], 'false','Don''t Play', 5.0),
    (2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['none', 'moderate'], 'true', 'Don''t Play', 5.0),
    (3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['low', 'moderate'], 'false', 'Play', 1.5),
    (4, 'rain', 70, 96, ARRAY[70, 96], ARRAY['low', 'moderate'], 'false', 'Play', 1.0),
    (5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0),
    (6, 'rain', 65, 70, ARRAY[65, 70], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0),
    (7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5),
    (8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['high', 'unhealthy'], 'false', 'Don''t Play', 5.0),
    (9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['high', 'good'], 'false', 'Play', 5.0),
    (10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0),
    (11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['none', 'good'], 'true', 'Play', 5.0),
    (12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5),
    (13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['medium', 'moderate'], 'false', 'Play', 1.5),
    (14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0);
    
  2. Run the decision tree training function:
    DROP TABLE IF EXISTS train_output, train_output_summary;
    SELECT madlib.tree_train('dt_golf',         -- source table
                             'train_output',    -- output model table
                             'id',              -- id column
                             'class',           -- response
                             '"OUTLOOK", temperature, windy',   -- features
                             NULL::text,        -- exclude columns
                             'gini',            -- split criterion
                             NULL::text,        -- no grouping
                             NULL::text,        -- no weights, all observations treated equally
                             5,                 -- max depth
                             3,                 -- min split
                             1,                 -- min bucket
                             10                 -- number of bins per continuous variable
                             );
    
    View the output table (excluding the tree which is in binary format):
    \x on
    SELECT pruning_cp, cat_levels_in_text, cat_n_levels, impurity_var_importance, tree_depth FROM train_output;
    
    -[ RECORD 1 ]-----------+--------------------------------------
    pruning_cp              | 0
    cat_levels_in_text      | {overcast,rain,sunny,False,True}
    cat_n_levels            | {3,2}
    impurity_var_importance | {0.102040816326531,0,0.85905612244898}
    tree_depth              | 5
    
    View the summary table:
    \x on
    SELECT * FROM train_output_summary;
    
    -[ RECORD 1 ]---------------+--------------------------------
    method                      | tree_train
    is_classification           | t
    source_table                | dt_golf
    model_table                 | train_output
    id_col_name                 | id
    list_of_features            | "OUTLOOK", temperature, windy
    list_of_features_to_exclude | None
    dependent_varname           | class
    independent_varnames        | "OUTLOOK",windy,temperature
    cat_features                | "OUTLOOK",windy
    con_features                | temperature
    grouping_cols               |
    num_all_groups              | 1
    num_failed_groups           | 0
    total_rows_processed        | 14
    total_rows_skipped          | 0
    dependent_var_levels        | "Don't Play","Play"
    dependent_var_type          | text
    input_cp                    | 0
    independent_var_types       | text, boolean, double precision
    n_folds                     | 0
    null_proxy                  |
    
    View the normalized impurity importance table using the helper function:
    \x off
    DROP TABLE IF EXISTS imp_output;
    SELECT madlib.get_var_importance('train_output','imp_output');
    SELECT * FROM imp_output;
    
       feature   | impurity_var_importance
    -------------+-------------------------
     "OUTLOOK"   |        10.6171090593052
     windy       |                       0
     temperature |         89.382786893026
    
  3. Predict output categories. For the purpose of this example, we use the same data that was used for training:
    \x off
    DROP TABLE IF EXISTS prediction_results;
    SELECT madlib.tree_predict('train_output',          -- tree model
                               'dt_golf',               -- new data table
                               'prediction_results',    -- output table
                               'response');             -- show response
    SELECT g.id, class, estimated_class FROM prediction_results p,
    dt_golf g WHERE p.id = g.id ORDER BY g.id;
    
     id |   class    | estimated_class
    ----+------------+-----------------
      1 | Don't Play | Don't Play
      2 | Don't Play | Don't Play
      3 | Play       | Play
      4 | Play       | Play
      5 | Play       | Play
      6 | Don't Play | Don't Play
      7 | Play       | Play
      8 | Don't Play | Don't Play
      9 | Play       | Play
     10 | Play       | Play
     11 | Play       | Play
     12 | Play       | Play
     13 | Play       | Play
     14 | Don't Play | Don't Play
    (14 rows)
    
    To display the probabilities associated with each value of the dependent variable, set the 'type' parameter to 'prob':
    DROP TABLE IF EXISTS prediction_results;
    SELECT madlib.tree_predict('train_output',          -- tree model
                               'dt_golf',               -- new data table
                               'prediction_results',    -- output table
                               'prob');                 -- show probability
    SELECT g.id, class, "estimated_prob_Don't Play",  "estimated_prob_Play"
    FROM prediction_results p, dt_golf g WHERE p.id = g.id ORDER BY g.id;
    
     id |   class    | estimated_prob_Don't Play | estimated_prob_Play
    ----+------------+---------------------------+---------------------
      1 | Don't Play |                         1 |                   0
      2 | Don't Play |                         1 |                   0
      3 | Play       |                         0 |                   1
      4 | Play       |                         0 |                   1
      5 | Play       |                         0 |                   1
      6 | Don't Play |                         1 |                   0
      7 | Play       |                         0 |                   1
      8 | Don't Play |                         1 |                   0
      9 | Play       |                         0 |                   1
     10 | Play       |                         0 |                   1
     11 | Play       |                         0 |                   1
     12 | Play       |                         0 |                   1
     13 | Play       |                         0 |                   1
     14 | Don't Play |                         1 |                   0
    (14 rows)
    
  4. View the tree in text format:
    SELECT madlib.tree_display('train_output', FALSE);
    
     -------------------------------------
     - Each node represented by 'id' inside ().
     - Each internal nodes has the split condition at the end, while each
            leaf node has a * at the end.
     - For each internal node (i), its child nodes are indented by 1 level
            with ids (2i+1) for True node and (2i+2) for False node.
     - Number of (weighted) rows for each response variable inside [].'
            The response label order is given as ['"\'Don\'t Play\'"', '"\'Play\'"'].
            For each leaf, the prediction is given after the '-->'
     -------------------------------------
     (0)[5 9]  "OUTLOOK" in {overcast}
        (1)[0 4]  * --> "Play"
        (2)[5 5]  temperature <= 75
           (5)[3 5]  temperature <= 65
              (11)[1 0]  * --> "Don't Play"
              (12)[2 5]  temperature <= 70
                 (25)[0 3]  * --> "Play"
                 (26)[2 2]  temperature <= 72
                    (53)[2 0]  * --> "Don't Play"
                    (54)[0 2]  * --> "Play"
           (6)[2 0]  * --> "Don't Play"
     -------------------------------------
    
    Here are some details on how to interpret the tree display above:
    • Node numbering starts at 0 for the root node and would be contiguous 1,2...n if the tree was completely full (no pruning). Since the tree has been pruned, the node numbering is not contiguous.
    • The order of values [x y] indicates the number of weighted rows that correspond to ["Don't play" "Play"] before the node test. For example, at (root) node 0, there are 5 rows for "Don't play" and 9 rows for "Play" in the raw data.
    • If we apply the test of "OUTLOOK" being overcast, then the True (yes) result is leaf node 1 which is "Play". There are 0 "Don't play" rows and 4 "Play" rows that correspond to this case (overcast). In other words, if it is overcast, you always play golf. If it is not overcast, you may or may not play golf, depending on the rest of the tree.
    • The remaining 5 "Don't play" rows and 5 "Play rows" are then tested at node 2 on temperature<=75. The False (no) result is leaf node 6 which is "Don't Play". The True (yes) result proceeds to leaf node 5 to test on temperature<=65. And so on down the tree.
    • Creating a dot format visualization of the tree, as described below, can help with following the decision flows.
  5. Create a dot format display of the tree:
    SELECT madlib.tree_display('train_output', TRUE);
    
     digraph "Classification tree for dt_golf" {
              subgraph "cluster0"{
              label=""
     "g0_0" [label="\"OUTLOOK" <= overcast", shape=ellipse];
     "g0_0" -> "g0_1"[label="yes"];
     "g0_1" [label=""Play"",shape=box];
     "g0_0" -> "g0_2"[label="no"];
     "g0_2" [label="temperature <= 75", shape=ellipse];
     "g0_2" -> "g0_5"[label="yes"];
     "g0_2" -> "g0_6"[label="no"];
     "g0_6" [label=""Don't Play"",shape=box];
     "g0_5" [label="temperature <= 65", shape=ellipse];
     "g0_5" -> "g0_11"[label="yes"];
     "g0_11" [label=""Don't Play"",shape=box];
     "g0_5" -> "g0_12"[label="no"];
     "g0_12" [label="temperature <= 70", shape=ellipse];
     "g0_12" -> "g0_25"[label="yes"];
     "g0_25" [label=""Play"",shape=box];
     "g0_12" -> "g0_26"[label="no"];
     "g0_26" [label="temperature <= 72", shape=ellipse];
     "g0_26" -> "g0_53"[label="yes"];
     "g0_53" [label=""Don't Play"",shape=box];
     "g0_26" -> "g0_54"[label="no"];
     "g0_54" [label=""Play"",shape=box];
       } //--- end of subgraph------------
     } //---end of digraph---------
    
    One important difference to note about the dot format above is how categorical variable tests are displayed:
    • In the text format of the tree, the node 0 test is "OUTLOOK" in {overcast}, but in the dot format of the tree, the same node 0 test reads "\"OUTLOOK" <= overcast". This is because in dot format for categorical variables, the '<=' symbol represents the location in the array 'cat_levels_in_text' from the output table for the "OUTLOOK" levels. The array is ['overcast', 'rain', 'sunny', 'False', 'True'] with the first 3 entries corresponding to "OUTLOOK" and the last 2 entries corresponding to 'windy'. So the test "\"OUTLOOK" <= overcast" means all "OUTLOOK" levels to the left of, and including, 'overcast'. In this case there are no levels to the left of 'overcast' in the array so it is simply a test on whether it is overcast or not.
    • If there was a test "\"OUTLOOK" <= rain", this would include both 'overcast' and 'rain', since 'overcast' is to the left of 'rain' in the array.
    • If there was a test "windy <= True", this would include both 'False' and 'True', since 'False' is to the left of 'True' in the array.
  6. Now create a dot format display of the tree with additional information:
    SELECT madlib.tree_display('train_output', TRUE, TRUE);
    
     digraph "Classification tree for dt_golf" {
              subgraph "cluster0"{
              label=""
     "g0_0" [label="\"OUTLOOK" <= overcast\n impurity = 0.459184\n samples = 14\n value = [5 9]\n class = "Play"", shape=ellipse];
     "g0_0" -> "g0_1"[label="yes"];
     "g0_1" [label=""Play"\n impurity = 0\n samples = 4\n value = [0 4]",shape=box];
     "g0_0" -> "g0_2"[label="no"];
     "g0_2" [label="temperature <= 75\n impurity = 0.5\n samples = 10\n value = [5 5]\n class = "Don't Play"", shape=ellipse];
     "g0_2" -> "g0_5"[label="yes"];
     "g0_2" -> "g0_6"[label="no"];
     "g0_6" [label=""Don't Play"\n impurity = 0\n samples = 2\n value = [2 0]",shape=box];
     "g0_5" [label="temperature <= 65\n impurity = 0.46875\n samples = 8\n value = [3 5]\n class = "Play"", shape=ellipse];
     "g0_5" -> "g0_11"[label="yes"];
     "g0_11" [label=""Don't Play"\n impurity = 0\n samples = 1\n value = [1 0]",shape=box];
     "g0_5" -> "g0_12"[label="no"];
     "g0_12" [label="temperature <= 70\n impurity = 0.408163\n samples = 7\n value = [2 5]\n class = "Play"", shape=ellipse];
     "g0_12" -> "g0_25"[label="yes"];
     "g0_25" [label=""Play"\n impurity = 0\n samples = 3\n value = [0 3]",shape=box];
     "g0_12" -> "g0_26"[label="no"];
     "g0_26" [label="temperature <= 72\n impurity = 0.5\n samples = 4\n value = [2 2]\n class = "Don't Play"", shape=ellipse];
     "g0_26" -> "g0_53"[label="yes"];
     "g0_53" [label=""Don't Play"\n impurity = 0\n samples = 2\n value = [2 0]",shape=box];
     "g0_26" -> "g0_54"[label="no"];
     "g0_54" [label=""Play"\n impurity = 0\n samples = 2\n value = [0 2]",shape=box];
       } //--- end of subgraph------------
     } //---end of digraph---------
    
    The additional information in each node is: impurity, sample size, number of weighted rows for each response variable, and classification if the tree was pruned at this level. If your tree is not too big, you may wish to convert the dot format to PDF or another format for better visualization of the tree structure.
  7. Arrays of features. Categorical and continuous features can be array columns, in which case the array is expanded to treat each element of the array as a feature:
    DROP TABLE IF EXISTS train_output, train_output_summary;
    SELECT madlib.tree_train('dt_golf',         -- source table
                             'train_output',    -- output model table
                             'id',              -- id column
                             'class',           -- response
                             '"Temp_Humidity", clouds_airquality',   -- features
                             NULL::text,        -- exclude columns
                             'gini',            -- split criterion
                             NULL::text,        -- no grouping
                             NULL::text,        -- no weights, all observations treated equally
                             5,                 -- max depth
                             3,                 -- min split
                             1,                 -- min bucket
                             10                 -- number of bins per continuous variable
                             );
    
    View the output table (excluding the tree which is in binary format):
    \x on
    SELECT pruning_cp, cat_levels_in_text, cat_n_levels, impurity_var_importance, tree_depth FROM train_output;
    
    -[ RECORD 1 ]-----------+-----------------------------------------------------
    pruning_cp              | 0
    cat_levels_in_text      | {medium,none,high,low,unhealthy,good,moderate}
    cat_n_levels            | {4,3}
    impurity_var_importance | {0,0.330612244897959,0.0466666666666666,0.444444444444444}
    tree_depth              | 3
    
    The first 4 levels correspond to cloud ceiling and the next 3 levels correspond to air quality.
  8. Weighting observations. Use the 'weights' parameter to adjust a row's vote to balance the dataset. In our example, the weights are somewhat random but show that a different decision tree is create compared to the case where no weights are used:
    DROP TABLE IF EXISTS train_output, train_output_summary;
    SELECT madlib.tree_train('dt_golf',         -- source table
                             'train_output',    -- output model table
                             'id',              -- id column
                             'class',           -- response
                             '"OUTLOOK", temperature, windy',   -- features
                             NULL::text,        -- exclude columns
                             'gini',            -- split criterion
                             NULL::text,        -- no grouping
                             'observation_weight', -- weight observations
                             5,                 -- max depth
                             3,                 -- min split
                             1,                 -- min bucket
                             10                 -- number of bins per continuous variable
                             );
    SELECT madlib.tree_display('train_output');
    
      -------------------------------------
      - Each node represented by 'id' inside ().
      - Each internal nodes has the split condition at the end, while each
             leaf node has a * at the end.
      - For each internal node (i), its child nodes are indented by 1 level
             with ids (2i+1) for True node and (2i+2) for False node.
      - Number of (weighted) rows for each response variable inside [].'
             The response label order is given as ['"Don\'t Play"', '"Play"'].
             For each leaf, the prediction is given after the '-->'
      -------------------------------------
     (0)[17 19]  temperature <= 75
        (1)[ 7 16]  temperature <= 72
           (3)[ 7 10]  temperature <= 70
              (7)[  1 8.5]  * --> "Play"
              (8)[  6 1.5]  "OUTLOOK" in {overcast}
                 (17)[  0 1.5]  * --> "Play"
                 (18)[6 0]  * --> "Don't Play"
           (4)[0 6]  * --> "Play"
        (2)[10  3]  "OUTLOOK" in {overcast}
           (5)[0 3]  * --> "Play"
           (6)[10  0]  * --> "Don't Play"
    

Decision Tree Regression Examples

  1. Load input data related to fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Data was extracted from the 1974 Motor Trend US magazine.
    DROP TABLE IF EXISTS mt_cars;
    CREATE TABLE mt_cars (
        id integer NOT NULL,
        mpg double precision,
        cyl integer,
        disp double precision,
        hp integer,
        drat double precision,
        wt double precision,
        qsec double precision,
        vs integer,
        am integer,
        gear integer,
        carb integer
    );
    INSERT INTO mt_cars VALUES
    (1,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2),
    (2,21,6,160,110,3.9,2.62,16.46,0,1,4,4),
    (3,24.4,4,146.7,62,3.69,3.19,20,1,0,4,2),
    (4,21,6,160,110,3.9,2.875,17.02,0,1,4,4),
    (5,17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4),
    (6,16.4,8,275.8,180,3.078,4.07,17.4,0,0,3,3),
    (7,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1),
    (8,17.3,8,275.8,180,3.078,3.73,17.6,0,0,3,3),
    (9,21.4,null,258,110,3.08,3.215,19.44,1,0,3,1),
    (10,15.2,8,275.8,180,3.078,3.78,18,0,0,3,3),
    (11,18.1,6,225,105,2.768,3.46,20.22,1,0,3,1),
    (12,32.4,4,78.7,66,4.08,2.20,19.47,1,1,4,1),
    (13,14.3,8,360,245,3.21,3.578,15.84,0,0,3,4),
    (14,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2),
    (15,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2),
    (16,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4),
    (17,33.9,4,71.14,65,4.22,1.835,19.9,1,1,4,1),
    (18,15.2,null,304,150,3.15,3.435,17.3,0,0,3,2),
    (19,10.4,8,472,205,2.93,5.25,17.98,0,0,3,4),
    (20,27.3,4,79,66,4.08,1.935,18.9,1,1,4,1),
    (21,10.4,8,460,215,3,5.424,17.82,0,0,3,4),
    (22,26,4,120.3,91,4.43,2.14,16.7,0,1,5,2),
    (23,14.7,8,440,230,3.23,5.345,17.42,0,0,3,4),
    (24,30.4,4,95.14,113,3.77,1.513,16.9,1,1,5,2),
    (25,21.5,4,120.1,97,3.70,2.465,20.01,1,0,3,1),
    (26,15.8,8,351,264,4.22,3.17,14.5,0,1,5,4),
    (27,15.5,8,318,150,2.768,3.52,16.87,0,0,3,2),
    (28,15,8,301,335,3.54,3.578,14.6,0,1,5,8),
    (29,13.3,8,350,245,3.73,3.84,15.41,0,0,3,4),
    (30,19.2,8,400,175,3.08,3.845,17.05,0,0,3,2),
    (31,19.7,6,145,175,3.62,2.77,15.5,0,1,5,6),
    (32,21.4,4,121,109,4.11,2.78,18.6,1,1,4,2);
    
  2. We train a regression decision tree with surrogates in order to handle the NULL feature values:
    DROP TABLE IF EXISTS train_output, train_output_summary, train_output_cv;
    SELECT madlib.tree_train('mt_cars',         -- source table
                             'train_output',    -- output model table
                             'id',              -- id column
                             'mpg',             -- dependent variable
                             '*',               -- features
                             'id, hp, drat, am, gear, carb',  -- exclude columns
                             'mse',             -- split criterion
                             NULL::text,        -- no grouping
                             NULL::text,        -- no weights, all observations treated equally
                             10,                -- max depth
                             8,                 -- min split
                             3,                 -- number of bins per continuous variable
                             10,                -- number of splits
                             NULL,              -- pruning parameters
                             'max_surrogates=2' -- number of surrogates
                             );
    
    View the output table (excluding the tree which is in binary format) which shows ordering of levels of categorical variables 'vs' and 'cyl':
    SELECT pruning_cp, cat_levels_in_text, cat_n_levels, impurity_var_importance, tree_depth FROM train_output;
    
    -[ RECORD 1 ]-----------+------------------------------------------------------------------------
    pruning_cp              | 0
    cat_levels_in_text      | {0,1,4,6,8}
    cat_n_levels            | {2,3}
    impurity_var_importance | {0,22.6309172500675,4.79024943310651,2.32115000000003,13.8967382920111}
    tree_depth              | 4
    
    View the summary table:
    \x on
    SELECT * FROM train_output_summary;
    
    -[ RECORD 1 ]---------------+-----------------------------------------------------------------------
    method                      | tree_train
    is_classification           | f
    source_table                | mt_cars
    model_table                 | train_output
    id_col_name                 | id
    list_of_features            | *
    list_of_features_to_exclude | id, hp, drat, am, gear, carb
    dependent_varname           | mpg
    independent_varnames        | vs,cyl,disp,qsec,wt
    cat_features                | vs,cyl
    con_features                | disp,qsec,wt
    grouping_cols               |
    num_all_groups              | 1
    num_failed_groups           | 0
    total_rows_processed        | 32
    total_rows_skipped          | 0
    dependent_var_levels        |
    dependent_var_type          | double precision
    input_cp                    | 0
    independent_var_types       | integer, integer, double precision, double precision, double precision
    n_folds                     | 0
    null_proxy                  |
    
    View the normalized impurity importance table using the helper function:
    \x off
    DROP TABLE IF EXISTS imp_output;
    SELECT madlib.get_var_importance('train_output','imp_output');
    SELECT * FROM imp_output ORDER BY impurity_var_importance DESC;
    
     feature | impurity_var_importance
    ---------+-------------------------
     cyl     |        51.8593190075796
     wt      |        31.8447271176382
     disp    |        10.9769776775887
     qsec    |        5.31897390566817
     vs      |                       0
    
  3. Predict regression output for the same data and compare with original:
    \x off
    DROP TABLE IF EXISTS prediction_results;
    SELECT madlib.tree_predict('train_output',
                               'mt_cars',
                               'prediction_results',
                               'response');
    SELECT s.id, mpg, estimated_mpg, mpg-estimated_mpg as delta
    FROM prediction_results p,
    mt_cars s WHERE s.id = p.id ORDER BY id;
    
    Result:
     id | mpg  |  estimated_mpg   |        delta
    ----+------+------------------+---------------------
      1 | 18.7 |            16.84 |                1.86
      2 |   21 | 19.7428571428571 |    1.25714285714286
      3 | 24.4 |            22.58 |                1.82
      4 |   21 | 19.7428571428571 |    1.25714285714286
      5 | 17.8 | 19.7428571428571 |   -1.94285714285714
      6 | 16.4 |            16.84 |  -0.439999999999998
      7 | 22.8 |            22.58 |   0.219999999999999
      8 | 17.3 |           13.325 |               3.975
      9 | 21.4 | 19.7428571428571 |    1.65714285714286
     10 | 15.2 |           13.325 |               1.875
     11 | 18.1 | 19.7428571428571 |   -1.64285714285714
     12 | 32.4 | 30.0666666666667 |    2.33333333333334
     13 | 14.3 |            14.78 |               -0.48
     14 | 22.8 |            22.58 |   0.219999999999999
     15 | 30.4 | 30.0666666666667 |   0.333333333333336
     16 | 19.2 | 19.7428571428571 |  -0.542857142857141
     17 | 33.9 | 30.0666666666667 |    3.83333333333334
     18 | 15.2 |            16.84 |               -1.64
     19 | 10.4 |           13.325 |              -2.925
     20 | 27.3 | 30.0666666666667 |   -2.76666666666666
     21 | 10.4 |           13.325 |              -2.925
     22 |   26 | 30.0666666666667 |   -4.06666666666666
     23 | 14.7 |            16.84 |               -2.14
     24 | 30.4 | 30.0666666666667 |   0.333333333333336
     25 | 21.5 |            22.58 |               -1.08
     26 | 15.8 |            14.78 |                1.02
     27 | 15.5 |            14.78 |   0.719999999999999
     28 |   15 |            14.78 |   0.219999999999999
     29 | 13.3 |            14.78 |               -1.48
     30 | 19.2 |            16.84 |                2.36
     31 | 19.7 | 19.7428571428571 | -0.0428571428571409
     32 | 21.4 |            22.58 |               -1.18
    (32 rows)
    
  4. Display the decision tree in basic text format:
    SELECT madlib.tree_display('train_output', FALSE);
    
      -------------------------------------
     - Each node represented by 'id' inside ().
     - Each internal nodes has the split condition at the end, while each
         leaf node has a * at the end.
     - For each internal node (i), its child nodes are indented by 1 level
         with ids (2i+1) for True node and (2i+2) for False node.
     - Number of rows and average response value inside []. For a leaf node, this is the prediction.
     -------------------------------------
     (0)[32, 20.0906]  cyl in {4}
        (1)[11, 26.6636]  wt <= 2.2
           (3)[6, 30.0667]  *
           (4)[5, 22.58]  *
        (2)[21, 16.6476]  disp <= 258
           (5)[7, 19.7429]  *
           (6)[14, 15.1]  qsec <= 17.42
              (13)[10, 15.81]  qsec <= 16.9
                 (27)[5, 14.78]  *
                 (28)[5, 16.84]  *
              (14)[4, 13.325]  *
      -------------------------------------
    (1 row)
    
  5. Display the surrogate variables that are used to compute the split for each node when the primary variable is NULL:
    SELECT madlib.tree_surr_display('train_output');
    
     -------------------------------------
           Surrogates for internal nodes
     -------------------------------------
     (0) cyl in {4}
          1: disp <= 146.7    [common rows = 29]
          2: vs in {1}    [common rows = 26]
          [Majority branch = 11 ]
     (1) wt <= 2.2
          [Majority branch = 19 ]
     (2) disp <= 258
          1: cyl in {4,6}    [common rows = 19]
          2: vs in {1}    [common rows = 18]
          [Majority branch = 7 ]
     (6) qsec <= 17.42
          1: disp > 275.8    [common rows = 11]
          2: vs in {0}    [common rows = 10]
          [Majority branch = 10 ]
     (13) qsec <= 16.9
          1: wt <= 3.84    [common rows = 8]
          2: disp <= 360    [common rows = 7]
          [Majority branch = 5 ]
     -------------------------------------
    (1 row)
    
    Note
    The 'cyl' parameter in the data set has two tuples with NULL values (id = 9 and id = 18). In the prediction based on this tree, the surrogate splits for the cyl in {4} split in node 0 are used to predict those two tuples. The splits are used in descending order until a surrogate variable is found that is not NULL. In this case, the two tuples have non-NULL values for disp, hence the disp <= 146.7 split is used to make the prediction. If all the surrogate variables are NULL then the majority branch would be followed.
  6. Now let's use cross validation to select the best value of the complexity parameter cp:
    DROP TABLE IF EXISTS train_output, train_output_summary, train_output_cv;
    SELECT madlib.tree_train('mt_cars',         -- source table
                             'train_output',    -- output model table
                             'id',              -- id column
                             'mpg',             -- dependent variable
                             '*',               -- features
                             'id, hp, drat, am, gear, carb',  -- exclude columns
                             'mse',             -- split criterion
                             NULL::text,        -- no grouping
                             NULL::text,        -- no weights, all observations treated equally
                             10,                -- max depth
                             8,                 -- min split
                             3,                 -- number of bins per continuous variable
                             10,                -- number of splits
                             'n_folds=3'       -- pruning parameters for cross validation
                             );
    
    View the output table (excluding the tree which is in binary format). The input cp value was 0 (default) and the best 'pruning_cp' value turns out to be 0 as well in this small example:
    SELECT pruning_cp, cat_levels_in_text, cat_n_levels, impurity_var_importance, tree_depth FROM train_output;
    
    -[ RECORD 1 ]-----------+-----------------------------------------------------------------------
    pruning_cp              | 0
    cat_levels_in_text      | {0,1,4,6,8}
    cat_n_levels            | {2,3}
    impurity_var_importance | {0,22.6309172500677,4.79024943310653,2.32115,13.8967382920109}
    tree_depth              | 4
    
    The cp values tested and average error and standard deviation are:
    SELECT * FROM train_output_cv ORDER BY cv_error_avg ASC;
    
             cp          |   cv_error_avg   | cv_error_stddev
    ---------------------+------------------+------------------
                       0 | 4.60222321567406 | 1.14990035501294
     0.00942145242026098 | 4.71906243157825 | 1.21587651168567
      0.0156685263245236 | 4.86688342751006 | 1.30225133441406
      0.0893348335770666 |  5.0608834230282 | 1.42488238861617
       0.135752855572154 | 5.33192746100332 | 1.62718329150341
       0.643125226048458 | 5.76814538295394 | 2.10750950120742
    (6 rows)
    

    NULL Handling Example

  1. Create toy example to illustrate 'null-as-category' handling for categorical features:
    DROP TABLE IF EXISTS null_handling_example;
    CREATE TABLE null_handling_example (
        id integer,
        country text,
        city text,
        weather text,
        response text
    );
    INSERT INTO null_handling_example VALUES
    (1,null,null,null,'a'),
    (2,'US',null,null,'b'),
    (3,'US','NY',null,'c'),
    (4,'US','NY','rainy','d');
    
  2. Train decision tree. Note that 'NULL' is set as a valid level for the categorical features country, weather and city:
    DROP TABLE IF EXISTS train_output, train_output_summary;
    SELECT madlib.tree_train('null_handling_example',         -- source table
                             'train_output',                  -- output model table
                             'id',                            -- id column
                             'response',                      -- dependent variable
                             'country, weather, city',        -- features
                             NULL,                            -- features to exclude
                             'gini',                          -- split criterion
                             NULL::text,                      -- no grouping
                             NULL::text,                      -- no weights, all observations treated equally
                             4,                               -- max depth
                             1,                               -- min split
                             1,                               -- number of bins per continuous variable
                             10,                              -- number of splits
                             NULL,                            -- pruning parameters
                             'null_as_category=true'          -- null handling
                             );
    SELECT cat_levels_in_text, cat_n_levels FROM train_output;
    
                cat_levels_in_text            | cat_n_levels
    ------------------------------------------+--------------
     {US,__NULL__,rainy,__NULL__,NY,__NULL__} | {2,2,2}
    
    View the summary table:
    \x on
    SELECT * FROM train_output_summary;
    
    -[ RECORD 1 ]---------------+-----------------------
    method                      | tree_train
    is_classification           | t
    source_table                | null_handling_example
    model_table                 | train_output
    id_col_name                 | id
    list_of_features            | country, weather, city
    list_of_features_to_exclude | None
    dependent_varname           | response
    independent_varnames        | country,weather,city
    cat_features                | country,weather,city
    con_features                |
    grouping_cols               | [NULL]
    num_all_groups              | 1
    num_failed_groups           | 0
    total_rows_processed        | 4
    total_rows_skipped          | 0
    dependent_var_levels        | "a","b","c","d"
    dependent_var_type          | text
    input_cp                    | 0
    independent_var_types       | text, text, text
    n_folds                     | 0
    null_proxy                  | __NULL__
    
  3. Predict for data not previously seen by assuming NULL value as the default:
    \x off
    DROP TABLE IF EXISTS table_test;
    CREATE TABLE table_test (
        id integer,
        country text,
        city text,
        weather text,
        expected_response text
    );
    INSERT INTO table_test VALUES
    (1,'IN','MUM','cloudy','a'),
    (2,'US','HOU','humid','b'),
    (3,'US','NY','sunny','c'),
    (4,'US','NY','rainy','d');
     
    DROP TABLE IF EXISTS prediction_results;
    SELECT madlib.tree_predict('train_output',
                               'table_test',
                               'prediction_results',
                               'response');
    SELECT s.id, expected_response, estimated_response
    FROM prediction_results p, table_test s
    WHERE s.id = p.id ORDER BY id;
    
     id | expected_response | estimated_response
    ----+-------------------+--------------------
      1 | a                 | a
      2 | b                 | b
      3 | c                 | c
      4 | d                 | d
    (4 rows)
    
    There is only training data for country 'US' so the response for country 'IN' is 'a', corresponding to a NULL (not 'US') country level. Likewise, any city in the 'US' that is not 'NY' will predict response 'b', corresponding to a NULL (not 'NY') city level.
  4. Display the decision tree in basic text format:
    SELECT madlib.tree_display('train_output', FALSE);
    
      -------------------------------------
     - Each node represented by 'id' inside ().
     - Each internal nodes has the split condition at the end, while each
         leaf node has a * at the end.
     - For each internal node (i), its child nodes are indented by 1 level
         with ids (2i+1) for True node and (2i+2) for False node.
     - Number of rows and average response value inside []. For a leaf node, this is the prediction.
     -------------------------------------
      (0)[1 1 1 1]  city in {NY}
        (1)[0 0 1 1]  weather in {rainy}
           (3)[0 0 0 1]  * --> "d"
           (4)[0 0 1 0]  * --> "c"
        (2)[1 1 0 0]  country in {US}
           (5)[0 1 0 0]  * --> "b"
           (6)[1 0 0 0]  * --> "a"
     -------------------------------------
    (1 row)
    

Literature
[1] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

Related Topics

File decision_tree.sql_in documenting the training function

Random Forest