MADlib  1.4.1
User Documentation
Warning
This MADlib method is still in early-stage development. There may be some issues that will be addressed in a future version. The interface and implementation are subject to change.

This module provides an implementation of the C4.5 algorithm to grow decision trees.

The implementation supports:

- multiple split criteria (information gain, gain ratio, and gini index)
- continuous and discrete features
- missing-value handling
- tree pruning against a validation set
- classification, scoring, and display of the trained model

Training Function

Run the training algorithm on the source data:

c45_train( split_criterion, 
           training_table_name, 
           result_tree_table_name, 
           validation_table_name, 
           continuous_feature_names, 
           feature_col_names, 
           id_col_name, 
           class_col_name, 
           confidence_level, 
           how2handle_missing_value, 
           max_tree_depth, 
           node_prune_threshold, 
           node_split_threshold,
           verbosity
         )

Arguments

split_criterion
The name of the split criterion to use for tree construction. Valid values are 'infogain', 'gainratio', and 'gini'; it cannot be NULL. Information gain ('infogain') and the gini index ('gini') are biased toward multi-valued attributes. Gain ratio ('gainratio') adjusts for this bias, but it tends to prefer unbalanced splits in which one partition is much smaller than the others.
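To make the bias trade-off concrete, here is a minimal Python sketch (not MADlib's implementation) of the three criteria, scoring one candidate split of a label list into partitions:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(labels, partitions):
    """Score one candidate split: `partitions` is the list of label lists
    produced by splitting `labels` on some attribute.  Returns
    (information gain, gain ratio, gini gain)."""
    n = len(labels)
    weighted = lambda f: sum(len(p) / n * f(p) for p in partitions)
    info_gain = entropy(labels) - weighted(entropy)
    # Split information grows with the number of partitions, penalizing
    # multi-valued attributes; dividing by it yields the gain ratio.
    split_info = entropy([i for i, p in enumerate(partitions) for _ in p])
    gain_ratio = info_gain / split_info if split_info else 0.0
    gini_gain = gini(labels) - weighted(gini)
    return info_gain, gain_ratio, gini_gain
```

For the classic 14-row golf data set (9 Play, 5 Do not Play) split on outlook, this yields the textbook information gain of about 0.247 bits.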
training_table_name
The name of the table/view with the source data. The training data is expected to be of the following form:
{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ....................
    featureN SUPPORTED_DATA_TYPE,
    class    SUPPORTED_DATA_TYPE,
    ...
)
SUPPORTED_DATA_TYPE may be any of: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.
result_tree_table_name
The name of the table to contain the decision tree output. The table stores an abstract object (representing the model) used for further classification. It has the following columns:
id
tree_location
feature
probability
ebp_coeff
maxclass
scv
live
sample_size
parent_id
lmc_nid
lmc_fval
is_continuous
split_value
tid
dp_ids
validation_table_name
The name of the table/view containing the validation set used for tree pruning. The default is NULL, in which case no tree pruning is performed.
continuous_feature_names
A comma-separated list of the names of the features whose values are continuous. The default is NULL, meaning there are no continuous features in the training table.
feature_col_names
A comma-separated list of the names of table columns, each of which defines a feature. The default is NULL, meaning all columns in the training table except those named 'id' and 'class' are used as features.
id_col_name
The name of the column containing an ID for each record.
class_col_name
The name of the column containing the labeled class.
confidence_level
The confidence level used when computing a statistical confidence interval of the resubstitution error for pruning.
how2handle_missing_value
The strategy for handling missing values. Valid values are 'explicit' and 'ignore'.
max_tree_depth
The maximum number of levels in the result decision tree, used to avoid overgrown trees.
node_prune_threshold
The minimum percentage of records required in a child node, as a fraction of the training set. It cannot be NULL and must lie in [0.0, 1.0]. This threshold applies only to non-root nodes: with a value of 1, the trained tree has only one node (the root node); with a value of 0, no nodes are pruned by this parameter.
node_split_threshold
The minimum percentage of records a node must contain for a further split to be possible, as a fraction of the training set. It cannot be NULL and must lie in [0.0, 1.0]. With a value of 1, the trained tree has at most two levels, since only the root node can split; with a value of 0, trees can grow extensively.
verbosity
An integer greater than 0 means this function runs in verbose mode.
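The two thresholds above translate into absolute record counts relative to the training set size. The following Python sketch (a hypothetical helper for illustration; MADlib applies these checks internally) shows how a node's fate could be decided:

```python
def check_node(n_records, total, prune_t, split_t, is_root=False):
    """Return (keep, can_split) for a node holding n_records of `total`
    training rows, under node_prune_threshold (prune_t) and
    node_split_threshold (split_t).

    - a non-root node is kept only if it holds at least prune_t * total rows
    - a kept node may be split further only if it holds at least
      split_t * total rows
    """
    keep = is_root or n_records >= prune_t * total
    can_split = keep and n_records >= split_t * total
    return keep, can_split
```

For example, with the 14-row golf data and both thresholds at 0.001 (as in the example below), even a 2-record node is kept and may split; with node_split_threshold = 1.0, only the root node can split, so the tree has at most two levels.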

Classification Function

The classification function uses the learned model stored by the training function to create the classification results.

c45_classify( tree_table_name, 
              classification_table_name, 
              result_table_name
            )

Arguments

tree_table_name
The name of the table containing the trained model.
classification_table_name
The name of the table containing the data to classify. The data is expected to be in the same form as the training data, except that it does not need a class column.

result_table_name
The name of the output table.

Scoring Function
The scoring function scores the learned model against a validation data set.
c45_score( tree_table_name, 
           validation_table_name, 
           verbosity 
         );

This gives a ratio of correctly classified items in the validation set.
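In other words, the score is simply the number of correct predictions divided by the number of validation rows, as this small Python sketch (illustrative, not MADlib code) shows:

```python
def c45_score_sketch(predicted, actual):
    """Fraction of validation rows whose predicted class matches the
    labeled class -- the ratio that c45_score reports."""
    assert len(predicted) == len(actual) and actual, "non-empty, same length"
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

A score of 1, as in the example below, means every validation row was classified correctly.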

Display Tree Function

The display tree function displays the learned model in a human-readable format.

c45_display( tree_table_name
           );

Clean Tree Function

The clean tree function cleans up the learned model and all metadata.

  c45_clean( tree_table_name
           );

Examples
  1. Prepare an input table.
    SELECT * FROM golf_data ORDER BY id;
    
    Result:
     id | outlook  | temperature | humidity | windy  |    class     
     ---+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  2. Train the decision tree model. Run the c45_clean() function first to clean up any model and metadata from previous executions.
    SELECT * FROM madlib.c45_clean( 'trained_tree_infogain'
                                  );
    SELECT * FROM madlib.c45_train( 'infogain',
                                    'golf_data',                            
                                    'trained_tree_infogain',                
                                    null,                                   
                                    'temperature,humidity',                 
                                    'outlook,temperature,humidity,windy',
                                    'id',                                   
                                    'class',                                
                                    100,                                    
                                    'explicit',                             
                                    5,                                      
                                    0.001,                                  
                                    0.001,                                  
                                    0
                                  );
    
    Result:
                                        
     training_set_size | tree_nodes | tree_depth |  training_time  | split_criterion 
     ------------------+------------+------------+-----------------+-----------------
                    14 |          8 |          3 | 00:00:00.871805 | infogain
    (1 row)
    
  3. View the tree model table.
    SELECT * FROM trained_tree_infogain ORDER BY id;
    
    Result:
     id | tree_location | feature |    probability    | ebp_coeff | maxclass |       scv         | live |sample_size | parent_id | lmc_nid | lmc_fval | is_continuous   | split_value 
     ---+---------------+---------+-------------------+-----------+----------+-------------------+------+----------+-----------+---------+----------+-----------------+-------------
      1 | {0}           |       3 | 0.642857142857143 |         1 |        2 | 0.171033941880327 |    0 |       14 |         0 |       2 |        1 | f               |            
      2 | {0,1}         |       4 |                 1 |         1 |        2 |                 0 |    0 |        4 |         1 |         |          | f               |            
      3 | {0,2}         |       4 |               0.6 |         1 |        2 | 0.673011667009257 |    0 |        5 |         1 |       5 |        1 | f               |            
      4 | {0,3}         |       2 |               0.6 |         1 |        1 | 0.673011667009257 |    0 |        5 |         1 |       7 |        1 | t               |          70
      5 | {0,2,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |        3 |         3 |         |          | f               |            
      6 | {0,2,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |        2 |         3 |         |          | f               |            
      7 | {0,3,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |        2 |         4 |         |          | f               |            
      8 | {0,3,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |        3 |         4 |         |          | f               |            
    (8 rows)
    
  4. Display the tree in a human-readable format:
    SELECT madlib.c45_display('trained_tree_infogain');
    
    Result:
                                          c45_display                                      
     --------------------------------------------------------------------------------------
    Tree 1
        Root Node  : class(  Play)   num_elements(14)  predict_prob(0.642857142857143)          
             outlook:  = overcast : class( Play)   num_elements(4)  predict_prob(1)        
             outlook:  = rain : class( Play)   num_elements(5)  predict_prob(0.6)          
                 windy:  =  false : class( Play)   num_elements(3)  predict_prob(1)        
                 windy:  = true  : class(  Do not Play)   num_elements(2)  predict_prob(1)       
             outlook:  =  sunny      : class(  Do not Play)   num_elements(5)  predict_prob(0.6) 
                 humidity:  <= 70 : class( Play)   num_elements(2)  predict_prob(1)      
                 humidity:  > 70  : class(  Do not Play)   num_elements(3)  predict_prob(1)      
    (1 row)
    
  5. Classify some data with the learned model.
    SELECT * FROM madlib.c45_classify ( 'trained_tree_infogain',
                                        'golf_data', 
                                        'classification_result'
                                      );
    
    Result:
     input_set_size | classification_time 
    ----------------+---------------------
                 14 | 00:00:00.247713
    (1 row)
    
  6. Check the classification results.
    SELECT t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class 
    FROM   classification_result c, golf_data t 
    WHERE  t.id = c.id ORDER BY id;
    
    Result:
     id | outlook  | temperature | humidity | windy  |    class     
     ---+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  7. Score the data against a validation set.
    SELECT * FROM madlib.c45_score( 'trained_tree_infogain', 
                                    'golf_data_validation', 
                                    0
                                  );
    
    Result:
     c45_score 
     ----------
          1
    (1 row)
    
  8. Clean up the tree and metadata.
     
    SELECT madlib.c45_clean( 'trained_tree_infogain'
                           );
    
    Result:
     c45_clean 
     ----------
     
    (1 row)
    

Literature

[1] http://en.wikipedia.org/wiki/C4.5_algorithm

Related Topics
File c45.sql_in documenting the SQL functions.