MADlib  1.4.1
User Documentation
Warning
This MADlib method is still in early-stage development. There may be some issues that will be addressed in a future version. The interface and implementation are subject to change.

This module provides an implementation of the C4.5 algorithm to grow decision trees.

The implementation supports:

- multiple split criteria (information gain, gain ratio, and gini index)
- continuous and discrete features
- missing-value handling
- tree pruning against a validation set
- classification, scoring, and display of the trained model

Training Function

Run the training algorithm on the source data:

c45_train( split_criterion, 
           training_table_name, 
           result_tree_table_name, 
           validation_table_name, 
           continuous_feature_names, 
           feature_col_names, 
           id_col_name, 
           class_col_name, 
           confidence_level, 
           how2handle_missing_value, 
           max_tree_depth, 
           node_prune_threshold, 
           node_split_threshold,
           verbosity
         )

Arguments

split_criterion
The name of the split criterion to use for tree construction. Valid values are 'infogain', 'gainratio', and 'gini'; it cannot be NULL. Information gain ('infogain') and the gini index ('gini') are biased toward multi-valued attributes. Gain ratio ('gainratio') adjusts for this bias, but it tends to prefer unbalanced splits in which one partition is much smaller than the others.
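To make the bias trade-off concrete, here is a minimal Python sketch (not MADlib's implementation) of the three criteria, scoring one candidate split of a label list into partitions:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(labels, partitions):
    """Score one candidate split: `partitions` is the list of label lists
    produced by splitting `labels` on some attribute.  Returns
    (information gain, gain ratio, gini gain)."""
    n = len(labels)
    weighted = lambda f: sum(len(p) / n * f(p) for p in partitions)
    info_gain = entropy(labels) - weighted(entropy)
    # Split information grows with the number of partitions, penalizing
    # multi-valued attributes; dividing by it yields the gain ratio.
    split_info = entropy([i for i, p in enumerate(partitions) for _ in p])
    gain_ratio = info_gain / split_info if split_info else 0.0
    gini_gain = gini(labels) - weighted(gini)
    return info_gain, gain_ratio, gini_gain
```

For the classic 14-row golf data set (9 Play, 5 Do not Play) split on outlook, this yields the textbook information gain of about 0.247 bits.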
training_table_name
The name of the table/view with the source data. The training data is expected to be of the following form:
{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ....................
    featureN SUPPORTED_DATA_TYPE,
    class    SUPPORTED_DATA_TYPE,
    ...
)
SUPPORTED_DATA_TYPE may be any of: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.
result_tree_table_name
The name of the table to contain the decision tree output. The table stores an abstract object (representing the model) used for further classification. It has the following columns:
id
tree_location
feature
probability
ebp_coeff
maxclass
scv
live
sample_size
parent_id
lmc_nid
lmc_fval
is_continuous
split_value
tid
dp_ids
validation_table_name
The name of the table/view containing the validation set used for tree pruning. The default is NULL, in which case no tree pruning is performed.
continuous_feature_names
A comma-separated list of the names of the features whose values are continuous. The default is NULL, meaning there are no continuous features in the training table.
feature_col_names
A comma-separated list of the names of table columns, each of which defines a feature. The default is NULL, meaning all columns in the training table except those named 'id' and 'class' are used as features.
id_col_name
The name of the column containing an ID for each record.
class_col_name
The name of the column containing the labeled class.
confidence_level
The confidence level used when computing a statistical confidence interval of the resubstitution error for pruning.
how2handle_missing_value
The strategy for handling missing values. Valid values are 'explicit' and 'ignore'.
max_tree_depth
The maximum number of levels in the result decision tree, used to avoid overgrown trees.
node_prune_threshold
The minimum percentage of records required in a child node, as a fraction of the training set. It cannot be NULL and must lie in [0.0, 1.0]. This threshold applies only to non-root nodes: with a value of 1, the trained tree has only one node (the root node); with a value of 0, no nodes are pruned by this parameter.
node_split_threshold
The minimum percentage of records a node must contain for a further split to be possible, as a fraction of the training set. It cannot be NULL and must lie in [0.0, 1.0]. With a value of 1, the trained tree has at most two levels, since only the root node can split; with a value of 0, trees can grow extensively.
verbosity
An integer greater than 0 means this function runs in verbose mode.
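The two thresholds above translate into absolute record counts relative to the training set size. The following Python sketch (a hypothetical helper for illustration; MADlib applies these checks internally) shows how a node's fate could be decided:

```python
def check_node(n_records, total, prune_t, split_t, is_root=False):
    """Return (keep, can_split) for a node holding n_records of `total`
    training rows, under node_prune_threshold (prune_t) and
    node_split_threshold (split_t).

    - a non-root node is kept only if it holds at least prune_t * total rows
    - a kept node may be split further only if it holds at least
      split_t * total rows
    """
    keep = is_root or n_records >= prune_t * total
    can_split = keep and n_records >= split_t * total
    return keep, can_split
```

For example, with the 14-row golf data and both thresholds at 0.001 (as in the example below), even a 2-record node is kept and may split; with node_split_threshold = 1.0, only the root node can split, so the tree has at most two levels.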

Classification Function

The classification function uses the learned model stored by the training function to create the classification results.

c45_classify( tree_table_name, 
              classification_table_name, 
              result_table_name
            )

Arguments

tree_table_name
The name of the table containing the trained model.
classification_table_name
The name of the table containing the data to classify. The data is expected to be in the same form as the training data, except that it does not need a class column.

result_table_name
The name of the output table.

Scoring Function
The scoring function scores the learned model against a validation data set.
c45_score( tree_table_name, 
           validation_table_name, 
           verbosity 
         );

This gives a ratio of correctly classified items in the validation set.
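In other words, the score is simply the number of correct predictions divided by the number of validation rows, as this small Python sketch (illustrative, not MADlib code) shows:

```python
def c45_score_sketch(predicted, actual):
    """Fraction of validation rows whose predicted class matches the
    labeled class -- the ratio that c45_score reports."""
    assert len(predicted) == len(actual) and actual, "non-empty, same length"
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

A score of 1, as in the example below, means every validation row was classified correctly.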

Display Tree Function

The display tree function displays the learned model in a human-readable format.

c45_display( tree_table_name
           );

Clean Tree Function

The clean tree function cleans up the learned model and all metadata.

  c45_clean( tree_table_name
           );

Examples
  1. Prepare an input table.
    SELECT * FROM golf_data ORDER BY id;
    
    Result:
     id | outlook  | temperature | humidity | windy  |    class     
     ---+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  2. Train the decision tree model. Run the c45_clean() function first to clean up any model and metadata from previous executions.
    SELECT * FROM madlib.c45_clean( 'trained_tree_infogain'
                                  );
    SELECT * FROM madlib.c45_train( 'infogain',
                                    'golf_data',                            
                                    'trained_tree_infogain',                
                                    null,                                   
                                    'temperature,humidity',                 
                                    'outlook,temperature,humidity,windy',
                                    'id',                                   
                                    'class',                                
                                    100,                                    
                                    'explicit',                             
                                    5,                                      
                                    0.001,                                  
                                    0.001,                                  
                                    0
                                  );
    
    Result:
                                        
     training_set_size | tree_nodes | tree_depth |  training_time  | split_criterion 
     ------------------+------------+------------+-----------------+-----------------
                    14 |          8 |          3 | 00:00:00.871805 | infogain
    (1 row)
    
  3. View the tree model table.
    SELECT * FROM trained_tree_infogain ORDER BY id;
    
    Result:
     id | tree_location | feature |    probability    | ebp_coeff | maxclass |       scv         | live |sample_size | parent_id | lmc_nid | lmc_fval | is_continuous   | split_value 
     ---+---------------+---------+-------------------+-----------+----------+-------------------+------+----------+-----------+---------+----------+-----------------+-------------
      1 | {0}           |       3 | 0.642857142857143 |         1 |        2 | 0.171033941880327 |    0 |       14 |         0 |       2 |        1 | f               |            
      2 | {0,1}         |       4 |                 1 |         1 |        2 |                 0 |    0 |        4 |         1 |         |          | f               |            
      3 | {0,2}         |       4 |               0.6 |         1 |        2 | 0.673011667009257 |    0 |        5 |         1 |       5 |        1 | f               |            
      4 | {0,3}         |       2 |               0.6 |         1 |        1 | 0.673011667009257 |    0 |        5 |         1 |       7 |        1 | t               |          70
      5 | {0,2,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |        3 |         3 |         |          | f               |            
      6 | {0,2,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |        2 |         3 |         |          | f               |            
      7 | {0,3,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |        2 |         4 |         |          | f               |            
      8 | {0,3,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |        3 |         4 |         |          | f               |            
    (8 rows)
    
  4. Display the tree in a human-readable format:
    SELECT madlib.c45_display('trained_tree_infogain');
    
    Result:
                                          c45_display                                      
     --------------------------------------------------------------------------------------
    Tree 1
        Root Node  : class(  Play)   num_elements(14)  predict_prob(0.642857142857143)          
             outlook:  = overcast : class( Play)   num_elements(4)  predict_prob(1)        
             outlook:  = rain : class( Play)   num_elements(5)  predict_prob(0.6)          
                 windy:  =  false : class( Play)   num_elements(3)  predict_prob(1)        
                 windy:  = true  : class(  Do not Play)   num_elements(2)  predict_prob(1)       
             outlook:  =  sunny      : class(  Do not Play)   num_elements(5)  predict_prob(0.6) 
                 humidity:  <= 70 : class( Play)   num_elements(2)  predict_prob(1)      
                 humidity:  > 70  : class(  Do not Play)   num_elements(3)  predict_prob(1)      
    (1 row)
    
  5. Classify some data with the learned model.
    SELECT * FROM madlib.c45_classify ( 'trained_tree_infogain',
                                        'golf_data', 
                                        'classification_result'
                                      );
    
    Result:
     input_set_size | classification_time 
    ----------------+---------------------
                 14 | 00:00:00.247713
    (1 row)
    
  6. Check the classification results.
    SELECT t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class 
    FROM   classification_result c, golf_data t 
    WHERE  t.id = c.id ORDER BY id;
    
    Result:
     id | outlook  | temperature | humidity | windy  |    class     
     ---+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  7. Score the data against a validation set.
    SELECT * FROM madlib.c45_score( 'trained_tree_infogain', 
                                    'golf_data_validation', 
                                    0
                                  );
    
    Result:
     c45_score 
     ----------
          1
    (1 row)
    
  8. Clean up the tree and metadata.
     
    SELECT madlib.c45_clean( 'trained_tree_infogain'
                           );
    
    Result:
     c45_clean 
     ----------
     
    (1 row)
    

Literature

[1] http://en.wikipedia.org/wiki/C4.5_algorithm

Related Topics
File c45.sql_in documenting the SQL functions.