User Documentation
Warning
This MADlib method is still in early-stage development. Some issues remain and will be addressed in a future version. The interface and implementation are subject to change.
About:

This module provides an implementation of the C4.5 algorithm for growing decision trees.

The implementation supports both categorical (discrete) and continuous features.

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ....................
    featureN SUPPORTED_DATA_TYPE,
    class    SUPPORTED_DATA_TYPE,
    ...
)

SUPPORTED_DATA_TYPE is one of: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.

The data to classify is expected to have the same form as the training data, except that it does not need the class column.

Usage:
  • Run the training algorithm on the source data:
    SELECT * FROM c45_train(
        'split_criterion',
        'training_table_name', 
        'result_tree_table_name', 
        'validation_table_name',
        'continuous_feature_names',
        'feature_col_names',
        'id_col_name', 
        'class_col_name',
        'confidence_level',
        'how2handle_missing_value',
        'max_tree_depth',
        'node_prune_threshold',
        'node_split_threshold',
        'verbosity');
      
    This will create the decision tree output table, which stores the model object used for subsequent classification. Its columns are:
        
     id | tree_location | feature |    probability    |    ebp_coeff     | maxclass |        scv        | live | sample_size | parent_id | lmc_nid | lmc_fval |  is_continuous  | split_value | tid | dp_ids 
    ----+---------------+---------+-------------------+------------------+----------+-------------------+------+-------------+-----------+---------+----------+-----------------+-------------+-----+--------
                                                         ...
  • Run the classification function using the learned model:
    SELECT * FROM c45_classify(
        'tree_table_name', 
        'classification_table_name', 
        'result_table_name');
    This will create the result_table with the classification results.
     
  • Run the scoring function to score the learned model against a validation data set:
    SELECT * FROM c45_score(
        'tree_table_name',
        'validation_table_name',
        'verbosity');
    This will return the ratio of correctly classified items in the validation set.
     
  • Run the display tree function using the learned model:
    SELECT * FROM c45_display(
        'tree_table_name');
    This will display the trained tree in a human-readable format.
     
  • Run the clean tree function as below:
    SELECT * FROM c45_clean(
        'tree_table_name');
    This will clean up the learned model and all metadata.
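The 'infogain' split criterion used in the examples below picks, at each node, the feature whose partition most reduces the class entropy of the node's rows. A minimal sketch of that computation in plain Python (the function names are illustrative, not part of the MADlib API):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction from partitioning rows by one categorical feature."""
    n = len(labels)
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# The golf data used below: 9 "Play" vs 5 "Do not Play" rows.
outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast',
           'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain']
played = [False, False, True, True, True, False, True,
          False, True, True, True, True, True, False]
gain = information_gain(outlook, played)  # roughly 0.247 bits
```

C4.5 evaluates this gain for every candidate feature (and, for continuous features, over candidate <= thresholds) and splits on the best one.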
     
Examples:
  1. Prepare an input table/view, e.g.:
    sql> select * from golf_data order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
  2. Train the decision tree model, e.g.:
    sql> SELECT * FROM MADlib.c45_clean('trained_tree_infogain');
    sql> SELECT * FROM MADlib.c45_train(
           'infogain',                              -- split criterion name
           'golf_data',                             -- input table name
           'trained_tree_infogain',                 -- result tree name
           null,                                    -- validation table name
           'temperature,humidity',                  -- continuous feature names
           'outlook,temperature,humidity,windy',    -- feature column names
           'id',                                    -- id column name
           'class',                                 -- class column name
           100,                                     -- confidence level
           'explicit',                              -- missing value preparation
           5,                                       -- max tree depth
           0.001,                                   -- node prune threshold
           0.001,                                   -- node split threshold
           0);                                      -- verbosity
     training_set_size | tree_nodes | tree_depth |  training_time  | split_criterion 
    -------------------+------------+------------+-----------------+-----------------
                    14 |          8 |          3 | 00:00:00.871805 | infogain
    (1 row)
    
  3. Check a few rows of the tree model table:
    sql> select * from trained_tree_infogain order by id;
     id | tree_location | feature |    probability    | ebp_coeff | maxclass |       scv         | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous   | split_value 
    ----+---------------+---------+-------------------+-----------+----------+-------------------+------+-------------+-----------+---------+----------+-----------------+-------------
      1 | {0}           |       3 | 0.642857142857143 |         1 |        2 | 0.171033941880327 |    0 |          14 |         0 |       2 |        1 | f               |
      2 | {0,1}         |       4 |                 1 |         1 |        2 |                 0 |    0 |           4 |         1 |         |          | f               |
      3 | {0,2}         |       4 |               0.6 |         1 |        2 | 0.673011667009257 |    0 |           5 |         1 |       5 |        1 | f               |
      4 | {0,3}         |       2 |               0.6 |         1 |        1 | 0.673011667009257 |    0 |           5 |         1 |       7 |        1 | t               |          70
      5 | {0,2,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |           3 |         3 |         |          | f               |
      6 | {0,2,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |           2 |         3 |         |          | f               |
      7 | {0,3,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |           2 |         4 |         |          | f               |
      8 | {0,3,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |           3 |         4 |         |          | f               |
    (8 rows)
  4. Display the tree in a human-readable format:
    sql> select MADlib.c45_display('trained_tree_infogain');
                                          c45_display                                      
    ---------------------------------------------------------------------------------------
    Tree 1
        Root Node  : class(  Play)   num_elements(14)  predict_prob(0.642857142857143)          
             outlook:  = overcast : class( Play)   num_elements(4)  predict_prob(1)        
             outlook:  = rain : class( Play)   num_elements(5)  predict_prob(0.6)          
                 windy:  =  false : class( Play)   num_elements(3)  predict_prob(1)        
                 windy:  = true  : class(  Do not Play)   num_elements(2)  predict_prob(1)       
             outlook:  =  sunny      : class(  Do not Play)   num_elements(5)  predict_prob(0.6) 
                 humidity:  <= 70 : class( Play)   num_elements(2)  predict_prob(1)      
                 humidity:  > 70  : class(  Do not Play)   num_elements(3)  predict_prob(1)      
    (1 row)
  5. Classify data with the learned model:
    sql> select * from MADlib.c45_classify(
             'trained_tree_infogain',  -- name of the trained model
             'golf_data',              -- name of the table containing data to classify
             'classification_result'); -- name of the output table
     input_set_size | classification_time 
    ----------------+---------------------
                 14 | 00:00:00.247713
    (1 row)
    
  6. Check the classification results:
    sql> select t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class
         from classification_result c, golf_data t where t.id = c.id order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  7. Score the data against a validation set:
    sql> select * from MADlib.c45_score(
            'trained_tree_infogain',
            'golf_data_validation',
            0);
     c45_score 
    -----------
          1
    (1 row)
    
  8. Clean up the tree and metadata:
    sql> select MADlib.c45_clean('trained_tree_infogain');
     c45_clean 
    -----------
     
    (1 row)
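The model table from example 3 can be decoded by hand. The sketch below (plain Python) transcribes its eight rows and walks them the way c45_classify plausibly does; the child-addressing rule lmc_nid + (value - lmc_fval), the feature codes (2 = humidity, 3 = outlook, 4 = windy), and the category codes (overcast = 1, rain = 2, sunny = 3; windy false = 1, true = 2) are all inferred from the example output, not taken from the MADlib source:

```python
# Nodes keyed by id; fields copied from the trained_tree_infogain rows in example 3.
nodes = {
    1: dict(feature=3, lmc_nid=2, lmc_fval=1, is_continuous=False, split_value=None, maxclass=2),
    2: dict(feature=4, lmc_nid=None, lmc_fval=None, is_continuous=False, split_value=None, maxclass=2),
    3: dict(feature=4, lmc_nid=5, lmc_fval=1, is_continuous=False, split_value=None, maxclass=2),
    4: dict(feature=2, lmc_nid=7, lmc_fval=1, is_continuous=True, split_value=70, maxclass=1),
    5: dict(feature=4, lmc_nid=None, lmc_fval=None, is_continuous=False, split_value=None, maxclass=2),
    6: dict(feature=4, lmc_nid=None, lmc_fval=None, is_continuous=False, split_value=None, maxclass=1),
    7: dict(feature=4, lmc_nid=None, lmc_fval=None, is_continuous=False, split_value=None, maxclass=2),
    8: dict(feature=4, lmc_nid=None, lmc_fval=None, is_continuous=False, split_value=None, maxclass=1),
}
classes = {1: 'Do not Play', 2: 'Play'}

def classify(row):
    """Descend from the root (id 1) until a leaf (a node with no lmc_nid)."""
    nid = 1
    while nodes[nid]['lmc_nid'] is not None:
        node = nodes[nid]
        if node['is_continuous']:
            # left child if value <= split_value, next sibling otherwise
            nid = node['lmc_nid'] + (0 if row[node['feature']] <= node['split_value'] else 1)
        else:
            nid = node['lmc_nid'] + (row[node['feature']] - node['lmc_fval'])
    return classes[nodes[nid]['maxclass']]

# Row 1 of golf_data (outlook = sunny, humidity = 85) -> "Do not Play",
# matching example 6; row 4 (outlook = rain, windy = false) -> "Play".
print(classify({3: 3, 2: 85}))
print(classify({3: 2, 4: 1}))
```

Note also that the root's probability column, 0.642857142857143, is just the majority-class fraction 9/14 ("Play") visible in the training data of example 1.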
    
Literature:

[1] http://en.wikipedia.org/wiki/C4.5_algorithm

See Also
File c45.sql_in documenting the SQL functions.