User Documentation
Decision Tree
About:

This module provides an implementation of the C4.5 algorithm for growing decision trees.

The implementation supports a choice of split criteria (e.g. 'infogain'), both continuous and discrete features, handling of missing values, a confidence level for pruning, limits on tree depth and minimum split size, classification of new data, and scoring against a validation table.

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ....................
    featureN SUPPORTED_DATA_TYPE,
    class    SUPPORTED_DATA_TYPE,
    ...
)

SUPPORTED_DATA_TYPE may be any of: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, or INTERVAL.

The data to classify is expected to be of the same form as the training data, except that it does not need a class column.

Usage:
Examples:
  1. Prepare an input table/view, e.g.:
    sql> select * from golf_data order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
    
  2. Train the decision tree model, e.g.:
    sql> SELECT * FROM MADlib.c45_clean('trained_tree_infogain');
    sql> SELECT * FROM MADlib.c45_train(
           'infogain',                             -- split criterion name
           'golf_data',                            -- input table name
           'trained_tree_infogain',                -- result tree name
           null,                                   -- validation table name
           'temperature,humidity',                 -- continuous feature names
           'outlook,temperature,humidity,windy',   -- feature column names
           'id',                                   -- id column name
           'class',                                -- class column name
           100,                                    -- confidence level
           'explicit',                             -- missing value preparation
           5,                                      -- max tree depth
           0.001,                                  -- min percent mode
           0.001,                                  -- min percent split
           0);                                     -- verbosity
     training_set_size | tree_nodes | tree_depth |  training_time  | split_criterion 
    -------------------+------------+------------+-----------------+-----------------
                    14 |          8 |          3 | 00:00:00.871805 | infogain
    (1 row)
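The 'infogain' split criterion is standard C4.5 information gain. As a sanity check, here is a minimal Python sketch (independent of MADlib) of the gain of the outlook column over the 14 golf rows from step 1; the numbers are textbook C4.5 arithmetic, not MADlib output:

```python
from collections import Counter
from math import log2

# (outlook, class) pairs for the 14 golf rows shown in step 1.
rows = [
    ("sunny", "Do not Play"), ("sunny", "Do not Play"), ("overcast", "Play"),
    ("rain", "Play"),         ("rain", "Play"),         ("rain", "Do not Play"),
    ("overcast", "Play"),     ("sunny", "Do not Play"), ("sunny", "Play"),
    ("rain", "Play"),         ("sunny", "Play"),        ("overcast", "Play"),
    ("overcast", "Play"),     ("rain", "Do not Play"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values())

base = entropy([c for _, c in rows])          # H(class), about 0.940 bits

# Expected entropy remaining after partitioning the rows by outlook.
parts = {}
for v, c in rows:
    parts.setdefault(v, []).append(c)
remainder = sum(len(ls) / len(rows) * entropy(ls) for ls in parts.values())

gain = base - remainder                       # about 0.247 bits
print(f"infogain(outlook) = {gain:.3f}")
```
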
    
  3. Check a few rows of the tree model table:
    sql> select * from trained_tree_infogain order by id;
     id | tree_location | feature |    probability    | ebp_coeff | maxclass |        scv        | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value 
    ----+---------------+---------+-------------------+-----------+----------+-------------------+------+-------------+-----------+---------+----------+---------------+-------------
      1 | {0}           |       3 | 0.642857142857143 |         1 |        2 | 0.171033941880327 |    0 |          14 |         0 |       2 |        1 | f             |            
      2 | {0,1}         |       4 |                 1 |         1 |        2 |                 0 |    0 |           4 |         1 |         |          | f             |            
      3 | {0,2}         |       4 |               0.6 |         1 |        2 | 0.673011667009257 |    0 |           5 |         1 |       5 |        1 | f             |            
      4 | {0,3}         |       2 |               0.6 |         1 |        1 | 0.673011667009257 |    0 |           5 |         1 |       7 |        1 | t             |          70
      5 | {0,2,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |           3 |         3 |         |          | f             |            
      6 | {0,2,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |           2 |         3 |         |          | f             |            
      7 | {0,3,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |           2 |         4 |         |          | f             |            
      8 | {0,3,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |           3 |         4 |         |          | f             |            
    (8 rows)
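The model table encodes the tree shape. Reading tree_location as the root-to-node path and parent_id as the parent's node id (an interpretation inferred from the values shown, not from MADlib internals), the tree_nodes = 8 and tree_depth = 3 reported by c45_train can be recovered in a few lines of Python:

```python
# (id, tree_location, parent_id) for the eight rows of the model table.
nodes = [
    (1, "{0}",     0), (2, "{0,1}",   1), (3, "{0,2}",   1), (4, "{0,3}",   1),
    (5, "{0,2,1}", 3), (6, "{0,2,2}", 3), (7, "{0,3,1}", 4), (8, "{0,3,2}", 4),
]

# Depth of a node = number of entries in its root-to-node path.
tree_depth = max(loc.count(",") + 1 for _, loc, _ in nodes)

# Leaves are nodes never referenced as another node's parent
# (they are also the rows with an empty lmc_nid above).
parent_ids = {pid for _, _, pid in nodes}
leaves = sorted(nid for nid, _, _ in nodes if nid not in parent_ids)

print(len(nodes), tree_depth, leaves)  # 8 nodes, depth 3, five leaves
```
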
    
    
  4. To display the tree in a human-readable format:
    sql> select MADlib.c45_display('trained_tree_infogain');
                                          c45_display                                      
    ---------------------------------------------------------------------------------------
    Tree 1
        Root Node  : class(  Play)   num_elements(14)  predict_prob(0.642857142857143)          
             outlook:  = overcast : class( Play)   num_elements(4)  predict_prob(1)        
             outlook:  = rain : class( Play)   num_elements(5)  predict_prob(0.6)          
                 windy:  =  false : class( Play)   num_elements(3)  predict_prob(1)        
                 windy:  = true  : class(  Do not Play)   num_elements(2)  predict_prob(1)       
             outlook:  =  sunny      : class(  Do not Play)   num_elements(5)  predict_prob(0.6) 
                 humidity:  <= 70 : class( Play)   num_elements(2)  predict_prob(1)      
                 humidity:  > 70  : class(  Do not Play)   num_elements(3)  predict_prob(1)      
    (1 row)
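The displayed tree can be transcribed into an ordinary function, which is a convenient way to check the model against the rules it prints. This is a hand-written sketch mirroring the display above, not how c45_classify works internally:

```python
def classify(outlook, humidity, windy):
    """The printed rules, transcribed by hand (temperature is unused)."""
    if outlook == "overcast":
        return "Play"
    if outlook == "rain":
        return "Do not Play" if windy else "Play"
    # outlook == "sunny"
    return "Play" if humidity <= 70 else "Do not Play"

# (outlook, humidity, windy, class) for the 14 rows from step 1.
rows = [
    ("sunny",    85, False, "Do not Play"), ("sunny",    90, True,  "Do not Play"),
    ("overcast", 78, False, "Play"),        ("rain",     96, False, "Play"),
    ("rain",     80, False, "Play"),        ("rain",     70, True,  "Do not Play"),
    ("overcast", 65, True,  "Play"),        ("sunny",    95, False, "Do not Play"),
    ("sunny",    70, False, "Play"),        ("rain",     80, False, "Play"),
    ("sunny",    70, True,  "Play"),        ("overcast", 90, True,  "Play"),
    ("overcast", 75, False, "Play"),        ("rain",     80, True,  "Do not Play"),
]

matches = sum(classify(o, h, w) == cls for o, h, w, cls in rows)
print(f"{matches} of {len(rows)} rows match")
```

The transcribed rules reproduce the class column for all 14 training rows, which is why the classification results in step 6 are identical to the input table.
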
    
    
  5. To classify data with the learned model:
    sql> select * from MADlib.c45_classify(
             'trained_tree_infogain',  -- name of the trained model
             'golf_data',              -- name of the table containing data to classify
             'classification_result'); -- name of the output table
     input_set_size | classification_time 
    ----------------+---------------------
                 14 | 00:00:00.247713
    (1 row)
    
  6. Check classification results:
    sql> select t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class
         from classification_result c, golf_data t where t.id = c.id order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  7. Score the data against a validation set:
    sql> select * from MADlib.c45_score(
            'trained_tree_infogain',
            'golf_data_validation',
            0);
     c45_score 
    -----------
          1
    (1 row)
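Assuming c45_score reports classification accuracy (an assumption consistent with the value 1 here and the perfect fit seen in step 6), the metric itself is just the fraction of rows whose predicted class matches the actual class:

```python
def c45_score_sketch(predicted, actual):
    """Hypothetical sketch of the score: fraction of correctly
    classified rows, 1.0 when every prediction matches."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Every step-6 prediction agreed with its label, so the score is 1.
print(c45_score_sketch(["Play", "Do not Play"], ["Play", "Do not Play"]))
```
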
    
  8. Clean up the tree model and metadata:
    sql> select MADlib.c45_clean('trained_tree_infogain');
     c45_clean 
    -----------
     
    (1 row)
    
Literature:

[1] http://en.wikipedia.org/wiki/C4.5_algorithm

See also:
File c45.sql_in documenting the SQL functions.