User Documentation
 All Files Functions Groups
+ Collaboration diagram for Random Forest:
Warning
This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. Interface and implementation is subject to change.
About:
A random forest (RF) is an ensemble classifier that consists of many decision trees and outputs the class that is voted by the majority of the individual trees.

It has the following well-known advantages:

This module provides an implementation of the random forest algorithm described in [1].

The implementation supports:

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ....................
    featureN SUPPORTED_DATA_TYPE,
    class    SUPPORTED_DATA_TYPE,
    ...
)

The detailed list of SUPPORTED_DATA_TYPE is: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.

The data to classify is expected to be of the same form as training data, except that it does not need a class column.

Usage:
  • Run the training algorithm on the source data:
    SELECT * FROM rf_train(
        'split_criterion',
        'training_table_name', 
        'result_rf_table_name', 
        'num_trees',
        'features_per_node',
        'sampling_percentage',
        'continuous_feature_names', 
        'feature_col_names',
        'id_col_name',
        'class_col_name'
        'how2handle_missing_value',
        'max_tree_depth',
        'node_prune_threshold',
        'node_split_threshold',
        'verbosity');
      
    This will create the decision tree output table storing an abstract object (representing the model) used for further classification. Column names:
        
     id | tree_location | feature |    probability    |    ebp_coeff     | maxclass |    split_gain     | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids 
    ----+---------------+---------+-------------------+------------------+----------+-------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+--------
                                                         ...
  • Run the classification function using the learned model:
    SELECT * FROM rf_classify(
        'rf_table_name', 
        'classification_table_name', 
        'result_table_name');
    This will create the result_table with the classification results.
     
  • Run the scoring function to score the learned model against a validation data set:
    SELECT * FROM rf_score(
        'rf_table_name',
        'validation_table_name',
        'verbosity');
    This will give a ratio of correctly classified items in the validation set.
     
  • Run the display tree function using the learned model:
    SELECT * FROM rf_display(
        'rf_table_name');
    This will display the trained trees in human readable format.
     
  • Run the clean tree function as below:
    SELECT * FROM rf_clean(
        'rf_table_name');
    This will clean up the learned model and all metadata.
     
Examples:
  1. Prepare an input table/view, e.g.:
    sql> select * from golf_data order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  2. Train the random forest, e.g.:
    sql> SELECT * FROM MADlib.rf_clean('trained_tree_infogain');
    sql> SELECT * FROM MADlib.rf_train(
           'infogain',                           -- split criterion_name
           'golf_data',                          -- input table name
           'trained_tree_infogain',              -- result tree name
           10,                                   -- number of trees
           NULL,                                 -- features_per_node
           0.632,                                -- sampling_percentage
           'temperature,humidity',               -- continuous feature names
           'outlook,temperature,humidity,windy', -- feature column names
           'id',                                 -- id column name
           'class',                              -- class column name
           'explicit',                           -- how to handle missing value
           10,                                   -- max tree depth
           0.0,                                  -- min percent mode
           0.0,                                  -- min percent split
           0                                     -- max split point
           0);                                   -- verbosity
     training_time  | num_of_samples | num_trees | features_per_node | num_tree_nodes | max_tree_depth | split_criterion |    acs_time     |    acc_time     |    olap_time    |   update_time   |    best_time    
    ----------------+--------------+-----------+-------------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------
     00:00:03.60498 |           14 |        10 |                 3 |             71 |              6 | infogain        | 00:00:00.154991 | 00:00:00.404411 | 00:00:00.736876 | 00:00:00.374084 | 00:00:01.722658
    (1 row)
    
  3. Check the table records that keep the random forest:
    sql> select * from golf_tree order by tid,id;
     id | tree_location | feature |    probability    | ebp_coeff | maxclass |     split_gain     | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids 
    ----+---------------+---------+-------------------+-----------+----------+--------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+--------
      1 | {0}           |       3 | 0.777777777777778 |         1 |        2 |  0.197530864197531 |    0 |        9 |         0 |      24 |        1 | f               |             |   1 | 
     24 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        4 |         1 |         |          | f               |             |   1 | {3}
     25 | {0,2}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        2 |         1 |         |          | f               |             |   1 | {3}
     26 | {0,3}         |       2 | 0.666666666666667 |         1 |        1 |  0.444444444444444 |    0 |        3 |         1 |      42 |        1 | t               |          70 |   1 | {3}
     42 | {0,3,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        26 |         |          | f               |             |   1 | 
     43 | {0,3,2}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        2 |        26 |         |          | f               |             |   1 | 
      2 | {0}           |       2 | 0.555555555555556 |         1 |        1 |   0.17636684303351 |    0 |        9 |         0 |      11 |        1 | t               |          65 |   2 | 
     11 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        2 |         2 |         |          | f               |             |   2 | 
     12 | {0,2}         |       4 | 0.714285714285714 |         1 |        1 |  0.217687074829932 |    0 |        7 |         2 |      44 |        1 | f               |             |   2 | 
     44 | {0,2,1}       |       3 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |        12 |      57 |        1 | f               |             |   2 | {4}
     45 | {0,2,2}       |       3 |                 1 |         1 |        1 |                  0 |    0 |        4 |        12 |         |          | f               |             |   2 | {4}
     57 | {0,2,1,1}     |       2 |                 1 |         1 |        2 |                  0 |    0 |        1 |        44 |         |          | t               |          78 |   2 | {4,3}
     58 | {0,2,1,2}     |       2 |                 1 |         1 |        2 |                  0 |    0 |        1 |        44 |         |          | t               |          96 |   2 | {4,3}
     59 | {0,2,1,3}     |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        44 |         |          | t               |          85 |   2 | {4,3}
      3 | {0}           |       2 | 0.777777777777778 |         1 |        2 |  0.197530864197531 |    0 |        9 |         0 |      27 |        1 | t               |          80 |   3 | 
     27 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        6 |         3 |         |          | f               |             |   3 | 
     28 | {0,2}         |       2 | 0.666666666666667 |         1 |        1 |  0.444444444444444 |    0 |        3 |         3 |      46 |        1 | t               |          90 |   3 | 
     46 | {0,2,1}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        2 |        28 |         |          | f               |             |   3 | 
     47 | {0,2,2}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        28 |         |          | f               |             |   3 | 
      4 | {0}           |       4 | 0.888888888888889 |         1 |        2 | 0.0493827160493827 |    0 |        9 |         0 |      13 |        1 | f               |             |   4 | 
     13 | {0,1}         |       3 |                 1 |         1 |        2 |                  0 |    0 |        6 |         4 |         |          | f               |             |   4 | {4}
     14 | {0,2}         |       3 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |         4 |      48 |        1 | f               |             |   4 | {4}
     48 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        2 |        14 |         |          | t               |          90 |   4 | {4,3}
     49 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        14 |         |          | t               |          80 |   4 | {4,3}
      5 | {0}           |       2 | 0.888888888888889 |         1 |        2 |  0.197530864197531 |    0 |        9 |         0 |      29 |        1 | t               |          90 |   5 | 
     29 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        8 |         5 |         |          | f               |             |   5 | 
     30 | {0,2}         |       3 |                 1 |         1 |        1 |                  0 |    0 |        1 |         5 |         |          | f               |             |   5 | 
      6 | {0}           |       3 | 0.555555555555556 |         1 |        2 |  0.345679012345679 |    0 |        9 |         0 |      15 |        1 | f               |             |   6 | 
     15 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        3 |         6 |         |          | f               |             |   6 | {3}
     16 | {0,2}         |       4 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |         6 |      51 |        1 | f               |             |   6 | {3}
     17 | {0,3}         |       4 |                 1 |         1 |        1 |                  0 |    0 |        3 |         6 |         |          | f               |             |   6 | {3}
     51 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        2 |        16 |         |          | t               |          96 |   6 | {3,4}
     52 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        16 |         |          | t               |          70 |   6 | {3,4}
      7 | {0}           |       4 | 0.666666666666667 |         1 |        2 |  0.253968253968254 |    0 |        9 |         0 |      31 |        1 | f               |             |   7 | 
     31 | {0,1}         |       2 | 0.857142857142857 |         1 |        2 |  0.102040816326531 |    0 |        7 |         7 |      36 |        1 | t               |          80 |   7 | {4}
     32 | {0,2}         |       3 |                 1 |         1 |        1 |                  0 |    0 |        2 |         7 |         |          | f               |             |   7 | {4}
     36 | {0,1,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        5 |        31 |         |          | f               |             |   7 | 
     37 | {0,1,2}       |       2 |               0.5 |         1 |        2 |                0.5 |    0 |        2 |        31 |      60 |        1 | t               |          95 |   7 | 
     60 | {0,1,2,1}     |       4 |                 1 |         1 |        1 |                  0 |    0 |        1 |        37 |         |          | f               |             |   7 | 
     61 | {0,1,2,2}     |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        37 |         |          | f               |             |   7 | 
      8 | {0}           |       3 | 0.777777777777778 |         1 |        2 | 0.0864197530864197 |    0 |        9 |         0 |      18 |        1 | f               |             |   8 | 
     18 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        4 |         8 |         |          | f               |             |   8 | {3}
     19 | {0,2}         |       4 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |         8 |      38 |        1 | f               |             |   8 | {3}
     20 | {0,3}         |       2 |               0.5 |         1 |        2 |                0.5 |    0 |        2 |         8 |      53 |        1 | t               |          70 |   8 | {3}
     38 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        2 |        19 |         |          | t               |          80 |   8 | {3,4}
     39 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        19 |         |          | t               |          80 |   8 | {3,4}
     53 | {0,3,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        20 |         |          | f               |             |   8 | 
     54 | {0,3,2}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        1 |        20 |         |          | f               |             |   8 | 
      9 | {0}           |       3 | 0.555555555555556 |         1 |        2 |  0.327160493827161 |    0 |        9 |         0 |      33 |        1 | f               |             |   9 | 
     33 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        2 |         9 |         |          | f               |             |   9 | {3}
     34 | {0,2}         |       4 |              0.75 |         1 |        2 |              0.375 |    0 |        4 |         9 |      55 |        1 | f               |             |   9 | {3}
     35 | {0,3}         |       4 |                 1 |         1 |        1 |                  0 |    0 |        3 |         9 |         |          | f               |             |   9 | {3}
     55 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        3 |        34 |         |          | t               |          96 |   9 | {3,4}
     56 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        34 |         |          | t               |          70 |   9 | {3,4}
     10 | {0}           |       3 | 0.666666666666667 |         1 |        2 |  0.277777777777778 |    0 |        9 |         0 |      21 |        1 | f               |             |  10 | 
     21 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        10 |         |          | f               |             |  10 | {3}
     22 | {0,2}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        4 |        10 |         |          | f               |             |  10 | {3}
     23 | {0,3}         |       2 |              0.75 |         1 |        1 |              0.375 |    0 |        4 |        10 |      40 |        1 | t               |          70 |  10 | {3}
     40 | {0,3,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        23 |         |          | f               |             |  10 | 
     41 | {0,3,2}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        3 |        23 |         |          | f               |             |  10 | 
    (60 rows)
    
  4. To display the random forest with human readable format:
    sql> select * from MADlib.rf_display('trained_tree_infogain');
                                                 rf_display                                              
    -----------------------------------------------------------------------------------------------------
                                                                                                         
     Tree 1                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.777777777777778)                    
             outlook:  = overcast  : class( Play)   num_elements(4)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(2)  predict_prob(1)                         
             outlook:  = sunny  : class( Do not Play)   num_elements(3)  predict_prob(0.666666666666667) 
                 humidity:  <= 70  : class( Play)   num_elements(1)  predict_prob(1)                     
                 humidity:  > 70  : class( Do not Play)   num_elements(2)  predict_prob(1)               
     
                                                                                                         
     Tree 2                                                                                              
         Root Node  : class( Do not Play)   num_elements(9)  predict_prob(0.555555555555556)             
             humidity:  <= 65  : class( Play)   num_elements(2)  predict_prob(1)                         
             humidity:  > 65  : class( Do not Play)   num_elements(7)  predict_prob(0.714285714285714)   
                 windy:  =  false  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)     
                     outlook:  = overcast  : class( Play)   num_elements(1)  predict_prob(1)             
                     outlook:  = rain  : class( Play)   num_elements(1)  predict_prob(1)                 
                     outlook:  = sunny  : class( Do not Play)   num_elements(1)  predict_prob(1)         
                 windy:  =  true  : class( Do not Play)   num_elements(4)  predict_prob(1)               
     
                                                                                                         
     Tree 3                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.777777777777778)                    
             humidity:  <= 80  : class( Play)   num_elements(6)  predict_prob(1)                         
             humidity:  > 80  : class( Do not Play)   num_elements(3)  predict_prob(0.666666666666667)   
                 humidity:  <= 90  : class( Do not Play)   num_elements(2)  predict_prob(1)              
                 humidity:  > 90  : class( Play)   num_elements(1)  predict_prob(1)                      
     
                                                                                                         
     Tree 4                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.888888888888889)                    
             windy:  =  false  : class( Play)   num_elements(6)  predict_prob(1)                         
             windy:  =  true  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)          
                 outlook:  = overcast  : class( Play)   num_elements(2)  predict_prob(1)                 
                 outlook:  = rain  : class( Do not Play)   num_elements(1)  predict_prob(1)              
     
                                                                                                         
     Tree 5                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.888888888888889)                    
             humidity:  <= 90  : class( Play)   num_elements(8)  predict_prob(1)                         
             humidity:  > 90  : class( Do not Play)   num_elements(1)  predict_prob(1)                   
     
                                                                                                         
     Tree 6                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.555555555555556)                    
             outlook:  = overcast  : class( Play)   num_elements(3)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)         
                 windy:  =  false  : class( Play)   num_elements(2)  predict_prob(1)                     
                 windy:  =  true  : class( Do not Play)   num_elements(1)  predict_prob(1)               
             outlook:  = sunny  : class( Do not Play)   num_elements(3)  predict_prob(1)                 
     
                                                                                                         
     Tree 7                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.666666666666667)                    
             windy:  =  false  : class( Play)   num_elements(7)  predict_prob(0.857142857142857)         
                 humidity:  <= 80  : class( Play)   num_elements(5)  predict_prob(1)                     
                 humidity:  > 80  : class( Play)   num_elements(2)  predict_prob(0.5)                    
                     humidity:  <= 95  : class( Do not Play)   num_elements(1)  predict_prob(1)          
                     humidity:  > 95  : class( Play)   num_elements(1)  predict_prob(1)                  
             windy:  =  true  : class( Do not Play)   num_elements(2)  predict_prob(1)                   
     
                                                                                                         
     Tree 8                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.777777777777778)                    
             outlook:  = overcast  : class( Play)   num_elements(4)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)         
                 windy:  =  false  : class( Play)   num_elements(2)  predict_prob(1)                     
                 windy:  =  true  : class( Do not Play)   num_elements(1)  predict_prob(1)               
             outlook:  = sunny  : class( Play)   num_elements(2)  predict_prob(0.5)                      
                 humidity:  <= 70  : class( Play)   num_elements(1)  predict_prob(1)                     
                 humidity:  > 70  : class( Do not Play)   num_elements(1)  predict_prob(1)               
     
                                                                                                         
     Tree 9                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.555555555555556)                    
             outlook:  = overcast  : class( Play)   num_elements(2)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(4)  predict_prob(0.75)                      
                 windy:  =  false  : class( Play)   num_elements(3)  predict_prob(1)                     
                 windy:  =  true  : class( Do not Play)   num_elements(1)  predict_prob(1)               
             outlook:  = sunny  : class( Do not Play)   num_elements(3)  predict_prob(1)                 
     
                                                                                                         
     Tree 10                                                                                             
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.666666666666667)                    
             outlook:  = overcast  : class( Play)   num_elements(1)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(4)  predict_prob(1)                         
             outlook:  = sunny  : class( Do not Play)   num_elements(4)  predict_prob(0.75)              
                 humidity:  <= 70  : class( Play)   num_elements(1)  predict_prob(1)                     
                 humidity:  > 70  : class( Do not Play)   num_elements(3)  predict_prob(1)               
     
    (10 rows)
    
  5. To classify data with the learned model:
    sql> select * from MADlib.rf_classify(
             'trained_tree_infogain',  -- name of the trained model
             'golf_data',              -- name of the table containing data to classify
             'classification_result'); -- name of the output table
     input_set_size | classification_time 
    ----------------+---------------------
                 14 | 00:00:02.215017
    (1 row)
    
  6. Check classification results:
    sql> select t.id,t.outlook,t.temperature,t.humidity,t.windy,c.class from
        classification_result c,golf_data t where t.id=c.id order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Do not Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  7. Score the data against a validation set:
    sql> select * from MADlib.rf_score(
            'trained_tree_infogain',
            'golf_data_validation',
            0);
         rf_score      
    -------------------
     0.928571428571429
    (1 row)
    
  8. Clean up the random forest and other auxiliary information:
    testdb=# select MADLIB_SCHEMA.rf_clean('trained_tree_infogain');
     rf_clean 
    ----------
     t
    (1 row)
    
Literature:

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

[2] http://en.wikipedia.org/wiki/Discretization_of_continuous_features

See Also
File rf.sql_in documenting the SQL functions.