User Documentation
Random Forest
+ Collaboration diagram for Random Forest:
About:
A random forest (RF) is an ensemble classifier that consists of many decision trees and outputs the class that is voted by the majority of the individual trees.

It has the following well-known advantages:

This module provides an implementation of the random forest algorithm described in [1].

The implementation supports:

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ....................
    featureN SUPPORTED_DATA_TYPE,
    class    SUPPORTED_DATA_TYPE,
    ...
)

The detailed list of SUPPORTED_DATA_TYPE is: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.

The data to classify is expected to be of the same form as training data, except that it does not need a class column.

Usage:
Examples:
  1. Prepare an input table/view, e.g.:
    sql> select * from golf_data order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  2. Train the random forest, e.g.:
    sql> SELECT * FROM MADlib.rf_clean('trained_tree_infogain');
    sql> SELECT * FROM MADlib.rf_train(
    	   'infogain',                      	 -- split criterion_name
    	   'golf_data',                      	 -- input table name
    	   'trained_tree_infogain',         	 -- result tree name
    	   10,                                   -- number of trees
           NULL,                                 -- features_per_node
           0.632,                                -- sampling_percentage
    	   'temperature,humidity',               -- continuous feature names
    	   'outlook,temperature,humidity,windy', -- feature column names
    	   'id',                                 -- id column name
    	   'class',                              -- class column name
    	   'explicit',                           -- how to handle missing value
    	   10,                                   -- max tree depth
    	   0.0,                                  -- min percent mode
    	   0.0,                                  -- min percent split
           0                                     -- max split point
    	   0);                                   -- verbosity
     training_time  | num_of_samples | num_trees | features_per_node | num_tree_nodes | max_tree_depth | split_criterion |    acs_time     |    acc_time     |    olap_time    |   update_time   |    best_time    
    ----------------+--------------+-----------+-------------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------
     00:00:03.60498 |           14 |        10 |                 3 |             71 |              6 | infogain        | 00:00:00.154991 | 00:00:00.404411 | 00:00:00.736876 | 00:00:00.374084 | 00:00:01.722658
    (1 row)
    
  3. Check the table records that keep the random forest:
    sql> select * from golf_tree order by tid,id;
     id | tree_location | feature |    probability    | ebp_coeff | maxclass |     split_gain     | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids 
    ----+---------------+---------+-------------------+-----------+----------+--------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+--------
      1 | {0}           |       3 | 0.777777777777778 |         1 |        2 |  0.197530864197531 |    0 |        9 |         0 |      24 |        1 | f               |             |   1 | 
     24 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        4 |         1 |         |          | f               |             |   1 | {3}
     25 | {0,2}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        2 |         1 |         |          | f               |             |   1 | {3}
     26 | {0,3}         |       2 | 0.666666666666667 |         1 |        1 |  0.444444444444444 |    0 |        3 |         1 |      42 |        1 | t               |          70 |   1 | {3}
     42 | {0,3,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        26 |         |          | f               |             |   1 | 
     43 | {0,3,2}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        2 |        26 |         |          | f               |             |   1 | 
      2 | {0}           |       2 | 0.555555555555556 |         1 |        1 |   0.17636684303351 |    0 |        9 |         0 |      11 |        1 | t               |          65 |   2 | 
     11 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        2 |         2 |         |          | f               |             |   2 | 
     12 | {0,2}         |       4 | 0.714285714285714 |         1 |        1 |  0.217687074829932 |    0 |        7 |         2 |      44 |        1 | f               |             |   2 | 
     44 | {0,2,1}       |       3 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |        12 |      57 |        1 | f               |             |   2 | {4}
     45 | {0,2,2}       |       3 |                 1 |         1 |        1 |                  0 |    0 |        4 |        12 |         |          | f               |             |   2 | {4}
     57 | {0,2,1,1}     |       2 |                 1 |         1 |        2 |                  0 |    0 |        1 |        44 |         |          | t               |          78 |   2 | {4,3}
     58 | {0,2,1,2}     |       2 |                 1 |         1 |        2 |                  0 |    0 |        1 |        44 |         |          | t               |          96 |   2 | {4,3}
     59 | {0,2,1,3}     |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        44 |         |          | t               |          85 |   2 | {4,3}
      3 | {0}           |       2 | 0.777777777777778 |         1 |        2 |  0.197530864197531 |    0 |        9 |         0 |      27 |        1 | t               |          80 |   3 | 
     27 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        6 |         3 |         |          | f               |             |   3 | 
     28 | {0,2}         |       2 | 0.666666666666667 |         1 |        1 |  0.444444444444444 |    0 |        3 |         3 |      46 |        1 | t               |          90 |   3 | 
     46 | {0,2,1}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        2 |        28 |         |          | f               |             |   3 | 
     47 | {0,2,2}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        28 |         |          | f               |             |   3 | 
      4 | {0}           |       4 | 0.888888888888889 |         1 |        2 | 0.0493827160493827 |    0 |        9 |         0 |      13 |        1 | f               |             |   4 | 
     13 | {0,1}         |       3 |                 1 |         1 |        2 |                  0 |    0 |        6 |         4 |         |          | f               |             |   4 | {4}
     14 | {0,2}         |       3 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |         4 |      48 |        1 | f               |             |   4 | {4}
     48 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        2 |        14 |         |          | t               |          90 |   4 | {4,3}
     49 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        14 |         |          | t               |          80 |   4 | {4,3}
      5 | {0}           |       2 | 0.888888888888889 |         1 |        2 |  0.197530864197531 |    0 |        9 |         0 |      29 |        1 | t               |          90 |   5 | 
     29 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        8 |         5 |         |          | f               |             |   5 | 
     30 | {0,2}         |       3 |                 1 |         1 |        1 |                  0 |    0 |        1 |         5 |         |          | f               |             |   5 | 
      6 | {0}           |       3 | 0.555555555555556 |         1 |        2 |  0.345679012345679 |    0 |        9 |         0 |      15 |        1 | f               |             |   6 | 
     15 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        3 |         6 |         |          | f               |             |   6 | {3}
     16 | {0,2}         |       4 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |         6 |      51 |        1 | f               |             |   6 | {3}
     17 | {0,3}         |       4 |                 1 |         1 |        1 |                  0 |    0 |        3 |         6 |         |          | f               |             |   6 | {3}
     51 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        2 |        16 |         |          | t               |          96 |   6 | {3,4}
     52 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        16 |         |          | t               |          70 |   6 | {3,4}
      7 | {0}           |       4 | 0.666666666666667 |         1 |        2 |  0.253968253968254 |    0 |        9 |         0 |      31 |        1 | f               |             |   7 | 
     31 | {0,1}         |       2 | 0.857142857142857 |         1 |        2 |  0.102040816326531 |    0 |        7 |         7 |      36 |        1 | t               |          80 |   7 | {4}
     32 | {0,2}         |       3 |                 1 |         1 |        1 |                  0 |    0 |        2 |         7 |         |          | f               |             |   7 | {4}
     36 | {0,1,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        5 |        31 |         |          | f               |             |   7 | 
     37 | {0,1,2}       |       2 |               0.5 |         1 |        2 |                0.5 |    0 |        2 |        31 |      60 |        1 | t               |          95 |   7 | 
     60 | {0,1,2,1}     |       4 |                 1 |         1 |        1 |                  0 |    0 |        1 |        37 |         |          | f               |             |   7 | 
     61 | {0,1,2,2}     |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        37 |         |          | f               |             |   7 | 
      8 | {0}           |       3 | 0.777777777777778 |         1 |        2 | 0.0864197530864197 |    0 |        9 |         0 |      18 |        1 | f               |             |   8 | 
     18 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        4 |         8 |         |          | f               |             |   8 | {3}
     19 | {0,2}         |       4 | 0.666666666666667 |         1 |        2 |  0.444444444444444 |    0 |        3 |         8 |      38 |        1 | f               |             |   8 | {3}
     20 | {0,3}         |       2 |               0.5 |         1 |        2 |                0.5 |    0 |        2 |         8 |      53 |        1 | t               |          70 |   8 | {3}
     38 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        2 |        19 |         |          | t               |          80 |   8 | {3,4}
     39 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        19 |         |          | t               |          80 |   8 | {3,4}
     53 | {0,3,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        20 |         |          | f               |             |   8 | 
     54 | {0,3,2}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        1 |        20 |         |          | f               |             |   8 | 
      9 | {0}           |       3 | 0.555555555555556 |         1 |        2 |  0.327160493827161 |    0 |        9 |         0 |      33 |        1 | f               |             |   9 | 
     33 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        2 |         9 |         |          | f               |             |   9 | {3}
     34 | {0,2}         |       4 |              0.75 |         1 |        2 |              0.375 |    0 |        4 |         9 |      55 |        1 | f               |             |   9 | {3}
     35 | {0,3}         |       4 |                 1 |         1 |        1 |                  0 |    0 |        3 |         9 |         |          | f               |             |   9 | {3}
     55 | {0,2,1}       |       2 |                 1 |         1 |        2 |                  0 |    0 |        3 |        34 |         |          | t               |          96 |   9 | {3,4}
     56 | {0,2,2}       |       2 |                 1 |         1 |        1 |                  0 |    0 |        1 |        34 |         |          | t               |          70 |   9 | {3,4}
     10 | {0}           |       3 | 0.666666666666667 |         1 |        2 |  0.277777777777778 |    0 |        9 |         0 |      21 |        1 | f               |             |  10 | 
     21 | {0,1}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        10 |         |          | f               |             |  10 | {3}
     22 | {0,2}         |       4 |                 1 |         1 |        2 |                  0 |    0 |        4 |        10 |         |          | f               |             |  10 | {3}
     23 | {0,3}         |       2 |              0.75 |         1 |        1 |              0.375 |    0 |        4 |        10 |      40 |        1 | t               |          70 |  10 | {3}
     40 | {0,3,1}       |       4 |                 1 |         1 |        2 |                  0 |    0 |        1 |        23 |         |          | f               |             |  10 | 
     41 | {0,3,2}       |       4 |                 1 |         1 |        1 |                  0 |    0 |        3 |        23 |         |          | f               |             |  10 | 
    (60 rows)
    
  4. To display the random forest with human readable format:
    sql> select * from MADlib.rf_display('trained_tree_infogain');
                                                 rf_display                                              
    -----------------------------------------------------------------------------------------------------
                                                                                                         
     Tree 1                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.777777777777778)                    
             outlook:  = overcast  : class( Play)   num_elements(4)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(2)  predict_prob(1)                         
             outlook:  = sunny  : class( Do not Play)   num_elements(3)  predict_prob(0.666666666666667) 
                 humidity:  <= 70  : class( Play)   num_elements(1)  predict_prob(1)                     
                 humidity:  > 70  : class( Do not Play)   num_elements(2)  predict_prob(1)               
     
                                                                                                         
     Tree 2                                                                                              
         Root Node  : class( Do not Play)   num_elements(9)  predict_prob(0.555555555555556)             
             humidity:  <= 65  : class( Play)   num_elements(2)  predict_prob(1)                         
             humidity:  > 65  : class( Do not Play)   num_elements(7)  predict_prob(0.714285714285714)   
                 windy:  =  false  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)     
                     outlook:  = overcast  : class( Play)   num_elements(1)  predict_prob(1)             
                     outlook:  = rain  : class( Play)   num_elements(1)  predict_prob(1)                 
                     outlook:  = sunny  : class( Do not Play)   num_elements(1)  predict_prob(1)         
                 windy:  =  true  : class( Do not Play)   num_elements(4)  predict_prob(1)               
     
                                                                                                         
     Tree 3                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.777777777777778)                    
             humidity:  <= 80  : class( Play)   num_elements(6)  predict_prob(1)                         
             humidity:  > 80  : class( Do not Play)   num_elements(3)  predict_prob(0.666666666666667)   
                 humidity:  <= 90  : class( Do not Play)   num_elements(2)  predict_prob(1)              
                 humidity:  > 90  : class( Play)   num_elements(1)  predict_prob(1)                      
     
                                                                                                         
     Tree 4                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.888888888888889)                    
             windy:  =  false  : class( Play)   num_elements(6)  predict_prob(1)                         
             windy:  =  true  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)          
                 outlook:  = overcast  : class( Play)   num_elements(2)  predict_prob(1)                 
                 outlook:  = rain  : class( Do not Play)   num_elements(1)  predict_prob(1)              
     
                                                                                                         
     Tree 5                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.888888888888889)                    
             humidity:  <= 90  : class( Play)   num_elements(8)  predict_prob(1)                         
             humidity:  > 90  : class( Do not Play)   num_elements(1)  predict_prob(1)                   
     
                                                                                                         
     Tree 6                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.555555555555556)                    
             outlook:  = overcast  : class( Play)   num_elements(3)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)         
                 windy:  =  false  : class( Play)   num_elements(2)  predict_prob(1)                     
                 windy:  =  true  : class( Do not Play)   num_elements(1)  predict_prob(1)               
             outlook:  = sunny  : class( Do not Play)   num_elements(3)  predict_prob(1)                 
     
                                                                                                         
     Tree 7                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.666666666666667)                    
             windy:  =  false  : class( Play)   num_elements(7)  predict_prob(0.857142857142857)         
                 humidity:  <= 80  : class( Play)   num_elements(5)  predict_prob(1)                     
                 humidity:  > 80  : class( Play)   num_elements(2)  predict_prob(0.5)                    
                     humidity:  <= 95  : class( Do not Play)   num_elements(1)  predict_prob(1)          
                     humidity:  > 95  : class( Play)   num_elements(1)  predict_prob(1)                  
             windy:  =  true  : class( Do not Play)   num_elements(2)  predict_prob(1)                   
     
                                                                                                         
     Tree 8                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.777777777777778)                    
             outlook:  = overcast  : class( Play)   num_elements(4)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(3)  predict_prob(0.666666666666667)         
                 windy:  =  false  : class( Play)   num_elements(2)  predict_prob(1)                     
                 windy:  =  true  : class( Do not Play)   num_elements(1)  predict_prob(1)               
             outlook:  = sunny  : class( Play)   num_elements(2)  predict_prob(0.5)                      
                 humidity:  <= 70  : class( Play)   num_elements(1)  predict_prob(1)                     
                 humidity:  > 70  : class( Do not Play)   num_elements(1)  predict_prob(1)               
     
                                                                                                         
     Tree 9                                                                                              
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.555555555555556)                    
             outlook:  = overcast  : class( Play)   num_elements(2)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(4)  predict_prob(0.75)                      
                 windy:  =  false  : class( Play)   num_elements(3)  predict_prob(1)                     
                 windy:  =  true  : class( Do not Play)   num_elements(1)  predict_prob(1)               
             outlook:  = sunny  : class( Do not Play)   num_elements(3)  predict_prob(1)                 
     
                                                                                                         
     Tree 10                                                                                             
         Root Node  : class( Play)   num_elements(9)  predict_prob(0.666666666666667)                    
             outlook:  = overcast  : class( Play)   num_elements(1)  predict_prob(1)                     
             outlook:  = rain  : class( Play)   num_elements(4)  predict_prob(1)                         
             outlook:  = sunny  : class( Do not Play)   num_elements(4)  predict_prob(0.75)              
                 humidity:  <= 70  : class( Play)   num_elements(1)  predict_prob(1)                     
                 humidity:  > 70  : class( Do not Play)   num_elements(3)  predict_prob(1)               
     
    (10 rows)
    
  5. To classify data with the learned model:
    sql> select * from MADlib.rf_classify(
             'trained_tree_infogain',  -- name of the trained model
             'golf_data',              -- name of the table containing data to classify
             'classification_result'); -- name of the output table
     input_set_size | classification_time 
    ----------------+---------------------
                 14 | 00:00:02.215017
    (1 row)
    
  6. Check classification results:
    sql> select t.id,t.outlook,t.temperature,t.humidity,t.windy,c.class from
        classification_result c,golf_data t where t.id=c.id order by id;
     id | outlook  | temperature | humidity | windy  |    class     
    ----+----------+-------------+----------+--------+--------------
      1 | sunny    |          85 |       85 |  false |  Do not Play
      2 | sunny    |          80 |       90 |  true  |  Do not Play
      3 | overcast |          83 |       78 |  false |  Play
      4 | rain     |          70 |       96 |  false |  Play
      5 | rain     |          68 |       80 |  false |  Play
      6 | rain     |          65 |       70 |  true  |  Do not Play
      7 | overcast |          64 |       65 |  true  |  Play
      8 | sunny    |          72 |       95 |  false |  Do not Play
      9 | sunny    |          69 |       70 |  false |  Play
     10 | rain     |          75 |       80 |  false |  Play
     11 | sunny    |          75 |       70 |  true  |  Do not Play
     12 | overcast |          72 |       90 |  true  |  Play
     13 | overcast |          81 |       75 |  false |  Play
     14 | rain     |          71 |       80 |  true  |  Do not Play
    (14 rows)
    
  7. Score the data against a validation set:
    sql> select * from MADlib.rf_score(
            'trained_tree_infogain',
            'golf_data_validation',
            0);
         rf_score      
    -------------------
     0.928571428571429
    (1 row)
    
  8. Clean up the random forest and other auxiliary information:
    testdb=# select MADLIB_SCHEMA.rf_clean('trained_tree_infogain');
     rf_clean 
    ----------
     t
    (1 row)
    
Literature:

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

[2] http://en.wikipedia.org/wiki/Discretization_of_continuous_features

See also:
File rf.sql_in documenting the SQL functions.