A random forest (RF) is an ensemble classifier that consists of many decision trees and outputs the class that is voted by the majority of the individual trees.
It has the following well-known advantages:
This module provides an implementation of the random forest algorithm described in [1].
The implementation supports:
The data to classify is expected to be of the same form as training data, except that it does not need a class column.
Run the training algorithm on the source data.
rf_train( split_criterion, training_table_name, result_rf_table_name, num_trees, features_per_node, sampling_percentage, continuous_feature_names, feature_col_names, id_col_name, class_col_name, how2handle_missing_value, max_tree_depth, node_prune_threshold, node_split_threshold, verbosity )
Arguments
The name of the split criterion that should be used for tree construction. The valid values are ‘infogain’, ‘gainratio’, and ‘gini’. It can't be NULL. Information gain(infogain) and gini index(gini) are biased toward multivalued attributes. Gain ratio(gainratio) adjusts for this bias. However, it tends to prefer unbalanced splits in which one partition is much smaller than the others.
The name of the table/view with the training data. It can't be NULL and must exist.
The training data is expected to be of the following form:
{TABLE|VIEW} trainingSource ( ... id INT|BIGINT, feature1 SUPPORTED_DATA_TYPE, feature2 SUPPORTED_DATA_TYPE, feature3 SUPPORTED_DATA_TYPE, .................... featureN SUPPORTED_DATA_TYPE, class SUPPORTED_DATA_TYPE, ... )
SUPPORTED_DATA_TYPE can be any of the following: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.
The name of the table where the resulting trees are stored. It can not be NULL and must not exist.
The output table stores an abstract object (representing the model) used for further classification. The table has the following columns:
id | |
---|---|
tree_location | |
feature | |
probability | |
ebp_coeff | |
maxclass | |
split_gain | |
live | |
cat_size | |
parent_id | |
lmc_nid | |
lmc_fval | |
is_feature_cont | |
split_value | |
tid | |
dp_ids |
The classification function creates the result_table with the classification results.
rf_classify( rf_table_name, classification_table_name, result_table_name)
The scoring function gives a ratio of correctly classified items in the validation data set.
rf_score( rf_table_name, validation_table_name, verbosity)
The display tree function displays the trained trees in a human-readable format.
rf_display( rf_table_name )
The clean tree function cleans up the learned model and metadata.
rf_clean( rf_table_name )
SELECT * FROM golf_data ORDER BY id;Result:
id | outlook | temperature | humidity | windy | class ---+----------+-------------+----------+--------+-------------- 1 | sunny | 85 | 85 | false | Do not Play 2 | sunny | 80 | 90 | true | Do not Play 3 | overcast | 83 | 78 | false | Play 4 | rain | 70 | 96 | false | Play 5 | rain | 68 | 80 | false | Play 6 | rain | 65 | 70 | true | Do not Play 7 | overcast | 64 | 65 | true | Play 8 | sunny | 72 | 95 | false | Do not Play 9 | sunny | 69 | 70 | false | Play 10 | rain | 75 | 80 | false | Play 11 | sunny | 75 | 70 | true | Play 12 | overcast | 72 | 90 | true | Play 13 | overcast | 81 | 75 | false | Play 14 | rain | 71 | 80 | true | Do not Play (14 rows)
SELECT * FROM madlib.rf_clean( 'trained_tree_infogain' ); SELECT * FROM madlib.rf_train( 'infogain', 'golf_data', 'trained_tree_infogain', 10, NULL, 0.632, 'temperature,humidity', 'outlook,temperature,humidity,windy', 'id', 'class', 'explicit', 10, 0.0, 0.0, 0, 0 );Result:
training_time | num_of_samples | num_trees | features_per_node | num_tree_nodes | max_tree_depth | split_criterion | acs_time | acc_time | olap_time | update_time | best_time ---------------+--------------+-----------+-------------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------- 00:00:03.60498 | 14 | 10 | 3 | 71 | 6 | infogain | 00:00:00.154991 | 00:00:00.404411 | 00:00:00.736876 | 00:00:00.374084 | 00:00:01.722658 (1 row)
SELECT * FROM golf_tree ORDER BY tid, id;
id | tree_location | feature | probability | ebp_coeff | maxclass | split_gain | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids ---+---------------+---------+-------------------+-----------+----------+--------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+-------- 1 | {0} | 3 | 0.777777777777778 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 24 | 1 | f | | 1 | 24 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 1 | | | f | | 1 | {3} 25 | {0,2} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 1 | | | f | | 1 | {3} 26 | {0,3} | 2 | 0.666666666666667 | 1 | 1 | 0.444444444444444 | 0 | 3 | 1 | 42 | 1 | t | 70 | 1 | {3} 42 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 26 | | | f | | 1 | 43 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 26 | | | f | | 1 | 2 | {0} | 2 | 0.555555555555556 | 1 | 1 | 0.17636684303351 | 0 | 9 | 0 | 11 | 1 | t | 65 | 2 | 11 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 2 | | | f | | 2 | 12 | {0,2} | 4 | 0.714285714285714 | 1 | 1 | 0.217687074829932 | 0 | 7 | 2 | 44 | 1 | f | | 2 | 44 | {0,2,1} | 3 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 12 | 57 | 1 | f | | 2 | {4} 45 | {0,2,2} | 3 | 1 | 1 | 1 | 0 | 0 | 4 | 12 | | | f | | 2 | {4} 57 | {0,2,1,1} | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 44 | | | t | 78 | 2 | {4,3} 58 | {0,2,1,2} | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 44 | | | t | 96 | 2 | {4,3} 59 | {0,2,1,3} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 44 | | | t | 85 | 2 | {4,3} 3 | {0} | 2 | 0.777777777777778 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 27 | 1 | t | 80 | 3 | 27 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 6 | 3 | | | f | | 3 | 28 | {0,2} | 2 | 0.666666666666667 | 1 | 1 | 0.444444444444444 | 0 | 3 | 3 | 46 | 1 | t | 90 | 3 | 46 | {0,2,1} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 28 | | | f | | 3 | 47 | {0,2,2} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 28 | | | f | | 3 | 4 | {0} | 4 | 0.888888888888889 | 1 | 2 | 0.0493827160493827 | 0 | 9 | 0 | 13 | 1 | f | | 4 | 13 | {0,1} | 3 | 1 | 1 | 2 | 0 | 0 | 6 | 4 | | | f | | 4 | {4} 14 | {0,2} | 3 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 4 | 48 | 1 | f | | 4 | {4} 48 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 14 | | | t | 90 | 4 | {4,3} 49 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 14 | | | t | 80 | 4 | {4,3} 5 | {0} | 2 | 0.888888888888889 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 29 | 1 | t | 90 | 5 | 29 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 8 | 5 | | | f | | 5 | 30 | {0,2} | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 5 | | | f | | 5 | 6 | {0} | 3 | 0.555555555555556 | 1 | 2 | 0.345679012345679 | 0 | 9 | 0 | 15 | 1 | f | | 6 | 15 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 3 | 6 | | | f | | 6 | {3} 16 | {0,2} | 4 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 6 | 51 | 1 | f | | 6 | {3} 17 | {0,3} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 6 | | | f | | 6 | {3} 51 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 16 | | | t | 96 | 6 | {3,4} 52 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 16 | | | t | 70 | 6 | {3,4} 7 | {0} | 4 | 0.666666666666667 | 1 | 2 | 0.253968253968254 | 0 | 9 | 0 | 31 | 1 | f | | 7 | 31 | {0,1} | 2 | 0.857142857142857 | 1 | 2 | 0.102040816326531 | 0 | 7 | 7 | 36 | 1 | t | 80 | 7 | {4} 32 | {0,2} | 3 | 1 | 1 | 1 | 0 | 0 | 2 | 7 | | | f | | 7 | {4} 36 | {0,1,1} | 4 | 1 | 1 | 2 | 0 | 0 | 5 | 31 | | | f | | 7 | 37 | {0,1,2} | 2 | 0.5 | 1 | 2 | 0.5 | 0 | 2 | 31 | 60 | 1 | t | 95 | 7 | 60 | {0,1,2,1} | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 37 | | | f | | 7 | 61 | {0,1,2,2} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 37 | | | f | | 7 | 8 | {0} | 3 | 0.777777777777778 | 1 | 2 | 0.0864197530864197 | 0 | 9 | 0 | 18 | 1 | f | | 8 | 18 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 8 | | | f | | 8 | {3} 19 | {0,2} | 4 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 8 | 38 | 1 | f | | 8 | {3} 20 | {0,3} | 2 | 0.5 | 1 | 2 | 0.5 | 0 | 2 | 8 | 53 | 1 | t | 70 | 8 | {3} 38 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 19 | | | t | 80 | 8 | {3,4} 39 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 19 | | | t | 80 | 8 | {3,4} 53 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 20 | | | f | | 8 | 54 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 20 | | | f | | 8 | 9 | {0} | 3 | 0.555555555555556 | 1 | 2 | 0.327160493827161 | 0 | 9 | 0 | 33 | 1 | f | | 9 | 33 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 9 | | | f | | 9 | {3} 34 | {0,2} | 4 | 0.75 | 1 | 2 | 0.375 | 0 | 4 | 9 | 55 | 1 | f | | 9 | {3} 35 | {0,3} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 9 | | | f | | 9 | {3} 55 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 3 | 34 | | | t | 96 | 9 | {3,4} 56 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 34 | | | t | 70 | 9 | {3,4} 10 | {0} | 3 | 0.666666666666667 | 1 | 2 | 0.277777777777778 | 0 | 9 | 0 | 21 | 1 | f | | 10 | 21 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 10 | | | f | | 10 | {3} 22 | {0,2} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 10 | | | f | | 10 | {3} 23 | {0,3} | 2 | 0.75 | 1 | 1 | 0.375 | 0 | 4 | 10 | 40 | 1 | t | 70 | 10 | {3} 40 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 23 | | | f | | 10 | 41 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 23 | | | f | | 10 | (60 rows)
SELECT * FROM madlib.rf_display( 'trained_tree_infogain' );Result:
rf_display ---------------------------------------------------------------------------------------------------- Tree 1 Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778) outlook: = overcast : class( Play) num_elements(4) predict_prob(1) outlook: = rain : class( Play) num_elements(2) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(0.666666666666667) humidity: <= 70 : class( Play) num_elements(1) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(2) predict_prob(1) Tree 2 Root Node : class( Do not Play) num_elements(9) predict_prob(0.555555555555556) humidity: <= 65 : class( Play) num_elements(2) predict_prob(1) humidity: > 65 : class( Do not Play) num_elements(7) predict_prob(0.714285714285714) windy: = false : class( Play) num_elements(3) predict_prob(0.666666666666667) outlook: = overcast : class( Play) num_elements(1) predict_prob(1) outlook: = rain : class( Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(1) predict_prob(1) windy: = true : class( Do not Play) num_elements(4) predict_prob(1) Tree 3 Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778) humidity: <= 80 : class( Play) num_elements(6) predict_prob(1) humidity: > 80 : class( Do not Play) num_elements(3) predict_prob(0.666666666666667) humidity: <= 90 : class( Do not Play) num_elements(2) predict_prob(1) humidity: > 90 : class( Play) num_elements(1) predict_prob(1) Tree 4 Root Node : class( Play) num_elements(9) predict_prob(0.888888888888889) windy: = false : class( Play) num_elements(6) predict_prob(1) windy: = true : class( Play) num_elements(3) predict_prob(0.666666666666667) outlook: = overcast : class( Play) num_elements(2) predict_prob(1) outlook: = rain : class( Do not Play) num_elements(1) predict_prob(1) Tree 5 Root Node : class( Play) num_elements(9) predict_prob(0.888888888888889) humidity: <= 90 : class( Play) num_elements(8) predict_prob(1) humidity: > 90 : class( Do not Play) num_elements(1) predict_prob(1) Tree 6 Root Node : class( Play) num_elements(9) predict_prob(0.555555555555556) outlook: = overcast : class( Play) num_elements(3) predict_prob(1) outlook: = rain : class( Play) num_elements(3) predict_prob(0.666666666666667) windy: = false : class( Play) num_elements(2) predict_prob(1) windy: = true : class( Do not Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(1) Tree 7 Root Node : class( Play) num_elements(9) predict_prob(0.666666666666667) windy: = false : class( Play) num_elements(7) predict_prob(0.857142857142857) humidity: <= 80 : class( Play) num_elements(5) predict_prob(1) humidity: > 80 : class( Play) num_elements(2) predict_prob(0.5) humidity: <= 95 : class( Do not Play) num_elements(1) predict_prob(1) humidity: > 95 : class( Play) num_elements(1) predict_prob(1) windy: = true : class( Do not Play) num_elements(2) predict_prob(1) Tree 8 Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778) outlook: = overcast : class( Play) num_elements(4) predict_prob(1) outlook: = rain : class( Play) num_elements(3) predict_prob(0.666666666666667) windy: = false : class( Play) num_elements(2) predict_prob(1) windy: = true : class( Do not Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Play) num_elements(2) predict_prob(0.5) humidity: <= 70 : class( Play) num_elements(1) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(1) predict_prob(1) Tree 9 Root Node : class( Play) num_elements(9) predict_prob(0.555555555555556) outlook: = overcast : class( Play) num_elements(2) predict_prob(1) outlook: = rain : class( Play) num_elements(4) predict_prob(0.75) windy: = false : class( Play) num_elements(3) predict_prob(1) windy: = true : class( Do not Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(1) Tree 10 Root Node : class( Play) num_elements(9) predict_prob(0.666666666666667) outlook: = overcast : class( Play) num_elements(1) predict_prob(1) outlook: = rain : class( Play) num_elements(4) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(4) predict_prob(0.75) humidity: <= 70 : class( Play) num_elements(1) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(3) predict_prob(1) (10 rows)
SELECT * FROM madlib.rf_classify( 'trained_tree_infogain', 'golf_data', 'classification_result' );Result:
input_set_size | classification_time ---------------+--------------------- 14 | 00:00:02.215017 (1 row)
SELECT t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class FROM classification_result c, golf_data t WHERE t.id=c.id ORDER BY id;Result:
id | outlook | temperature | humidity | windy | class ---+----------+-------------+----------+--------+-------------- 1 | sunny | 85 | 85 | false | Do not Play 2 | sunny | 80 | 90 | true | Do not Play 3 | overcast | 83 | 78 | false | Play 4 | rain | 70 | 96 | false | Play 5 | rain | 68 | 80 | false | Play 6 | rain | 65 | 70 | true | Do not Play 7 | overcast | 64 | 65 | true | Play 8 | sunny | 72 | 95 | false | Do not Play 9 | sunny | 69 | 70 | false | Play 10 | rain | 75 | 80 | false | Play 11 | sunny | 75 | 70 | true | Do not Play 12 | overcast | 72 | 90 | true | Play 13 | overcast | 81 | 75 | false | Play 14 | rain | 71 | 80 | true | Do not Play (14 rows)
SELECT * FROM madlib.rf_score( 'trained_tree_infogain', 'golf_data_validation', 0 );Result:
rf_score ------------------ 0.928571428571429 (1 row)
SELECT madlib.rf_clean( 'trained_tree_infogain' );Result:
rf_clean --------- t (1 row)
[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[2] http://en.wikipedia.org/wiki/Discretization_of_continuous_features