MADlib
0.7 A newer version is available
User Documentation
|
It has the following well-known advantages:
This module provides an implementation of the random forest algorithm described in [1].
The implementation supports:
The training data is expected to be of the following form:
{TABLE|VIEW} trainingSource ( ... id INT|BIGINT, feature1 SUPPORTED_DATA_TYPE, feature2 SUPPORTED_DATA_TYPE, feature3 SUPPORTED_DATA_TYPE, .................... featureN SUPPORTED_DATA_TYPE, class SUPPORTED_DATA_TYPE, ... )
The detailed list of SUPPORTED_DATA_TYPE is: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.
The data to classify is expected to be of the same form as training data, except that it does not need a class column.
SELECT * FROM rf_train( 'split_criterion', 'training_table_name', 'result_rf_table_name', 'num_trees', 'features_per_node', 'sampling_percentage', 'continuous_feature_names', 'feature_col_names', 'id_col_name', 'class_col_name' 'how2handle_missing_value', 'max_tree_depth', 'node_prune_threshold', 'node_split_threshold', 'verbosity');This will create the decision tree output table storing an abstract object (representing the model) used for further classification. Column names:
id | tree_location | feature | probability | ebp_coeff | maxclass | split_gain | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids ----+---------------+---------+-------------------+------------------+----------+-------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+-------- ...
SELECT * FROM rf_classify( 'rf_table_name', 'classification_table_name', 'result_table_name');This will create the result_table with the classification results.
SELECT * FROM rf_score( 'rf_table_name', 'validation_table_name', 'verbosity');This will give a ratio of correctly classified items in the validation set.
SELECT * FROM rf_display( 'rf_table_name');This will display the trained trees in human readable format.
SELECT * FROM rf_clean( 'rf_table_name');This will clean up the learned model and all metadata.
sql> select * from golf_data order by id; id | outlook | temperature | humidity | windy | class ----+----------+-------------+----------+--------+-------------- 1 | sunny | 85 | 85 | false | Do not Play 2 | sunny | 80 | 90 | true | Do not Play 3 | overcast | 83 | 78 | false | Play 4 | rain | 70 | 96 | false | Play 5 | rain | 68 | 80 | false | Play 6 | rain | 65 | 70 | true | Do not Play 7 | overcast | 64 | 65 | true | Play 8 | sunny | 72 | 95 | false | Do not Play 9 | sunny | 69 | 70 | false | Play 10 | rain | 75 | 80 | false | Play 11 | sunny | 75 | 70 | true | Play 12 | overcast | 72 | 90 | true | Play 13 | overcast | 81 | 75 | false | Play 14 | rain | 71 | 80 | true | Do not Play (14 rows)
sql> SELECT * FROM MADlib.rf_clean('trained_tree_infogain'); sql> SELECT * FROM MADlib.rf_train( 'infogain', -- split criterion_name 'golf_data', -- input table name 'trained_tree_infogain', -- result tree name 10, -- number of trees NULL, -- features_per_node 0.632, -- sampling_percentage 'temperature,humidity', -- continuous feature names 'outlook,temperature,humidity,windy', -- feature column names 'id', -- id column name 'class', -- class column name 'explicit', -- how to handle missing value 10, -- max tree depth 0.0, -- min percent mode 0.0, -- min percent split 0 -- max split point 0); -- verbosity training_time | num_of_samples | num_trees | features_per_node | num_tree_nodes | max_tree_depth | split_criterion | acs_time | acc_time | olap_time | update_time | best_time ----------------+--------------+-----------+-------------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------- 00:00:03.60498 | 14 | 10 | 3 | 71 | 6 | infogain | 00:00:00.154991 | 00:00:00.404411 | 00:00:00.736876 | 00:00:00.374084 | 00:00:01.722658 (1 row)
sql> select * from golf_tree order by tid,id; id | tree_location | feature | probability | ebp_coeff | maxclass | split_gain | live | cat_size | parent_id | lmc_nid | lmc_fval | is_feature_cont | split_value | tid | dp_ids ----+---------------+---------+-------------------+-----------+----------+--------------------+------+----------+-----------+---------+----------+-----------------+-------------+-----+-------- 1 | {0} | 3 | 0.777777777777778 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 24 | 1 | f | | 1 | 24 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 1 | | | f | | 1 | {3} 25 | {0,2} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 1 | | | f | | 1 | {3} 26 | {0,3} | 2 | 0.666666666666667 | 1 | 1 | 0.444444444444444 | 0 | 3 | 1 | 42 | 1 | t | 70 | 1 | {3} 42 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 26 | | | f | | 1 | 43 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 26 | | | f | | 1 | 2 | {0} | 2 | 0.555555555555556 | 1 | 1 | 0.17636684303351 | 0 | 9 | 0 | 11 | 1 | t | 65 | 2 | 11 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 2 | | | f | | 2 | 12 | {0,2} | 4 | 0.714285714285714 | 1 | 1 | 0.217687074829932 | 0 | 7 | 2 | 44 | 1 | f | | 2 | 44 | {0,2,1} | 3 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 12 | 57 | 1 | f | | 2 | {4} 45 | {0,2,2} | 3 | 1 | 1 | 1 | 0 | 0 | 4 | 12 | | | f | | 2 | {4} 57 | {0,2,1,1} | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 44 | | | t | 78 | 2 | {4,3} 58 | {0,2,1,2} | 2 | 1 | 1 | 2 | 0 | 0 | 1 | 44 | | | t | 96 | 2 | {4,3} 59 | {0,2,1,3} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 44 | | | t | 85 | 2 | {4,3} 3 | {0} | 2 | 0.777777777777778 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 27 | 1 | t | 80 | 3 | 27 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 6 | 3 | | | f | | 3 | 28 | {0,2} | 2 | 0.666666666666667 | 1 | 1 | 0.444444444444444 | 0 | 3 | 3 | 46 | 1 | t | 90 | 3 | 46 | {0,2,1} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 28 | | | f | | 3 | 47 | {0,2,2} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 28 | | | f | | 3 | 4 | {0} | 4 | 0.888888888888889 | 1 | 2 | 0.0493827160493827 | 0 | 9 | 0 | 13 | 1 | f | | 4 | 13 | {0,1} | 3 | 1 | 1 | 2 | 0 | 0 | 6 | 4 | | | f | | 4 | {4} 14 | {0,2} | 3 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 4 | 48 | 1 | f | | 4 | {4} 48 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 14 | | | t | 90 | 4 | {4,3} 49 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 14 | | | t | 80 | 4 | {4,3} 5 | {0} | 2 | 0.888888888888889 | 1 | 2 | 0.197530864197531 | 0 | 9 | 0 | 29 | 1 | t | 90 | 5 | 29 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 8 | 5 | | | f | | 5 | 30 | {0,2} | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 5 | | | f | | 5 | 6 | {0} | 3 | 0.555555555555556 | 1 | 2 | 0.345679012345679 | 0 | 9 | 0 | 15 | 1 | f | | 6 | 15 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 3 | 6 | | | f | | 6 | {3} 16 | {0,2} | 4 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 6 | 51 | 1 | f | | 6 | {3} 17 | {0,3} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 6 | | | f | | 6 | {3} 51 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 16 | | | t | 96 | 6 | {3,4} 52 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 16 | | | t | 70 | 6 | {3,4} 7 | {0} | 4 | 0.666666666666667 | 1 | 2 | 0.253968253968254 | 0 | 9 | 0 | 31 | 1 | f | | 7 | 31 | {0,1} | 2 | 0.857142857142857 | 1 | 2 | 0.102040816326531 | 0 | 7 | 7 | 36 | 1 | t | 80 | 7 | {4} 32 | {0,2} | 3 | 1 | 1 | 1 | 0 | 0 | 2 | 7 | | | f | | 7 | {4} 36 | {0,1,1} | 4 | 1 | 1 | 2 | 0 | 0 | 5 | 31 | | | f | | 7 | 37 | {0,1,2} | 2 | 0.5 | 1 | 2 | 0.5 | 0 | 2 | 31 | 60 | 1 | t | 95 | 7 | 60 | {0,1,2,1} | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 37 | | | f | | 7 | 61 | {0,1,2,2} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 37 | | | f | | 7 | 8 | {0} | 3 | 0.777777777777778 | 1 | 2 | 0.0864197530864197 | 0 | 9 | 0 | 18 | 1 | f | | 8 | 18 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 8 | | | f | | 8 | {3} 19 | {0,2} | 4 | 0.666666666666667 | 1 | 2 | 0.444444444444444 | 0 | 3 | 8 | 38 | 1 | f | | 8 | {3} 20 | {0,3} | 2 | 0.5 | 1 | 2 | 0.5 | 0 | 2 | 8 | 53 | 1 | t | 70 | 8 | {3} 38 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 2 | 19 | | | t | 80 | 8 | {3,4} 39 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 19 | | | t | 80 | 8 | {3,4} 53 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 20 | | | f | | 8 | 54 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 20 | | | f | | 8 | 9 | {0} | 3 | 0.555555555555556 | 1 | 2 | 0.327160493827161 | 0 | 9 | 0 | 33 | 1 | f | | 9 | 33 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 9 | | | f | | 9 | {3} 34 | {0,2} | 4 | 0.75 | 1 | 2 | 0.375 | 0 | 4 | 9 | 55 | 1 | f | | 9 | {3} 35 | {0,3} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 9 | | | f | | 9 | {3} 55 | {0,2,1} | 2 | 1 | 1 | 2 | 0 | 0 | 3 | 34 | | | t | 96 | 9 | {3,4} 56 | {0,2,2} | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 34 | | | t | 70 | 9 | {3,4} 10 | {0} | 3 | 0.666666666666667 | 1 | 2 | 0.277777777777778 | 0 | 9 | 0 | 21 | 1 | f | | 10 | 21 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 10 | | | f | | 10 | {3} 22 | {0,2} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 10 | | | f | | 10 | {3} 23 | {0,3} | 2 | 0.75 | 1 | 1 | 0.375 | 0 | 4 | 10 | 40 | 1 | t | 70 | 10 | {3} 40 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 1 | 23 | | | f | | 10 | 41 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 23 | | | f | | 10 | (60 rows)
sql> select * from MADlib.rf_display('trained_tree_infogain'); rf_display ----------------------------------------------------------------------------------------------------- Tree 1 Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778) outlook: = overcast : class( Play) num_elements(4) predict_prob(1) outlook: = rain : class( Play) num_elements(2) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(0.666666666666667) humidity: <= 70 : class( Play) num_elements(1) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(2) predict_prob(1) Tree 2 Root Node : class( Do not Play) num_elements(9) predict_prob(0.555555555555556) humidity: <= 65 : class( Play) num_elements(2) predict_prob(1) humidity: > 65 : class( Do not Play) num_elements(7) predict_prob(0.714285714285714) windy: = false : class( Play) num_elements(3) predict_prob(0.666666666666667) outlook: = overcast : class( Play) num_elements(1) predict_prob(1) outlook: = rain : class( Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(1) predict_prob(1) windy: = true : class( Do not Play) num_elements(4) predict_prob(1) Tree 3 Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778) humidity: <= 80 : class( Play) num_elements(6) predict_prob(1) humidity: > 80 : class( Do not Play) num_elements(3) predict_prob(0.666666666666667) humidity: <= 90 : class( Do not Play) num_elements(2) predict_prob(1) humidity: > 90 : class( Play) num_elements(1) predict_prob(1) Tree 4 Root Node : class( Play) num_elements(9) predict_prob(0.888888888888889) windy: = false : class( Play) num_elements(6) predict_prob(1) windy: = true : class( Play) num_elements(3) predict_prob(0.666666666666667) outlook: = overcast : class( Play) num_elements(2) predict_prob(1) outlook: = rain : class( Do not Play) num_elements(1) predict_prob(1) Tree 5 Root Node : class( Play) num_elements(9) predict_prob(0.888888888888889) humidity: <= 90 : class( Play) num_elements(8) predict_prob(1) humidity: > 90 : class( Do not Play) num_elements(1) predict_prob(1) Tree 6 Root Node : class( Play) num_elements(9) predict_prob(0.555555555555556) outlook: = overcast : class( Play) num_elements(3) predict_prob(1) outlook: = rain : class( Play) num_elements(3) predict_prob(0.666666666666667) windy: = false : class( Play) num_elements(2) predict_prob(1) windy: = true : class( Do not Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(1) Tree 7 Root Node : class( Play) num_elements(9) predict_prob(0.666666666666667) windy: = false : class( Play) num_elements(7) predict_prob(0.857142857142857) humidity: <= 80 : class( Play) num_elements(5) predict_prob(1) humidity: > 80 : class( Play) num_elements(2) predict_prob(0.5) humidity: <= 95 : class( Do not Play) num_elements(1) predict_prob(1) humidity: > 95 : class( Play) num_elements(1) predict_prob(1) windy: = true : class( Do not Play) num_elements(2) predict_prob(1) Tree 8 Root Node : class( Play) num_elements(9) predict_prob(0.777777777777778) outlook: = overcast : class( Play) num_elements(4) predict_prob(1) outlook: = rain : class( Play) num_elements(3) predict_prob(0.666666666666667) windy: = false : class( Play) num_elements(2) predict_prob(1) windy: = true : class( Do not Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Play) num_elements(2) predict_prob(0.5) humidity: <= 70 : class( Play) num_elements(1) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(1) predict_prob(1) Tree 9 Root Node : class( Play) num_elements(9) predict_prob(0.555555555555556) outlook: = overcast : class( Play) num_elements(2) predict_prob(1) outlook: = rain : class( Play) num_elements(4) predict_prob(0.75) windy: = false : class( Play) num_elements(3) predict_prob(1) windy: = true : class( Do not Play) num_elements(1) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(3) predict_prob(1) Tree 10 Root Node : class( Play) num_elements(9) predict_prob(0.666666666666667) outlook: = overcast : class( Play) num_elements(1) predict_prob(1) outlook: = rain : class( Play) num_elements(4) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(4) predict_prob(0.75) humidity: <= 70 : class( Play) num_elements(1) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(3) predict_prob(1) (10 rows)
sql> select * from MADlib.rf_classify( 'trained_tree_infogain', -- name of the trained model 'golf_data', -- name of the table containing data to classify 'classification_result'); -- name of the output table input_set_size | classification_time ----------------+--------------------- 14 | 00:00:02.215017 (1 row)
sql> select t.id,t.outlook,t.temperature,t.humidity,t.windy,c.class from classification_result c,golf_data t where t.id=c.id order by id; id | outlook | temperature | humidity | windy | class ----+----------+-------------+----------+--------+-------------- 1 | sunny | 85 | 85 | false | Do not Play 2 | sunny | 80 | 90 | true | Do not Play 3 | overcast | 83 | 78 | false | Play 4 | rain | 70 | 96 | false | Play 5 | rain | 68 | 80 | false | Play 6 | rain | 65 | 70 | true | Do not Play 7 | overcast | 64 | 65 | true | Play 8 | sunny | 72 | 95 | false | Do not Play 9 | sunny | 69 | 70 | false | Play 10 | rain | 75 | 80 | false | Play 11 | sunny | 75 | 70 | true | Do not Play 12 | overcast | 72 | 90 | true | Play 13 | overcast | 81 | 75 | false | Play 14 | rain | 71 | 80 | true | Do not Play (14 rows)
sql> select * from MADlib.rf_score( 'trained_tree_infogain', 'golf_data_validation', 0); rf_score ------------------- 0.928571428571429 (1 row)
testdb=# select MADLIB_SCHEMA.rf_clean('trained_tree_infogain'); rf_clean ---------- t (1 row)
[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[2] http://en.wikipedia.org/wiki/Discretization_of_continuous_features