MADlib
1.0 A newer version is available
User Documentation
|
This module provides an implementation of the C4.5 implementation to grow decision trees.
The implementation supports:
The training data is expected to be of the following form:
{TABLE|VIEW} trainingSource ( ... id INT|BIGINT, feature1 SUPPORTED_DATA_TYPE, feature2 SUPPORTED_DATA_TYPE, feature3 SUPPORTED_DATA_TYPE, .................... featureN SUPPORTED_DATA_TYPE, class SUPPORTED_DATA_TYPE, ... )
The detailed list of SUPPORTED_DATA_TYPE is: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.
The data to classify is expected to be of the same form as training data, except that it does not need a class column.
SELECT * FROM c45_train( 'split_criterion', 'training_table_name', 'result_tree_table_name', 'validation_table_name', 'continuous_feature_names', 'feature_col_names', 'id_col_name', 'class_col_name', 'confidence_level', 'how2handle_missing_value' 'max_tree_depth', 'node_prune_threshold', 'node_split_threshold' 'verbosity');This will create the decision tree output table storing an abstract object (representing the model) used for further classification. Column names:
id | tree_location | feature | probability | ebp_coeff | maxclass | scv | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value | tid | dp_ids ----+---------------+---------+-------------------+------------------+----------+-------------------+------+-----------+-----------+---------+----------+-----------------+-------------+-----+-------- ...
SELECT * FROM c45_classify( 'tree_table_name', 'classification_table_name', 'result_table_name');This will create the result_table with the classification results.
SELECT * FROM c45_score( 'tree_table_name', 'validation_table_name', 'verbosity');This will give a ratio of correctly classified items in the validation set.
SELECT * FROM c45_display( 'tree_table_name');This will display the trained tree in human readable format.
SELECT * FROM c45_clean( 'tree_table_name');This will clean up the learned model and all metadata.
sql> select * from golf_data order by id; id | outlook | temperature | humidity | windy | class ----+----------+-------------+----------+--------+-------------- 1 | sunny | 85 | 85 | false | Do not Play 2 | sunny | 80 | 90 | true | Do not Play 3 | overcast | 83 | 78 | false | Play 4 | rain | 70 | 96 | false | Play 5 | rain | 68 | 80 | false | Play 6 | rain | 65 | 70 | true | Do not Play 7 | overcast | 64 | 65 | true | Play 8 | sunny | 72 | 95 | false | Do not Play 9 | sunny | 69 | 70 | false | Play 10 | rain | 75 | 80 | false | Play 11 | sunny | 75 | 70 | true | Play 12 | overcast | 72 | 90 | true | Play 13 | overcast | 81 | 75 | false | Play 14 | rain | 71 | 80 | true | Do not Play (14 rows)
sql> SELECT * FROM MADlib.c45_clean('trained_tree_infogain'); sql> SELECT * FROM MADlib.c45_train( 'infogain', -- split criterion_name 'golf_data', -- input table name 'trained_tree_infogain', -- result tree name null, -- validation table name 'temperature,humidity', -- continuous feature names 'outlook,temperature,humidity,windy', -- feature column names 'id', -- id column name 'class', -- class column name 100, -- confidence level 'explicit', -- missing value preparation 5, -- max tree depth 0.001, -- min percent mode 0.001, -- min percent split 0); -- verbosity training_set_size | tree_nodes | tree_depth | training_time | split_criterion -------------------+------------+------------+-----------------+----------------- 14 | 8 | 3 | 00:00:00.871805 | infogain (1 row)
sql> select * from trained_tree_infogain order by id; id | tree_location | feature | probability | ebp_coeff | maxclass | scv | live |sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value ----+---------------+---------+-------------------+-----------+----------+-------------------+------+----------+-----------+---------+----------+-----------------+------------- 1 | {0} | 3 | 0.642857142857143 | 1 | 2 | 0.171033941880327 | 0 | 14 | 0 | 2 | 1 | f | 2 | {0,1} | 4 | 1 | 1 | 2 | 0 | 0 | 4 | 1 | | | f | 3 | {0,2} | 4 | 0.6 | 1 | 2 | 0.673011667009257 | 0 | 5 | 1 | 5 | 1 | f | 4 | {0,3} | 2 | 0.6 | 1 | 1 | 0.673011667009257 | 0 | 5 | 1 | 7 | 1 | t | 70 5 | {0,2,1} | 4 | 1 | 1 | 2 | 0 | 0 | 3 | 3 | | | f | 6 | {0,2,2} | 4 | 1 | 1 | 1 | 0 | 0 | 2 | 3 | | | f | 7 | {0,3,1} | 4 | 1 | 1 | 2 | 0 | 0 | 2 | 4 | | | f | 8 | {0,3,2} | 4 | 1 | 1 | 1 | 0 | 0 | 3 | 4 | | | f | (8 rows)
sql> select MADlib.c45_display('trained_tree_infogain'); c45_display --------------------------------------------------------------------------------------- Tree 1 Root Node : class( Play) num_elements(14) predict_prob(0.642857142857143) outlook: = overcast : class( Play) num_elements(4) predict_prob(1) outlook: = rain : class( Play) num_elements(5) predict_prob(0.6) windy: = false : class( Play) num_elements(3) predict_prob(1) windy: = true : class( Do not Play) num_elements(2) predict_prob(1) outlook: = sunny : class( Do not Play) num_elements(5) predict_prob(0.6) humidity: <= 70 : class( Play) num_elements(2) predict_prob(1) humidity: > 70 : class( Do not Play) num_elements(3) predict_prob(1) (1 row)
sql> select * from MADlib.c45_classify 'trained_tree_infogain', -- name of the trained model 'golf_data', -- name of the table containing data to classify 'classification_result'); -- name of the output table input_set_size | classification_time ----------------+----------------- 14 | 00:00:00.247713 (1 row)
sql> select t.id,t.outlook,t.temperature,t.humidity,t.windy,c.class from MADlib.classification_result c,golf_data t where t.id=c.id order by id; id | outlook | temperature | humidity | windy | class ----+----------+-------------+----------+--------+-------------- 1 | sunny | 85 | 85 | false | Do not Play 2 | sunny | 80 | 90 | true | Do not Play 3 | overcast | 83 | 78 | false | Play 4 | rain | 70 | 96 | false | Play 5 | rain | 68 | 80 | false | Play 6 | rain | 65 | 70 | true | Do not Play 7 | overcast | 64 | 65 | true | Play 8 | sunny | 72 | 95 | false | Do not Play 9 | sunny | 69 | 70 | false | Play 10 | rain | 75 | 80 | false | Play 11 | sunny | 75 | 70 | true | Play 12 | overcast | 72 | 90 | true | Play 13 | overcast | 81 | 75 | false | Play 14 | rain | 71 | 80 | true | Do not Play (14 rows)
sql> select * from MADlib.c45_score( 'trained_tree_infogain', 'golf_data_validation', 0); c45_score ----------- 1 (1 row)
testdb=# select MADLIB_SCHEMA.c45_clean('trained_tree_infogain'); c45_clean ----------- (1 row)
[1] http://en.wikipedia.org/wiki/C4.5_algorithm