This module provides an implementation of the C4.5 algorithm [1] for growing decision trees.
The implementation supports multiple split criteria (such as information gain), both continuous and discrete features, missing-value handling, confidence-level-based pruning, and a configurable maximum tree depth.
Run the training algorithm on the source data:
c45_train( split_criterion,
           training_table_name,
           result_tree_table_name,
           validation_table_name,
           continuous_feature_names,
           feature_col_names,
           id_col_name,
           class_col_name,
           confidence_level,
           how2handle_missing_value,
           max_tree_depth,
           node_prune_threshold,
           node_split_threshold,
           verbosity )
Arguments
{TABLE|VIEW} trainingSource (
    ...
    id INT|BIGINT,
    feature1 SUPPORTED_DATA_TYPE,
    feature2 SUPPORTED_DATA_TYPE,
    feature3 SUPPORTED_DATA_TYPE,
    ...
    featureN SUPPORTED_DATA_TYPE,
    class SUPPORTED_DATA_TYPE,
    ...
)

The detailed list of SUPPORTED_DATA_TYPE is: SMALLINT, INT, BIGINT, FLOAT8, REAL, DECIMAL, INET, CIDR, MACADDR, BOOLEAN, CHAR, VARCHAR, TEXT, "char", DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ, and INTERVAL.
The training function stores the learned model in the result tree table, which contains one row per tree node with the following columns: id, tree_location, feature, probability, ebp_coeff, maxclass, scv, live, sample_size, parent_id, lmc_nid, lmc_fval, is_continuous, split_value, tid, dp_ids.
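The tree_location and parent_id columns are enough to recover the shape of the stored tree: tree_location holds the path of branch indices from the root down to the node, and parent_id references the parent row. A minimal sketch (in Python rather than SQL; the row values are transcribed from the worked golf example later in this document, and the helper names are ours, not part of the API):

```python
# Sketch: recover parent/child structure from the tree table.
# Each tuple is (id, tree_location, parent_id), taken from the
# trained_tree_infogain example shown further below.
rows = [
    (1, (0,),      0), (2, (0, 1),    1), (3, (0, 2),    1),
    (4, (0, 3),    1), (5, (0, 2, 1), 3), (6, (0, 2, 2), 3),
    (7, (0, 3, 1), 4), (8, (0, 3, 2), 4),
]

def children(rows, node_id):
    """Return ids of the direct children of node_id (via parent_id)."""
    return [rid for rid, _, pid in rows if pid == node_id and rid != node_id]

def depth(rows, node_id):
    """Depth of a node = length of its tree_location path (root is 1)."""
    loc = next(loc for rid, loc, _ in rows if rid == node_id)
    return len(loc)

print(children(rows, 1))  # direct children of the root node
print(depth(rows, 5))     # a node two levels below the root
```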
The classification function uses the learned model stored by the training function to create the classification results.
c45_classify( tree_table_name, classification_table_name, result_table_name )
Arguments
The name of the table containing the trained model.
The name of the table containing the data to classify. The data is expected to be in the same form as the training data, except that it does not need a class column.
The name of the table in which to store the classification results.
The scoring function evaluates the learned model against a validation set:
c45_score( tree_table_name, validation_table_name, verbosity );
It returns the ratio of correctly classified items in the validation set.
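That ratio is plain accuracy: the number of correct predictions divided by the number of validation rows. A tiny illustration (Python; the function name is ours, not part of the API):

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted class matches the actual class,
    i.e. the kind of ratio c45_score reports."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# 2 of 3 predictions match the actual classes
print(accuracy(["Play", "Play", "Do not Play"],
               ["Play", "Do not Play", "Do not Play"]))
```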
The display tree function displays the learned model in a human-readable format.
c45_display( tree_table_name );
The clean tree function cleans up the learned model and all metadata.
c45_clean( tree_table_name );
SELECT * FROM golf_data ORDER BY id;

Result:

 id | outlook  | temperature | humidity | windy | class
----+----------+-------------+----------+-------+--------------
  1 | sunny    |          85 |       85 | false | Do not Play
  2 | sunny    |          80 |       90 | true  | Do not Play
  3 | overcast |          83 |       78 | false | Play
  4 | rain     |          70 |       96 | false | Play
  5 | rain     |          68 |       80 | false | Play
  6 | rain     |          65 |       70 | true  | Do not Play
  7 | overcast |          64 |       65 | true  | Play
  8 | sunny    |          72 |       95 | false | Do not Play
  9 | sunny    |          69 |       70 | false | Play
 10 | rain     |          75 |       80 | false | Play
 11 | sunny    |          75 |       70 | true  | Play
 12 | overcast |          72 |       90 | true  | Play
 13 | overcast |          81 |       75 | false | Play
 14 | rain     |          71 |       80 | true  | Do not Play
(14 rows)
SELECT * FROM madlib.c45_clean( 'trained_tree_infogain' );
SELECT * FROM madlib.c45_train(
    'infogain',               -- split criterion
    'golf_data',              -- training table
    'trained_tree_infogain',  -- result tree table
    null,                     -- validation table
    'temperature,humidity',   -- continuous feature names
    'outlook,temperature,humidity,windy',  -- feature column names
    'id',                     -- id column
    'class',                  -- class column
    100,                      -- confidence level
    'explicit',               -- missing value handling
    5,                        -- max tree depth
    0.001,                    -- node prune threshold
    0.001,                    -- node split threshold
    0                         -- verbosity
);

Result:

 training_set_size | tree_nodes | tree_depth |  training_time  | split_criterion
-------------------+------------+------------+-----------------+-----------------
                14 |          8 |          3 | 00:00:00.871805 | infogain
(1 row)
SELECT * FROM trained_tree_infogain ORDER BY id;

Result:

 id | tree_location | feature |    probability    | ebp_coeff | maxclass |        scv        | live | sample_size | parent_id | lmc_nid | lmc_fval | is_continuous | split_value
----+---------------+---------+-------------------+-----------+----------+-------------------+------+-------------+-----------+---------+----------+---------------+-------------
  1 | {0}           |       3 | 0.642857142857143 |         1 |        2 | 0.171033941880327 |    0 |          14 |         0 |       2 |        1 | f             |
  2 | {0,1}         |       4 |                 1 |         1 |        2 |                 0 |    0 |           4 |         1 |         |          | f             |
  3 | {0,2}         |       4 |               0.6 |         1 |        2 | 0.673011667009257 |    0 |           5 |         1 |       5 |        1 | f             |
  4 | {0,3}         |       2 |               0.6 |         1 |        1 | 0.673011667009257 |    0 |           5 |         1 |       7 |        1 | t             |          70
  5 | {0,2,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |           3 |         3 |         |          | f             |
  6 | {0,2,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |           2 |         3 |         |          | f             |
  7 | {0,3,1}       |       4 |                 1 |         1 |        2 |                 0 |    0 |           2 |         4 |         |          | f             |
  8 | {0,3,2}       |       4 |                 1 |         1 |        1 |                 0 |    0 |           3 |         4 |         |          | f             |
(8 rows)
SELECT madlib.c45_display('trained_tree_infogain');

Result:

                                     c45_display
--------------------------------------------------------------------------------------
 Tree 1
     Root Node : class( Play)  num_elements(14)  predict_prob(0.642857142857143)
        outlook: = overcast : class( Play)  num_elements(4)  predict_prob(1)
        outlook: = rain : class( Play)  num_elements(5)  predict_prob(0.6)
            windy: = false : class( Play)  num_elements(3)  predict_prob(1)
            windy: = true : class( Do not Play)  num_elements(2)  predict_prob(1)
        outlook: = sunny : class( Do not Play)  num_elements(5)  predict_prob(0.6)
            humidity: <= 70 : class( Play)  num_elements(2)  predict_prob(1)
            humidity: > 70 : class( Do not Play)  num_elements(3)  predict_prob(1)
(1 row)
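The displayed tree puts outlook at the root, which is what the information-gain criterion predicts: partitioning the 14 training rows by outlook reduces class entropy more than any other discrete feature. A sketch of that computation in Python (labels and feature values transcribed from the golf_data table above; function names are ours):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    """Entropy reduction from partitioning labels by a discrete feature."""
    n = len(labels)
    split = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        split += len(subset) / n * entropy(subset)
    return entropy(labels) - split

# class and feature columns from the 14-row golf_data example
classes = ["Do not Play", "Do not Play", "Play", "Play", "Play",
           "Do not Play", "Play", "Do not Play", "Play", "Play",
           "Play", "Play", "Play", "Do not Play"]
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
windy   = ["false", "true", "false", "false", "false", "true", "true",
           "false", "false", "false", "true", "true", "false", "true"]

# outlook yields a larger gain than windy, so it is chosen at the root
print(round(info_gain(classes, outlook), 3))
print(round(info_gain(classes, windy), 3))
```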
SELECT * FROM madlib.c45_classify( 'trained_tree_infogain', 'golf_data', 'classification_result' );

Result:

 input_set_size | classification_time
----------------+---------------------
             14 | 00:00:00.247713
(1 row)
SELECT t.id, t.outlook, t.temperature, t.humidity, t.windy, c.class
FROM madlib.classification_result c, golf_data t
WHERE t.id = c.id
ORDER BY id;

Result:

 id | outlook  | temperature | humidity | windy | class
----+----------+-------------+----------+-------+--------------
  1 | sunny    |          85 |       85 | false | Do not Play
  2 | sunny    |          80 |       90 | true  | Do not Play
  3 | overcast |          83 |       78 | false | Play
  4 | rain     |          70 |       96 | false | Play
  5 | rain     |          68 |       80 | false | Play
  6 | rain     |          65 |       70 | true  | Do not Play
  7 | overcast |          64 |       65 | true  | Play
  8 | sunny    |          72 |       95 | false | Do not Play
  9 | sunny    |          69 |       70 | false | Play
 10 | rain     |          75 |       80 | false | Play
 11 | sunny    |          75 |       70 | true  | Play
 12 | overcast |          72 |       90 | true  | Play
 13 | overcast |          81 |       75 | false | Play
 14 | rain     |          71 |       80 | true  | Do not Play
(14 rows)
SELECT * FROM madlib.c45_score( 'trained_tree_infogain', 'golf_data_validation', 0 );

Result:

 c45_score
-----------
         1
(1 row)
SELECT madlib.c45_clean( 'trained_tree_infogain' );

Result:

 c45_clean
-----------

(1 row)
[1] http://en.wikipedia.org/wiki/C4.5_algorithm