User Documentation
c45.sql_in File Reference

C4.5 APIs and main controller, written in PL/pgSQL.

Functions

c45_train_result c45_train (text split_criterion, text training_table_name, text result_tree_table_name, text validation_table_name, text continuous_feature_names, text feature_col_names, text id_col_name, text class_col_name, float confidence_level, text how2handle_missing_value, int max_tree_depth, float node_prune_threshold, float node_split_threshold, int verbosity)
 Long-form API: trains a decision tree with every parameter specified explicitly.
 
c45_train_result c45_train (text split_criterion, text training_table_name, text result_tree_table_name, text validation_table_name, text continuous_feature_names, text feature_col_names, text id_col_name, text class_col_name, float confidence_level, text how2handle_missing_value)
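 Short-form API: trains a C4.5 decision tree using default values for max_tree_depth, node_prune_threshold, node_split_threshold, and verbosity.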
 
c45_train_result c45_train (text split_criterion, text training_table_name, text result_tree_table_name)
 Shortest-form API: trains a C4.5 decision tree using default values for everything except the split criterion and the two table names.
 
set< text > c45_genrule (text tree_table_name, int verbosity)
 Displays the trained decision tree model as a set of rules.
 
set< text > c45_genrule (text tree_table_name)
 Displays the trained decision tree model as a set of rules.
 
set< text > c45_display (text tree_table, int max_depth)
 Displays the trained decision tree model in a human-readable format, down to max_depth levels.
 
set< text > c45_display (text tree_table)
 Displays the whole trained decision tree model in a human-readable format.
 
c45_classify_result c45_classify (text tree_table_name, text classification_table_name, text result_table_name, int verbosity)
 Classifies a dataset using a trained decision tree model. The classification result is stored in a table defined as: CREATE TABLE classification_result ( id INT|BIGINT, class SUPPORTED_DATA_TYPE, prob FLOAT );
 
c45_classify_result c45_classify (text tree_table_name, text classification_table_name, text result_table_name)
 Classifies a dataset using a trained decision tree model, running in quiet mode. The classification result is stored in the table named by result_table_name.
 
float8 c45_score (text tree_table_name, text scoring_table_name, int verbosity)
 Checks the accuracy of the decision tree model.
 
float8 c45_score (text tree_table_name, text scoring_table_name)
 Checks the accuracy of the decision tree model.
 
boolean c45_clean (text result_tree_table_name)
 Cleans up the trained tree table and any related tables.
 

Detailed Description

Date
April 5, 2012
See Also
For a brief introduction to decision trees, see the module description Decision Tree.

Definition in file c45.sql_in.

Function Documentation

c45_classify_result c45_classify ( text  tree_table_name,
text  classification_table_name,
text  result_table_name,
int  verbosity 
)
Parameters
tree_table_name: The name of the trained tree.
classification_table_name: The name of the table/view with the source data.
result_table_name: The name of the result table.
verbosity: A value greater than 0 runs this function in verbose mode.
Returns
A c45_classify_result object.

Definition at line 1020 of file c45.sql_in.

c45_classify_result c45_classify ( text  tree_table_name,
text  classification_table_name,
text  result_table_name 
)
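Classifies a dataset using a trained decision tree model, running in quiet mode. The classification result is stored in a table defined as:
     CREATE TABLE classification_result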
     (
         id        INT|BIGINT,
         class     SUPPORTED_DATA_TYPE,
         prob      FLOAT
     ); 
Parameters
tree_table_name: The name of the trained tree.
classification_table_name: The name of the table/view with the source data.
result_table_name: The name of the result table.
Returns
A c45_classify_result object.

Definition at line 1131 of file c45.sql_in.
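As a usage sketch for the two c45_classify forms above, the following classifies a table in quiet mode and then reads back the stored predictions. The table names golf_tree, golf_new, and golf_predictions are hypothetical, the functions are assumed to be reachable on the current search_path, and the result table is assumed to be created by the call with the documented id, class, and prob columns.

```sql
-- Hypothetical names: golf_tree (trained tree table), golf_new (rows to classify),
-- golf_predictions (result table, assumed to be created by the call).
SELECT * FROM c45_classify('golf_tree', 'golf_new', 'golf_predictions');

-- Inspect the stored predictions using the documented result columns.
SELECT id, class, prob
FROM golf_predictions
ORDER BY id
LIMIT 10;
```

Passing a fourth argument greater than 0 (for example 1) selects the verbose overload instead.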

boolean c45_clean ( text  result_tree_table_name)
Parameters
result_tree_table_name: The name of the table containing the tree's information.
Returns
The status of the cleanup operation.

Definition at line 1225 of file c45.sql_in.
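A minimal call might look like the sketch below. The tree table name golf_tree is hypothetical and stands for a table produced by an earlier c45_train call. One plausible use is to clean up before training again into the same result table name.

```sql
-- golf_tree is a hypothetical tree table name from an earlier c45_train call.
-- The boolean result reports the status of the cleanup.
SELECT c45_clean('golf_tree');
```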

set<text> c45_display ( text  tree_table,
int  max_depth 
)
Parameters
tree_table: The name of the table containing the tree's information.
max_depth: The maximum depth to display. If NULL, all levels are shown.
Returns
Text representing the tree in a human-readable format.

Definition at line 936 of file c45.sql_in.

set<text> c45_display ( text  tree_table)
Parameters
tree_table: The name of the table containing the tree's information.
Returns
Text representing the tree in a human-readable format.

Definition at line 985 of file c45.sql_in.
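The two c45_display overloads can be exercised as in this sketch. The tree table name golf_tree is hypothetical, and the depth of 3 is an arbitrary illustration value.

```sql
-- Show only the top three levels of the hypothetical tree table golf_tree.
SELECT * FROM c45_display('golf_tree', 3);

-- Show the whole tree; per the parameter description, a NULL max_depth
-- in the two-argument form also shows all levels.
SELECT * FROM c45_display('golf_tree');
```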

set<text> c45_genrule ( text  tree_table_name,
int  verbosity 
)
Parameters
tree_table_name: The name of the table containing the tree's information.
verbosity: A value of 1 or greater runs this function in verbose mode.
Returns
The rule representation text for a decision tree.

Definition at line 620 of file c45.sql_in.

set<text> c45_genrule ( text  tree_table_name)
Parameters
tree_table_name: The name of the table containing the tree's information.
Returns
The rule representation text for a decision tree.

Definition at line 904 of file c45.sql_in.
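A sketch of both c45_genrule overloads follows. The tree table name golf_tree is hypothetical. Because the functions return a set of text values, selecting from them yields one row per line of output.

```sql
-- Rule representation of the hypothetical tree table golf_tree.
SELECT * FROM c45_genrule('golf_tree');

-- Same call through the two-argument overload, with verbose mode enabled.
SELECT * FROM c45_genrule('golf_tree', 1);
```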

float8 c45_score ( text  tree_table_name,
text  scoring_table_name,
int  verbosity 
)
Parameters
tree_table_name: The name of the trained tree.
scoring_table_name: The name of the table/view with the source data.
verbosity: A value greater than 0 runs this function in verbose mode.
Returns
The estimated accuracy information.

Definition at line 1166 of file c45.sql_in.

float8 c45_score ( text  tree_table_name,
text  scoring_table_name 
)
Parameters
tree_table_name: The name of the trained tree.
scoring_table_name: The name of the table/view with the source data.
Returns
The estimated accuracy information.

Definition at line 1196 of file c45.sql_in.
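Scoring can be sketched as below. The names golf_tree and golf_holdout are hypothetical, and the scoring table is assumed to carry the true class labels alongside the features so that accuracy can be measured.

```sql
-- Accuracy of the hypothetical tree golf_tree on a labeled holdout table.
SELECT c45_score('golf_tree', 'golf_holdout') AS accuracy;

-- Same check through the three-argument overload, with verbose mode enabled.
SELECT c45_score('golf_tree', 'golf_holdout', 1) AS accuracy;
```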

c45_train_result c45_train ( text  split_criterion,
text  training_table_name,
text  result_tree_table_name,
text  validation_table_name,
text  continuous_feature_names,
text  feature_col_names,
text  id_col_name,
text  class_col_name,
float  confidence_level,
text  how2handle_missing_value,
int  max_tree_depth,
float  node_prune_threshold,
float  node_split_threshold,
int  verbosity 
)
Parameters
split_criterion: The name of the split criterion used for tree construction. Valid values are 'infogain', 'gainratio', and 'gini'. It cannot be NULL. Information gain (infogain) and Gini index (gini) are biased toward multivalued attributes. Gain ratio (gainratio) adjusts for this bias, but it tends to prefer unbalanced splits in which one partition is much smaller than the others.
training_table_name: The name of the table/view with the source data.
result_tree_table_name: The name of the table where the resulting decision tree (DT) will be kept.
validation_table_name: The name of the table/view that contains the validation set used for tree pruning. The default is NULL, in which case no tree pruning is performed.
continuous_feature_names: A comma-separated list of the names of features whose values are continuous. The default is NULL, which means there are no continuous features in the training table.
feature_col_names: A comma-separated list of the names of table columns, each of which defines a feature. The default is NULL, which means all columns in the training table, except those named 'id' and 'class', are used as features.
id_col_name: The name of the column containing an ID for each record.
class_col_name: The name of the column containing the labeled class.
confidence_level: A statistical confidence interval of the resubstitution error.
how2handle_missing_value: How to handle missing values. Valid values are 'explicit' and 'ignore'.
max_tree_depth: The maximum number of levels in the resulting DT, used to avoid overgrown trees.
node_prune_threshold: The minimum percentage of records required in a child node, given as a fraction in the range [0.0, 1.0]. It cannot be NULL. This threshold applies only to non-root nodes. Therefore, if its value is 1, the trained tree has only one node (the root node); if its value is 0, no nodes are pruned by this parameter.
node_split_threshold: The minimum percentage of records required in a node for a further split to be possible, given as a fraction in the range [0.0, 1.0]. It cannot be NULL. If its value is 1, the trained tree has only two levels, since only the root node can grow; if its value is 0, trees can grow extensively.
verbosity: A value greater than 0 runs this function in verbose mode.
Returns
A c45_train_result object.

Definition at line 369 of file c45.sql_in.
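To make the argument order of the long form concrete, here is a sketch of a full call. The table names (golf_data, golf_tree) and column names (outlook, temperature, humidity, windy) are hypothetical. The numeric values are the defaults documented for the shorter forms, not tuned recommendations.

```sql
SELECT * FROM c45_train(
    'infogain',                            -- split_criterion
    'golf_data',                           -- training_table_name (hypothetical)
    'golf_tree',                           -- result_tree_table_name (hypothetical)
    NULL,                                  -- validation_table_name: NULL means no pruning
    'temperature,humidity',                -- continuous_feature_names (hypothetical columns)
    'outlook,temperature,humidity,windy',  -- feature_col_names (hypothetical columns)
    'id',                                  -- id_col_name
    'class',                               -- class_col_name
    25,                                    -- confidence_level
    'explicit',                            -- how2handle_missing_value
    10,                                    -- max_tree_depth
    0.001,                                 -- node_prune_threshold, in [0.0, 1.0]
    0.01,                                  -- node_split_threshold, in [0.0, 1.0]
    0                                      -- verbosity: 0 means quiet
);
```

Commenting each positional argument is a deliberate choice here: with fourteen parameters, several of them text, a misplaced value is easy to miss.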

c45_train_result c45_train ( text  split_criterion,
text  training_table_name,
text  result_tree_table_name,
text  validation_table_name,
text  continuous_feature_names,
text  feature_col_names,
text  id_col_name,
text  class_col_name,
float  confidence_level,
text  how2handle_missing_value 
)
Parameters
split_criterion: The name of the split criterion used for tree construction. Valid values are 'infogain', 'gainratio', and 'gini'.
training_table_name: The name of the table/view with the source data.
result_tree_table_name: The name of the table where the resulting decision tree (DT) will be kept.
validation_table_name: The name of the table/view that contains the validation set used for tree pruning. The default is NULL, in which case no tree pruning is performed.
continuous_feature_names: A comma-separated list of the names of features whose values are continuous. The default is NULL, which means there are no continuous features in the training table.
feature_col_names: A comma-separated list of the names of table columns, each of which defines a feature. The default is NULL, which means all columns in the training table, except those named 'id' and 'class', are used as features.
id_col_name: The name of the column containing an ID for each record.
class_col_name: The name of the column containing the labeled class.
confidence_level: A statistical confidence interval of the resubstitution error.
how2handle_missing_value: How to handle missing values. Valid values are 'explicit' and 'ignore'.
Returns
A c45_train_result object.
Note
This calls the long form of C45 with the following default parameters:
  • max_tree_depth := 10
  • node_prune_threshold := 0.001
  • node_split_threshold := 0.01
  • verbosity := 0

Definition at line 516 of file c45.sql_in.
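A sketch of the ten-argument form follows, this time supplying a validation table so that pruning can take place. All table and column names (golf_data, golf_tree_pruned, golf_validation, temperature, humidity) are hypothetical.

```sql
SELECT * FROM c45_train(
    'gainratio',             -- split_criterion
    'golf_data',             -- training_table_name (hypothetical)
    'golf_tree_pruned',      -- result_tree_table_name (hypothetical)
    'golf_validation',       -- validation_table_name (hypothetical), used for pruning
    'temperature,humidity',  -- continuous_feature_names (hypothetical columns)
    NULL,                    -- feature_col_names: NULL means all columns except id and class
    'id',                    -- id_col_name
    'class',                 -- class_col_name
    25,                      -- confidence_level
    'explicit'               -- how2handle_missing_value
);
```

The four omitted parameters take the defaults listed in the note above.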

c45_train_result c45_train ( text  split_criterion,
text  training_table_name,
text  result_tree_table_name 
)
Parameters
split_criterion: The name of the split criterion used for tree construction. Valid values are 'infogain', 'gainratio', and 'gini'.
training_table_name: The name of the table/view with the source data.
result_tree_table_name: The name of the table where the resulting decision tree (DT) will be kept.
Returns
A c45_train_result object.
Note
This calls the above short form of C45 with the following default parameters:
  • validation_table_name := NULL
  • continuous_feature_names := NULL
  • id_col_name := 'id'
  • class_col_name := 'class'
  • confidence_level := 25
  • how2handle_missing_value := 'explicit'
  • max_tree_depth := 10
  • node_prune_threshold := 0.001
  • node_split_threshold := 0.01
  • verbosity := 0

Definition at line 582 of file c45.sql_in.
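The three-argument form reduces to a single short call, sketched here with hypothetical table names. Given the defaults listed in the note, the training table is assumed to have columns literally named id and class, and every feature is treated as non-continuous because continuous_feature_names defaults to NULL.

```sql
-- golf_data and golf_tree_gini are hypothetical table names.
-- Relies on the documented defaults: id column 'id', class column 'class',
-- no validation table (no pruning), and no continuous features.
SELECT * FROM c45_train('gini', 'golf_data', 'golf_tree_gini');
```

If the training data has continuous features or differently named ID and class columns, one of the longer forms is the safer choice.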