User Documentation
rf.sql_in File Reference

Random forest (RF) APIs and main control logic written in PL/pgSQL.


Functions

rf_train_result rf_train (text split_criterion, text training_table_name, text result_rf_table_name, int num_trees, int features_per_node, float sampling_percentage, text continuous_feature_names, text feature_col_names, text id_col_name, text class_col_name, text how2handle_missing_value, int max_tree_depth, float node_prune_threshold, float node_split_threshold, int verbosity)
 Trains a random forest (RF). The function exposes a number of parameters that give flexible control over how the RF is generated. It builds the RF from a training set stored in a database table, each row of which defines a set of features, an ID, and a labeled class. Features can be either discrete or continuous. All the decision trees (DTs) of the resulting RF are kept in a single table.
 
rf_train_result rf_train (text split_criterion, text training_table_name, text result_rf_table_name)
 Short form of the training API, provided for convenience. It needs only the split criterion name, the name of the table that holds the training data, and the name of the table where the trained RF should be stored. All other parameters of the full form take their default values.
 
set< text > rf_display (text rf_table_name, int[] tree_id, int max_depth)
 Display the trees in the random forest in a human-readable format.
 
set< text > rf_display (text rf_table_name, int[] tree_id)
 Display the trees in the random forest in a human-readable format. This function displays all the levels of the specified trees.
 
set< text > rf_display (text rf_table_name)
 Display the trees in the random forest in a human-readable format. This function displays all the levels of all trees in the RF.
 
rf_classify_result rf_classify (text rf_table_name, text classification_table_name, text result_table_name, boolean is_serial_classification, int verbosity)
 Classify a dataset using a trained RF.
 
rf_classify_result rf_classify (text rf_table_name, text classification_table_name, text result_table_name, int verbosity)
 Classify a dataset using a trained RF. This function does the same thing as the full version defined above, except that it only uses parallel classification.
 
rf_classify_result rf_classify (text rf_table_name, text classification_table_name, text result_table_name)
 Classify a dataset using a trained RF. This function does the same thing as the full version defined above, except that it only uses parallel classification and runs in quiet mode.
 
float8 rf_score (text rf_table_name, text scoring_table_name, int verbosity)
 Check the accuracy of a trained RF with a scoring set.
 
float8 rf_score (text rf_table_name, text scoring_table_name)
 Check the accuracy of a trained RF with a scoring set in quiet mode.
 
boolean rf_clean (text rf_table_name)
 Clean up the trained random forest table and any related tables.
 

Detailed Description

Date
April 5, 2012

Definition in file rf.sql_in.

Function Documentation

rf_classify_result rf_classify ( text  rf_table_name,
text  classification_table_name,
text  result_table_name,
boolean  is_serial_classification,
int  verbosity 
)

The classification result is stored in a table defined as:

    CREATE TABLE classification_result
    (
        id    INT|BIGINT,
        class SUPPORTED_DATA_TYPE,
        prob  FLOAT
    );

Parameters
rf_table_name: The name of the RF table. It can't be NULL.
classification_table_name: The name of the table/view that holds the data to be classified. It can't be NULL and must exist.
result_table_name: The name of the result table. It can't be NULL and must exist.
is_serial_classification: Whether to classify serially, one tree at a time, instead of with all trees at once. It can't be NULL.
verbosity: A value greater than 0 means this function runs in verbose mode. It can't be NULL.
Returns
An rf_classify_result object.

Definition at line 886 of file rf.sql_in.
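
The result table schema above gives each classified record one class and one prob value. As a rough illustration only, the Python sketch below combines per-tree predictions for a single record by majority vote and reports the winning class together with the fraction of trees that voted for it. Reading prob as a vote fraction is an assumption, not something this reference states, and the function name aggregate_votes and the sample votes are hypothetical.

```python
from collections import Counter

def aggregate_votes(tree_predictions):
    """Combine per-tree class predictions for one record by majority vote.

    Returns (class, prob), where prob is the share of trees that voted for
    the winning class. This vote-share reading of `prob` is an assumption.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    counts = Counter(tree_predictions)
    # most_common(1) returns the single most frequent class and its count.
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(tree_predictions)

# Hypothetical votes from a 5-tree forest for one record.
label, prob = aggregate_votes(["yes", "no", "yes", "yes", "no"])
print(label, prob)
```

A serial run would collect these votes one tree at a time, while a parallel run would gather them from all trees at once; the aggregation step is the same in both cases.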

rf_classify_result rf_classify ( text  rf_table_name,
text  classification_table_name,
text  result_table_name,
int  verbosity 
)
Parameters
rf_table_name: The name of the RF table. It can't be NULL.
classification_table_name: The name of the table/view that holds the data to be classified. It can't be NULL and must exist.
result_table_name: The name of the result table. It can't be NULL and must exist.
verbosity: A value greater than 0 means this function runs in verbose mode. It can't be NULL.
Returns
An rf_classify_result object.

Definition at line 1006 of file rf.sql_in.

rf_classify_result rf_classify ( text  rf_table_name,
text  classification_table_name,
text  result_table_name 
)
Parameters
rf_table_name: The name of the RF table. It can't be NULL.
classification_table_name: The name of the table/view that holds the data to be classified. It can't be NULL and must exist.
result_table_name: The name of the result table. It can't be NULL and must exist.
Returns
An rf_classify_result object.

Definition at line 1044 of file rf.sql_in.

boolean rf_clean ( text  rf_table_name)
Parameters
rf_table_name: The name of the RF table. It can't be NULL.
Returns
The status of the cleanup operation.

Definition at line 1127 of file rf.sql_in.

set<text> rf_display ( text  rf_table_name,
int[]  tree_id,
int  max_depth 
)
Parameters
rf_table_name: The name of the RF table. It can't be NULL and must exist.
tree_id: The trees to be displayed. If it's NULL, all the trees are displayed.
max_depth: The maximum depth to be displayed. If it's NULL, this function shows all levels.
Returns
Text representing the trees in the random forest in a human-readable format.

Definition at line 724 of file rf.sql_in.

set<text> rf_display ( text  rf_table_name,
int[]  tree_id 
)
Parameters
rf_table_name: The name of the RF table. It can't be NULL and must exist.
tree_id: The trees to be displayed. If it's NULL, all the trees are displayed.
Returns
Text representing the trees in the random forest in a human-readable format.

Definition at line 817 of file rf.sql_in.

set<text> rf_display ( text  rf_table_name)
Parameters
rf_table_name: The name of the RF table. It can't be NULL and must exist.
Returns
Text representing the trees in the random forest in a human-readable format.

Definition at line 845 of file rf.sql_in.

float8 rf_score ( text  rf_table_name,
text  scoring_table_name,
int  verbosity 
)
Parameters
rf_table_name: The name of the RF table. It can't be NULL.
scoring_table_name: The name of the table/view that holds the data to be scored. It can't be NULL and must exist.
verbosity: A value greater than 0 means this function runs in verbose mode. It can't be NULL.
Returns
The estimated accuracy information.

Definition at line 1079 of file rf.sql_in.
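
As a hedged illustration of what an accuracy check against a scoring set involves, the sketch below computes the fraction of records whose predicted class matches the known class. Whether rf_score returns exactly this quantity is an assumption; the function accuracy and the sample labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of records whose predicted class equals the known class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("scoring set is empty")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical scoring set of four records; three are classified correctly.
score = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])
print(score)
```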

float8 rf_score ( text  rf_table_name,
text  scoring_table_name 
)
Parameters
rf_table_name: The name of the RF table. It can't be NULL.
scoring_table_name: The name of the table/view that holds the data to be scored. It can't be NULL and must exist.
Returns
The estimated accuracy information.

Definition at line 1107 of file rf.sql_in.

rf_train_result rf_train ( text  split_criterion,
text  training_table_name,
text  result_rf_table_name,
int  num_trees,
int  features_per_node,
float  sampling_percentage,
text  continuous_feature_names,
text  feature_col_names,
text  id_col_name,
text  class_col_name,
text  how2handle_missing_value,
int  max_tree_depth,
float  node_prune_threshold,
float  node_split_threshold,
int  verbosity 
)

Continuous features are discretized on local regions during training, rather than on the whole dataset before training, because local discretization takes context sensitivity into account.

Parameters
split_criterion: The name of the split criterion to use for tree construction. The valid values are 'infogain', 'gainratio', and 'gini'. It can't be NULL. Information gain (infogain) and Gini index (gini) are biased toward multivalued attributes. Gain ratio (gainratio) adjusts for this bias, but it tends to prefer unbalanced splits in which one partition is much smaller than the others.
training_table_name: The name of the table/view with the training data. It can't be NULL and must exist.
result_rf_table_name: The name of the table where the resulting trees will be stored. It can't be NULL and must not exist.
num_trees: The number of trees to be trained. If it's NULL, 10 will be used.
features_per_node: The number of features to be considered when finding a best split. If it's NULL, sqrt(p), where p is the number of features, will be used.
sampling_percentage: The percentage of records sampled to train a tree. If it's NULL, 0.632 bootstrap will be used.
continuous_feature_names: A comma-separated list of the names of the features whose values are continuous. NULL means there are no continuous features.
feature_col_names: A comma-separated list of names of the table columns, each of which defines a feature. NULL means all the columns except the ID and class columns will be treated as features.
id_col_name: The name of the column containing the ID of each record. It can't be NULL.
class_col_name: The name of the column containing the correct class of each record. It can't be NULL.
how2handle_missing_value: The way to handle missing values. The valid values are 'explicit' and 'ignore'. It can't be NULL.
max_tree_depth: The maximum tree depth. It can't be NULL.
node_prune_threshold: The minimum percentage of the number of records required in a child node. It can't be NULL. Its range is [0.0, 1.0]. This threshold applies only to non-root nodes. Therefore, if the percentage p between the sampled training set size of a tree (the number of rows) and the total training set size is less than or equal to the value of this parameter, the tree has only one node (the root node). If its value is 1, p is necessarily less than or equal to it, so the tree has only one node (the root node). If its value is 0, no nodes will be pruned by this parameter.
node_split_threshold: The minimum percentage of the number of records required in a node in order for a further split to be possible. It can't be NULL. Its range is [0.0, 1.0]. If the percentage p between the sampled training set size of a tree (the number of rows) and the total training set size is less than the value of this parameter, the root node will be a leaf, so the trained tree has only one node (the root node). If p is equal to the value of this parameter, the trained tree has only two levels, since only the root node will grow. If its value is 0, trees can grow extensively.
verbosity: A value greater than 0 means this function runs in verbose mode. It can't be NULL.
Returns
An rf_train_result object.

Definition at line 540 of file rf.sql_in.
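
The trade-off between the three split_criterion values can be made concrete with the textbook definitions of the measures. The sketch below is not taken from rf.sql_in; it uses the standard formulas for entropy-based information gain, gain ratio, and Gini impurity reduction on discrete features, with hypothetical helper names and toy data, to show how a many-valued feature is treated by each measure.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group class labels by the discrete feature value of each record."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def info_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of the partitions."""
    n = len(labels)
    parts = partition(values, labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts)

def gain_ratio(values, labels):
    """Information gain divided by the split information of the feature."""
    split_info = entropy(values)  # entropy of the feature's own value distribution
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_gain(values, labels):
    """Parent Gini impurity minus the size-weighted impurity of the partitions."""
    n = len(labels)
    parts = partition(values, labels)
    return gini(labels) - sum(len(p) / n * gini(p) for p in parts)

# Hypothetical data: `record_id` is unique per row, `binary` has two values.
labels = ["y", "y", "n", "n"]
record_id = [1, 2, 3, 4]
binary = ["a", "a", "b", "b"]

for name, feature in (("record_id", record_id), ("binary", binary)):
    print(name, info_gain(feature, labels), gain_ratio(feature, labels),
          gini_gain(feature, labels))
```

On this toy data both features separate the classes perfectly, so information gain and Gini reduction cannot tell them apart, while gain ratio scores the unique-valued feature lower because its split information is larger.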

rf_train_result rf_train ( text  split_criterion,
text  training_table_name,
text  result_rf_table_name 
)
Parameters
split_criterion: The split criterion used for tree construction. The valid values are infogain, gainratio, and gini. It can't be NULL.
training_table_name: The name of the table/view with the training data. It can't be NULL and must exist.
result_rf_table_name: The name of the table where the resulting trees will be stored. It can't be NULL and must not exist.
Returns
An rf_train_result object.

Definition at line 665 of file rf.sql_in.
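
The short form leaves num_trees, features_per_node, and sampling_percentage at the defaults described for the full form: 10 trees, sqrt(p) features per node, and the 0.632 bootstrap. The sketch below illustrates the two numeric defaults. How sqrt(p) is rounded to an integer is not stated in this reference, so the raw square root is returned. The term "0.632 bootstrap" is read here in the usual statistical sense, as the expected fraction of distinct records when n records are drawn with replacement from n; that reading and all helper names are assumptions.

```python
import math
import random

def default_features_per_node(num_features):
    """sqrt(p) for p features; the rounding rule is not given in the source."""
    return math.sqrt(num_features)

def expected_unique_fraction(n):
    """Expected share of distinct records in n draws with replacement from n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def bootstrap_unique_fraction(n, rng):
    """Empirical share of distinct records in one bootstrap sample of size n."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
print(default_features_per_node(16))
print(expected_unique_fraction(10_000))
print(bootstrap_unique_fraction(10_000, rng))
```

As n grows, the expected fraction approaches 1 - 1/e, which is where the 0.632 figure comes from.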