Random forest APIs and main control logic written in PL/pgSQL.

Functions
rf_train_result	rf_train (text split_criterion, text training_table_name, text result_rf_table_name, int num_trees, int features_per_node, float sampling_percentage, text continuous_feature_names, text feature_col_names, text id_col_name, text class_col_name, text how2handle_missing_value, int max_tree_depth, float node_prune_threshold, float node_split_threshold, int verbosity)
	Train a random forest (RF). The full form exposes a number of parameters that give finer control over how the RF is generated. It builds the RF from a training set stored in a database table, each row of which defines a set of features, an ID, and a class label. Features can be either discrete or continuous. All the decision trees (DTs) of the resulting RF are kept in a single table.

rf_train_result	rf_train (text split_criterion, text training_table_name, text result_rf_table_name)
	Train a random forest (short form). This form needs only the split criterion name, the name of the table where the training data is kept, and the name of the table where the trained RF should be kept. All other parameters of the full form take their default values.

set< text >	rf_display (text rf_table_name, int[] tree_id, int max_depth)
	Display the trees in the random forest in a human-readable format.

set< text >	rf_display (text rf_table_name, int[] tree_id)
	Display the trees in the random forest in a human-readable format. This function displays all the levels of the specified trees.

set< text >	rf_display (text rf_table_name)
	Display the trees in the random forest in a human-readable format. This function displays all the levels of all trees in the RF.

rf_classify_result	rf_classify (text rf_table_name, text classification_table_name, text result_table_name, boolean is_serial_classification, int verbosity)
	Classify a dataset using a trained RF.

rf_classify_result	rf_classify (text rf_table_name, text classification_table_name, text result_table_name, int verbosity)
	Classify a dataset using a trained RF. This function does the same thing as the full version defined above, except that it only uses parallel classification.

rf_classify_result	rf_classify (text rf_table_name, text classification_table_name, text result_table_name)
	Classify a dataset using a trained RF. This function does the same thing as the full version defined above, except that it only uses parallel classification and runs in quiet mode.

float8	rf_score (text rf_table_name, text scoring_table_name, int verbosity)
	Check the accuracy of a trained RF with a scoring set.

float8	rf_score (text rf_table_name, text scoring_table_name)
	Check the accuracy of a trained RF with a scoring set in quiet mode.

boolean	rf_clean (text rf_table_name)
	Clean up the trained random forest table and any related tables.

Detailed Description

Date: April 5, 2012

Definition in file rf.sql_in.

Function Documentation

rf_classify_result rf_classify	(	text	rf_table_name,
		text	classification_table_name,
		text	result_table_name,
		boolean	is_serial_classification,
		int	verbosity
	)

The classification result is stored in a table defined as: CREATE TABLE classification_result ( id INT|BIGINT, class SUPPORTED_DATA_TYPE, prob FLOAT );
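As a rough illustration of what one row of the result table holds, the sketch below combines per-tree predictions by majority vote. It assumes that prob is the fraction of trees voting for the winning class; this page does not define prob, and majority_vote is a hypothetical helper, not part of the API.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return (class, prob) for one record from a list of per-tree predictions.

    Assumption: prob is the share of trees that voted for the winning class.
    """
    counts = Counter(tree_predictions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(tree_predictions)


# Hypothetical votes of a five-tree forest for a single record.
label, prob = majority_vote(["Play", "Play", "Do not Play", "Play", "Do not Play"])
print(label, prob)
```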

Parameters

rf_table_name	The name of the RF table. It can't be NULL.
classification_table_name	The name of the table/view that keeps the data to be classified. It can't be NULL and must exist.
result_table_name	The name of result table. It can't be NULL and must exist.
is_serial_classification	Whether to classify with the trees one by one (serial) rather than with all trees at a time. It can't be NULL.
verbosity	> 0 means this function runs in verbose mode. It can't be NULL.

Returns: An rf_classify_result object.

Definition at line 886 of file rf.sql_in.

rf_classify_result rf_classify	(	text	rf_table_name,
		text	classification_table_name,
		text	result_table_name,
		int	verbosity
	)

Parameters

rf_table_name	The name of the RF table. It can't be NULL.
classification_table_name	The name of the table/view that keeps the data to be classified. It can't be NULL and must exist.
result_table_name	The name of result table. It can't be NULL and must exist.
verbosity	> 0 means this function runs in verbose mode. It can't be NULL.

Returns: An rf_classify_result object.

Definition at line 1006 of file rf.sql_in.

rf_classify_result rf_classify	(	text	rf_table_name,
		text	classification_table_name,
		text	result_table_name
	)

Parameters

rf_table_name	The name of the RF table. It can't be NULL.
classification_table_name	The name of the table/view that keeps the data to be classified. It can't be NULL and must exist.
result_table_name	The name of result table. It can't be NULL and must exist.

Returns: An rf_classify_result object.

Definition at line 1044 of file rf.sql_in.

boolean rf_clean ( text rf_table_name)

Parameters

rf_table_name The name of the RF table. It can't be NULL.

Returns: The status of the cleanup operation.

Definition at line 1127 of file rf.sql_in.

set<text> rf_display	(	text	rf_table_name,
		int[]	tree_id,
		int	max_depth
	)

Parameters

rf_table_name	The name of the RF table. It can't be NULL and must exist.
tree_id	The trees to be displayed. If it's NULL, all the trees are displayed.
max_depth	The maximum depth to be displayed. If it's NULL, this function shows all levels.

Returns: Text representing the trees in the random forest in a human-readable format.

Definition at line 724 of file rf.sql_in.

set<text> rf_display	(	text	rf_table_name,
		int[]	tree_id
	)

Parameters

rf_table_name	The name of the RF table. It can't be NULL and must exist.
tree_id	The trees to be displayed. If it's NULL, all the trees are displayed.

Returns: Text representing the trees in the random forest in a human-readable format.

Definition at line 817 of file rf.sql_in.

set<text> rf_display ( text rf_table_name)

Parameters

rf_table_name The name of the RF table. It can't be NULL and must exist.

Returns: Text representing the trees in the random forest in a human-readable format.

Definition at line 845 of file rf.sql_in.

float8 rf_score	(	text	rf_table_name,
		text	scoring_table_name,
		int	verbosity
	)

Parameters

rf_table_name	The name of the RF table. It can't be NULL.
scoring_table_name	The name of the table/view that keeps the data to be scored. It can't be NULL and must exist.
verbosity	> 0 means this function runs in verbose mode. It can't be NULL.

Returns: The estimated accuracy information.

Definition at line 1079 of file rf.sql_in.
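A minimal sketch of what an accuracy figure of this kind usually means, assuming it is the fraction of scoring rows whose predicted class equals the known class. The exact computation in rf_score is not stated on this page, and the accuracy helper below is hypothetical.

```python
def accuracy(true_classes, predicted_classes):
    """Fraction of records whose predicted class matches the known class."""
    if len(true_classes) != len(predicted_classes):
        raise ValueError("both sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return matches / len(true_classes)


# Hypothetical scoring set of four records, three of them classified correctly.
print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))
```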

float8 rf_score	(	text	rf_table_name,
		text	scoring_table_name
	)

Parameters

rf_table_name	The name of the RF table. It can't be NULL.
scoring_table_name	The name of the table/view that keeps the data to be scored. It can't be NULL and must exist.

Returns: The estimated accuracy information.

Definition at line 1107 of file rf.sql_in.

rf_train_result rf_train	(	text	split_criterion,
		text	training_table_name,
		text	result_rf_table_name,
		int	num_trees,
		int	features_per_node,
		float	sampling_percentage,
		text	continuous_feature_names,
		text	feature_col_names,
		text	id_col_name,
		text	class_col_name,
		text	how2handle_missing_value,
		int	max_tree_depth,
		float	node_prune_threshold,
		float	node_split_threshold,
		int	verbosity
	)

Continuous features are discretized on local regions during training, rather than on the whole dataset before training, because local discretization takes context sensitivity into account.
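The difference between local and global discretization can be sketched as follows. This is an illustration only, not the method used in rf.sql_in: it assumes a binary cut chosen by information gain, and best_threshold and the toy temperature data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Pick the binary cut on a continuous feature with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_gain, best_cut = base - remainder, cut
    return best_cut


# Records that reached two different nodes of a tree (toy data).
node_a = ([60, 70, 80, 90], ["yes", "yes", "no", "no"])
node_b = ([60, 70, 80, 90], ["no", "yes", "yes", "yes"])

# Global discretization: one cut computed once on all records.
print(best_threshold(node_a[0] + node_b[0], node_a[1] + node_b[1]))
# Local discretization: one cut per node, using only that node's records.
print(best_threshold(*node_a), best_threshold(*node_b))
```

With this toy data the cut chosen for the second node differs from the cut chosen on the whole dataset, which is the context sensitivity a single global cut cannot capture.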

Parameters

split_criterion	The name of the split criterion that should be used for tree construction. The valid values are 'infogain', 'gainratio', and 'gini'. It can't be NULL. Information gain (infogain) and the Gini index (gini) are biased toward multivalued attributes. Gain ratio (gainratio) adjusts for this bias, but it tends to prefer unbalanced splits in which one partition is much smaller than the others.
training_table_name	The name of the table/view with the training data. It can't be NULL and must exist.
result_rf_table_name	The name of the table where the resulting trees will be stored. It can't be NULL and must not exist.
num_trees	The number of trees to be trained. If it's NULL, 10 will be used.
features_per_node	The number of features to be considered when finding a best split. If it's NULL, sqrt(p), where p is the number of features, will be used.
sampling_percentage	The percentage of records sampled to train a tree. If it's NULL, 0.632 bootstrap will be used.
continuous_feature_names	A comma-separated list of the names of the features whose values are continuous. NULL means there are no continuous features.
feature_col_names	A comma-separated list of names of the table columns, each of which defines a feature. NULL means all the columns except the ID and Class columns will be treated as features.
id_col_name	The name of the column containing the ID of each record. It can't be NULL.
class_col_name	The name of the column containing the correct class of each record. It can't be NULL.
how2handle_missing_value	How to handle missing values. The valid values are 'explicit' and 'ignore'. It can't be NULL.
max_tree_depth	The maximum tree depth. It can't be NULL.
node_prune_threshold	The minimum percentage of the number of records required in a child node. It can't be NULL. Its range is [0.0, 1.0]. This threshold only applies to non-root nodes. If the ratio p between the sampled training set size of a tree (the number of rows) and the total training set size is less than or equal to the value of this parameter, the tree has only one node (the root node). If its value is 1, p is always less than or equal to it, so the tree has only the root node. If its value is 0, no nodes are pruned by this parameter.
node_split_threshold	The minimum percentage of the number of records required in a node in order for a further split to be possible. It can't be NULL. Its range is [0.0, 1.0]. If the ratio p between the sampled training set size of a tree (the number of rows) and the total training set size is less than the value of this parameter, the root node will be a leaf, so the trained tree has only one node. If p is equal to the value of this parameter, the trained tree has only two levels, since only the root node will grow. If its value is 0, trees can grow extensively.
verbosity	> 0 means this function runs in verbose mode. It can't be NULL.
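The bias described for split_criterion can be seen with textbook definitions of the three measures. The sketch below is not taken from rf.sql_in and the library's exact formulas may differ; the function names and the 16-row toy dataset are hypothetical. It compares a two-valued feature with a feature that has a distinct value in every row.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(feature, labels):
    """Group the class labels by feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    return list(groups.values())


def infogain(feature, labels):
    parts = partition(feature, labels)
    return entropy(labels) - sum(len(p) / len(labels) * entropy(p) for p in parts)


def gainratio(feature, labels):
    split_info = entropy(feature)  # entropy of the partition sizes
    return infogain(feature, labels) / split_info if split_info > 0 else 0.0


def gini_gain(feature, labels):
    parts = partition(feature, labels)
    return gini(labels) - sum(len(p) / len(labels) * gini(p) for p in parts)


labels = ["y"] * 7 + ["n"] + ["n"] * 7 + ["y"]
windy = ["a"] * 8 + ["b"] * 8   # two values, one misclassified row per side
row_id = list(range(16))        # one distinct value per row

for name, feature in (("windy", windy), ("row_id", row_id)):
    print(name, infogain(feature, labels), gini_gain(feature, labels), gainratio(feature, labels))
```

On this data information gain and Gini gain both rank the many-valued row_id feature higher, while gain ratio ranks the two-valued feature higher.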

Returns: An rf_train_result object.

Definition at line 540 of file rf.sql_in.
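The numeric defaults and thresholds of the full form can be sketched as simple arithmetic. This is an interpretation of the parameter descriptions, not the rf.sql_in code: all four helper names are hypothetical, the rounding applied to sqrt(p) is not stated on this page, and the thresholds are assumed to compare a node's row count with the total training set size.

```python
import math


def default_features_per_node(num_features):
    """sqrt(p) for a NULL features_per_node; rounding is unspecified here."""
    return math.sqrt(num_features)


def expected_distinct_fraction(n):
    """Expected share of distinct records when n records are drawn with
    replacement from n records; it approaches 1 - 1/e, about 0.632."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def can_split(node_rows, total_rows, node_split_threshold):
    """Assumed reading: a node may split if its share of rows reaches the threshold."""
    return node_rows / total_rows >= node_split_threshold


def keeps_child(child_rows, total_rows, node_prune_threshold):
    """Assumed reading: a child node survives if its share of rows reaches the threshold."""
    return child_rows / total_rows >= node_prune_threshold


print(default_features_per_node(16))
print(round(expected_distinct_fraction(1000), 3))
# A threshold of 0 never blocks a split or prunes a child.
print(can_split(1, 100, 0.0), keeps_child(1, 100, 0.0))
# A root holding half of the rows may split at a threshold of 0.5; a smaller one may not.
print(can_split(50, 100, 0.5), can_split(49, 100, 0.5))
```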

rf_train_result rf_train	(	text	split_criterion,
		text	training_table_name,
		text	result_rf_table_name
	)

Parameters

split_criterion	The split criterion used for tree construction. The valid values are 'infogain', 'gainratio', and 'gini'. It can't be NULL.
training_table_name	The name of the table/view with the training data. It can't be NULL and must exist.
result_rf_table_name	The name of the table where the resulting trees will be stored. It can't be NULL and must not exist.

Returns: An rf_train_result object.

Definition at line 665 of file rf.sql_in.
