MADlib 0.7 User Documentation
/* ----------------------------------------------------------------------- *//**
 *
 * @file bayes.sql_in
 *
 * @brief SQL functions for naive Bayes
 * @date January 2011
 *
 * @sa For a brief introduction to Naive Bayes Classification, see the module
 *     description \ref grp_bayes.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4')

/**
@addtogroup grp_bayes

@about

Naive Bayes refers to a stochastic model where all independent variables
\f$ a_1, \dots, a_n \f$ (often referred to as attributes in this context)
independently contribute to the probability that a data point belongs to a
certain class \f$ c \f$. In detail, \b Bayes' theorem states that
\f[
    \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n)
    =   \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)}
             {\Pr(A_1 = a_1, \dots, A_n = a_n)}
    \,,
\f]
and the \b naive assumption is that
\f[
    \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)
    =   \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
    \,.
\f]
Naive Bayes classification estimates feature probabilities and class priors
using maximum likelihood or Laplacian smoothing. These parameters are then used
to classify new data.

A Naive Bayes classifier computes the following formula:
\f[
    \text{classify}(a_1, ..., a_n)
    =   \arg\max_c \left\{
            \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
        \right\}
\f]
where \f$ c \f$ ranges over all classes in the training data and probabilities
are estimated with relative frequencies from the training set.
There are different ways to estimate the feature probabilities
\f$ P(A_i = a \mid C = c) \f$. The maximum likelihood estimate takes the
relative frequencies.
That is:
\f[
    P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c}
\f]
where
- \f$ \#(c,i,a) \f$ denotes the # of training samples where attribute \f$ i \f$
  is \f$ a \f$ and class is \f$ c \f$
- \f$ \#c \f$ denotes the # of training samples where class is \f$ c \f$.

Since the maximum likelihood sometimes results in estimates of "0", you might
want to use a "smoothed" estimate. To do this, you add a number of "virtual"
samples and make the assumption that these samples are evenly distributed among
the values assumed by attribute \f$ i \f$ (that is, the set of all values
observed for attribute \f$ i \f$ for any class):

\f[
    P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i}
\f]
where
- \f$ \#i \f$ denotes the # of distinct values for attribute \f$ i \f$ (across
  all classes)
- \f$ s \geq 0 \f$ denotes the smoothing factor.

The case \f$ s = 1 \f$ is known as "Laplace smoothing". The case \f$ s = 0 \f$
trivially reduces to maximum-likelihood estimates.

\b Note:
(1) The probabilities computed on PostgreSQL and on Greenplum Database may
differ slightly due to the nature of floating-point computation. Usually this
is not important. However, if a data point has
\f[
    P(C=c_i \mid A) \approx P(C=c_j \mid A)
\f]
for two classes, this data point might be classified into different classes on
PostgreSQL and Greenplum. This leads to differences in classifications
between PostgreSQL and Greenplum for some data sets, but it should not
affect the quality of the results.

(2) When two classes have equal and highest probability among all classes,
the classification result is an array of these two classes, but the order
of the two classes is random.
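As a concrete illustration of the smoothed estimate defined above, here is a
minimal, hand-written Python sketch (for illustration only; it is not part of
the MADlib implementation, which lives in the SQL functions and Python module
documented in this file):

```python
def feature_prob(samples, i, a, c, s=1.0):
    """Smoothed estimate of P(A_i = a | C = c).

    samples: list of (class, attributes) pairs.
    s: smoothing factor; s=1 is Laplace smoothing, s=0 gives the
    maximum-likelihood estimate.
    """
    # #(c,i,a): training samples with class c whose attribute i equals a
    n_cia = sum(1 for cls, attrs in samples if cls == c and attrs[i] == a)
    # #c: training samples with class c
    n_c = sum(1 for cls, _ in samples if cls == c)
    # #i: distinct values of attribute i across all classes
    n_i = len({attrs[i] for _, attrs in samples})
    return (n_cia + s) / (n_c + s * n_i)
```

With \f$ s = 0 \f$ this is exactly the relative frequency
\f$ \#(c,i,a) / \#c \f$; with \f$ s > 0 \f$ the denominator grows by one
virtual sample per distinct attribute value, so no estimate is ever zero.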
(3) The current implementation of Naive Bayes classification is only suitable
for discrete (categorical) attributes.

For continuous data, a typical assumption, usually used for small datasets,
is that the continuous values associated with each class are distributed
according to a Gaussian distribution,
and then the probabilities \f$ P(A_i = a \mid C=c) \f$ can be estimated.
Another common technique for handling continuous values, which is better for
large data sets, is to use binning to discretize the values and convert the
continuous data into categorical bins. These approaches are not currently
implemented but are planned for future releases.

(4) One can still provide floating-point data to the Naive Bayes
classification function. Floating-point numbers can be used as symbolic
substitutions for categorical data. The classification works best if
there are sufficient data points for each floating-point attribute. However,
if floating-point numbers are used as continuous data, no warning is raised and
the result may not be as expected.

@input

The <b>training data</b> is expected to be of the following form:
<pre>{TABLE|VIEW} <em>trainingSource</em> (
    ...
    <em>trainingClassColumn</em> INTEGER,
    <em>trainingAttrColumn</em> INTEGER[],
    ...
)</pre>

The <b>data to classify</b> is expected to be of the following form:
<pre>{TABLE|VIEW} <em>classifySource</em> (
    ...
    <em>classifyKeyColumn</em> ANYTYPE,
    <em>classifyAttrColumn</em> INTEGER[],
    ...
)</pre>

@usage

- Precompute feature probabilities and class priors:
  <pre>SELECT \ref create_nb_prepared_data_tables(
    '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
    <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
);</pre>
  This creates table <em>featureProbsName</em> for storing feature
  probabilities and table <em>classPriorsName</em> for storing the class priors.
- Perform Naive Bayes classification:
  <pre>SELECT \ref create_nb_classify_view(
    '<em>featureProbsName</em>', '<em>classPriorsName</em>',
    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
    <em>numAttrs</em>, '<em>destName</em>'
);</pre>
  This creates the view <tt><em>destName</em></tt> mapping
  <em>classifyKeyColumn</em> to the Naive Bayes classification:
  <pre>key | nb_classification
----+------------------
...</pre>
- Compute Naive Bayes probabilities:
  <pre>SELECT \ref create_nb_probs_view(
    '<em>featureProbsName</em>', '<em>classPriorsName</em>',
    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
    <em>numAttrs</em>, '<em>destName</em>'
);</pre>
  This creates the view <tt><em>destName</em></tt> mapping
  <em>classifyKeyColumn</em> and every single class to the Naive Bayes
  probability:
  <pre>key | class | nb_prob
----+-------+--------
...</pre>
- Ad-hoc execution (no precomputation):
  Functions \ref create_nb_classify_view and
  \ref create_nb_probs_view can be used in an ad-hoc fashion without the above
  precomputation step.
In this case, replace the function arguments
  <pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre>
  with
  <pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre>

@examp

The following is an extremely simplified example of the above option #1 which
can be verified by hand.

-# The training and the classification data:
\verbatim
sql> SELECT * FROM training;
 id | class | attributes
----+-------+------------
  1 |     1 | {1,2,3}
  2 |     1 | {1,2,1}
  3 |     1 | {1,4,3}
  4 |     2 | {1,2,2}
  5 |     2 | {0,2,2}
  6 |     2 | {0,1,3}
(6 rows)

sql> SELECT * FROM toclassify;
 id | attributes
----+------------
  1 | {0,2,1}
  2 | {1,2,3}
(2 rows)
\endverbatim
-# Precompute feature probabilities and class priors:
\verbatim
sql> SELECT madlib.create_nb_prepared_data_tables(
'training', 'class', 'attributes', 3, 'nb_feature_probs', 'nb_class_priors');
\endverbatim
-# Optionally check the contents of the precomputed tables:
\verbatim
sql> SELECT * FROM nb_class_priors;
 class | class_cnt | all_cnt
-------+-----------+---------
     1 |         3 |       6
     2 |         3 |       6
(2 rows)

sql> SELECT * FROM nb_feature_probs;
 class | attr | value | cnt | attr_cnt
-------+------+-------+-----+----------
     1 |    1 |     0 |   0 |        2
     1 |    1 |     1 |   3 |        2
     1 |    2 |     1 |   0 |        3
     1 |    2 |     2 |   2 |        3
...
\endverbatim
-# Create the view with Naive Bayes classification and check the results:
\verbatim
sql> SELECT madlib.create_nb_classify_view (
'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_classify_view_fast');

sql> SELECT * FROM nb_classify_view_fast;
 key | nb_classification
-----+-------------------
   1 | {2}
   2 | {1}
(2 rows)
\endverbatim
-# Look at the probabilities for each class (note that we use "Laplacian smoothing"):
\verbatim
sql> SELECT madlib.create_nb_probs_view (
'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_probs_view_fast');

sql> SELECT * FROM nb_probs_view_fast;
 key | class | nb_prob
-----+-------+---------
   1 |     1 |     0.4
   1 |     2 |     0.6
   2 |     1 |    0.75
   2 |     2 |    0.25
(4 rows)
\endverbatim

@literature

[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter
<em>Generative and Discriminative Classifiers: Naive Bayes and Logistic
Regression</em> available at: http://www.cs.cmu.edu/~tom/NewChapters.html

[2] Wikipedia, Naive Bayes classifier,
http://en.wikipedia.org/wiki/Naive_Bayes_classifier

@sa File bayes.sql_in documenting the SQL functions.
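The probabilities in the example can also be reproduced by hand. The following
standalone Python sketch is an illustration only, not MADlib code; it assumes
Laplace smoothing (\f$ s = 1 \f$) and that each key's class scores are
normalized to sum to one, which matches the `nb_probs_view_fast` output above:

```python
# Training data from the example: (class, attributes)
training = [(1, (1, 2, 3)), (1, (1, 2, 1)), (1, (1, 4, 3)),
            (2, (1, 2, 2)), (2, (0, 2, 2)), (2, (0, 1, 3))]

def nb_probs(training, point, s=1.0):
    """Per-class posterior for one data point, normalized over classes."""
    classes = sorted({c for c, _ in training})
    # #i: distinct values of each attribute i across all classes
    distinct = [len({attrs[i] for _, attrs in training})
                for i in range(len(point))]
    scores = {}
    for c in classes:
        class_cnt = sum(1 for cls, _ in training if cls == c)
        score = class_cnt / len(training)          # class prior
        for i, a in enumerate(point):
            cnt = sum(1 for cls, attrs in training
                      if cls == c and attrs[i] == a)
            score *= (cnt + s) / (class_cnt + s * distinct[i])
        scores[c] = score
    total = sum(scores.values())
    return {c: v / total for c, v in scores.items()}
```

For the first data point `{0,2,1}` this yields 0.4 for class 1 and 0.6 for
class 2; for `{1,2,3}` it yields 0.75 and 0.25, matching the view output.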
@internal
@sa namespace bayes (documenting the implementation in Python)
@endinternal

*/

-- Begin argmax definition

CREATE TYPE MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS (
    args INTEGER[],
    value DOUBLE PRECISION
);

CREATE FUNCTION MADLIB_SCHEMA.argmax_transition(
    oldmax MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
    newkey INTEGER,
    newvalue DOUBLE PRECISION)
RETURNS MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS
$$
    SELECT CASE WHEN $3 < $1.value OR $2 IS NULL OR ($3 IS NULL AND NOT $1.value IS NULL) THEN $1
                WHEN $3 = $1.value OR ($3 IS NULL AND $1.value IS NULL AND NOT $1.args IS NULL)
                    THEN ($1.args || $2, $3)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
                ELSE (array[$2], $3)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
           END
$$
LANGUAGE sql IMMUTABLE;

CREATE FUNCTION MADLIB_SCHEMA.argmax_combine(
    max1 MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
    max2 MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE)
RETURNS MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS
$$
    -- If SQL guaranteed short-circuit evaluation, the following could become
    -- shorter. Unfortunately, this is not the case.
    -- Section 6.3.3.3 of ISO/IEC 9075-1:2008 Framework (SQL/Framework):
    --
    --   "However, it is implementation-dependent whether expressions are
    --    actually evaluated left to right, particularly when operands or
    --    operators might cause conditions to be raised or if the results of the
    --    expressions can be determined without completely evaluating all parts
    --    of the expression."
    --
    -- Again, we hope the optimizer does its job.
    SELECT CASE WHEN $1 IS NULL THEN $2
                WHEN $2 IS NULL THEN $1
                WHEN ($1.value = $2.value) OR ($1.value IS NULL AND $2.value IS NULL)
                    THEN ($1.args || $2.args, $1.value)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
                WHEN $1.value IS NULL OR $1.value < $2.value THEN $2
                ELSE $1
           END
$$
LANGUAGE sql IMMUTABLE;

CREATE FUNCTION MADLIB_SCHEMA.argmax_final(
    finalstate MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE)
RETURNS INTEGER[] AS
$$
    SELECT $1.args
$$
LANGUAGE sql IMMUTABLE;

/**
 * @internal
 * @brief Argmax: Return the key of the row for which value is maximal
 *
 * The "index set" of the argmax function is of type INTEGER and we range over
 * DOUBLE PRECISION values. It is not required that all keys are distinct.
 *
 * @note
 * argmax should only be used on unsorted data because it will not exploit
 * indices, and its running time is \f$ \Theta(n) \f$.
 *
 * @implementation
 * The implementation is in SQL, with a flavor of functional programming.
 * The hope is that the optimizer does a good job here.
 */
CREATE AGGREGATE MADLIB_SCHEMA.argmax(/*+ key */ INTEGER, /*+ value */ DOUBLE PRECISION) (
    SFUNC=MADLIB_SCHEMA.argmax_transition,
    STYPE=MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.argmax_combine,')
    FINALFUNC=MADLIB_SCHEMA.argmax_final
);


/**
 * @brief Precompute all class priors and feature probabilities
 *
 * Feature probabilities are stored in a table of format
 * <pre>TABLE <em>featureProbsDestName</em> (
 *    class INTEGER,
 *    attr INTEGER,
 *    value INTEGER,
 *    cnt INTEGER,
 *    attr_cnt INTEGER
 *)</pre>
 *
 * Class priors are stored in a table of format
 * <pre>TABLE <em>classPriorsDestName</em> (
 *    class INTEGER,
 *    class_cnt INTEGER,
 *    all_cnt INTEGER
 *)</pre>
 *
 * @param trainingSource Name of relation containing the training data
 * @param trainingClassColumn Name of class column in training data
 * @param trainingAttrColumn Name of attributes-array column in training data
 * @param numAttrs Number of attributes to use for classification
 * @param featureProbsDestName Name of feature-probabilities table to create
 * @param classPriorsDestName Name of class-priors table to create
 *
 * @usage
 * Precompute feature probabilities and class priors:
 * <pre>SELECT \ref create_nb_prepared_data_tables(
 *     '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
 *     <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
 *);</pre>
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_prepared_data().
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_prepared_data_tables(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "featureProbsDestName" VARCHAR,
    "classPriorsDestName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_prepared_data_table)$$
LANGUAGE plpythonu VOLATILE;

/**
 * @brief Create a view with columns <tt>(key, nb_classification)</tt>
 *
 * The created relation will be
 *
 * <tt>{TABLE|VIEW} <em>destName</em> (key, nb_classification)</tt>
 *
 * where \c nb_classification is an array containing the most likely
 * class(es) of the record in \em classifySource identified by \c key.
 *
 * @param featureProbsSource Name of table with precomputed feature
 *     probabilities, as created with create_nb_prepared_data_tables()
 * @param classPriorsSource Name of table with precomputed class priors, as
 *     created with create_nb_prepared_data_tables()
 * @param classifySource Name of the relation that contains data to be classified
 * @param classifyKeyColumn Name of column in \em classifySource that can
 *     serve as unique identifier (the key of the source relation)
 * @param classifyAttrColumn Name of attributes-array column in \em classifySource
 * @param numAttrs Number of attributes to use for classification
 * @param destName Name of the view to create
 *
 * @note \c create_nb_classify_view can be called in an ad-hoc fashion. See
 *     \ref grp_bayes for instructions.
 *
 * @usage
 * -# Create Naive Bayes classifications view:
 * <pre>SELECT \ref create_nb_classify_view(
 *     '<em>featureProbsName</em>', '<em>classPriorsName</em>',
 *     '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
 *     <em>numAttrs</em>, '<em>destName</em>'
 *);</pre>
 * -# Show Naive Bayes classifications:
 * <pre>SELECT * FROM <em>destName</em>;</pre>
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_classification(). See there
 *     for details.
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_view(
    "featureProbsSource" VARCHAR,
    "classPriorsSource" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_view)$$
LANGUAGE plpythonu VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_view(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_view)$$
LANGUAGE plpythonu VOLATILE;


/**
 * @brief Create view with columns <tt>(key, class, nb_prob)</tt>
 *
 * The created view will be of the following form:
 *
 * <pre>VIEW <em>destName</em> (
 *    key ANYTYPE,
 *    class INTEGER,
 *    nb_prob FLOAT8
 *)</pre>
 *
 * where \c nb_prob is the Naive-Bayes probability that \c class is the true
 * class of the record in \em classifySource identified by \c key.
 *
 * @param featureProbsSource Name of table with precomputed feature
 *     probabilities, as created with create_nb_prepared_data_tables()
 * @param classPriorsSource Name of table with precomputed class priors, as
 *     created with create_nb_prepared_data_tables()
 * @param classifySource Name of the relation that contains data to be classified
 * @param classifyKeyColumn Name of column in \em classifySource that can
 *     serve as unique identifier (the key of the source relation)
 * @param classifyAttrColumn Name of attributes-array column in \em classifySource
 * @param numAttrs Number of attributes to use for classification
 * @param destName Name of the view to create
 *
 * @note \c create_nb_probs_view can be called in an ad-hoc fashion. See
 *     \ref grp_bayes for instructions.
 *
 * @usage
 * -# Create Naive Bayes probabilities view:
 * <pre>SELECT \ref create_nb_probs_view(
 *     '<em>featureProbsName</em>', '<em>classPriorsName</em>',
 *     '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
 *     <em>numAttrs</em>, '<em>destName</em>'
 *);</pre>
 * -# Show Naive Bayes probabilities:
 * <pre>SELECT * FROM <em>destName</em>;</pre>
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_bayes_probabilities().
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_probs_view(
    "featureProbsSource" VARCHAR,
    "classPriorsSource" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_bayes_probabilities_view)$$
LANGUAGE plpythonu VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.create_nb_probs_view(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_bayes_probabilities_view)$$
LANGUAGE plpythonu VOLATILE;


/**
 * @brief Create a SQL function mapping arrays of attribute values to the Naive
 *     Bayes classification.
 *
 * The created SQL function is bound to the given feature probabilities and
 * class priors. Its declaration will be:
 *
 * <tt>
 * FUNCTION <em>destName</em> (attributes INTEGER[], smoothingFactor DOUBLE PRECISION)
 * RETURNS INTEGER[]</tt>
 *
 * The return type is \c INTEGER[] because the Naive Bayes classification might
 * be ambiguous (in which case all of the most likely candidates are returned).
 *
 * @param featureProbsSource Name of table with precomputed feature
 *     probabilities, as created with create_nb_prepared_data_tables()
 * @param classPriorsSource Name of table with precomputed class priors, as
 *     created with create_nb_prepared_data_tables()
 * @param numAttrs Number of attributes to use for classification
 * @param destName Name of the function to create
 *
 * @note
 * Like \ref create_nb_classify_view and \ref create_nb_probs_view,
 * \c create_nb_classify_fn can be called in an ad-hoc fashion. See
 * \ref grp_bayes for instructions.
 *
 * @usage
 * -# Create classification function:
 * <pre>SELECT create_nb_classify_fn(
 *     '<em>featureProbsSource</em>', '<em>classPriorsSource</em>',
 *     <em>numAttrs</em>, '<em>destName</em>'
 *);</pre>
 * -# Run classification function:
 * <pre>SELECT <em>destName</em>(<em>attributes</em>, <em>smoothingFactor</em>);</pre>
 *
 * @note
 * On Greenplum, the generated SQL function can only be called on the master.
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_classification_function().
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_fn(
    "featureProbsSource" VARCHAR,
    "classPriorsSource" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_function)$$
LANGUAGE plpythonu VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_fn(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_function)$$
LANGUAGE plpythonu VOLATILE;