/* ----------------------------------------------------------------------- *//**
 *
 * @file bayes.sql_in
 *
 * @brief SQL functions for naive Bayes
 * @date   January 2011
 *
 * @sa For a brief introduction to Naive Bayes Classification, see the module
 *     description \ref grp_bayes.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4')

/**
@addtogroup grp_bayes

@about

Naive Bayes refers to a stochastic model where all independent variables
\f$ a_1, \dots, a_n \f$ (often referred to as attributes in this context)
independently contribute to the probability that a data point belongs to a
certain class \f$ c \f$. In detail, \b Bayes' theorem states that
\f[
    \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n)
    =   \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)}
             {\Pr(A_1 = a_1, \dots, A_n = a_n)}
    \,,
\f]
and the \b naive assumption is that
\f[
    \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)
    =   \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
    \,.
\f]
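
Combining Bayes' theorem with the naive assumption, and noting that the
denominator \f$ \Pr(A_1 = a_1, \dots, A_n = a_n) \f$ does not depend on the
class, the posterior satisfies
\f[
    \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n)
    \propto \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
    \,,
\f]
which is the quantity maximized by the classification rule below.
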
Naive Bayes classification estimates feature probabilities and class priors
using maximum likelihood or Laplacian smoothing. These parameters are then used
to classify new data.

A Naive Bayes classifier computes the following formula:
\f[
    \text{classify}(a_1, ..., a_n)
    =   \arg\max_c \left\{
            \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
        \right\}
\f]
where \f$ c \f$ ranges over all classes in the training data and probabilities
are estimated with relative frequencies from the training set.
There are different ways to estimate the feature probabilities
\f$ P(A_i = a \mid C = c) \f$. The maximum likelihood estimate takes the
relative frequencies. That is:
\f[
    P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c}
\f]
where
- \f$ \#(c,i,a) \f$ denotes the number of training samples where attribute \f$ i \f$
  is \f$ a \f$ and the class is \f$ c \f$
- \f$ \#c \f$ denotes the number of training samples where the class is \f$ c \f$.

Since maximum-likelihood estimation sometimes results in estimates of "0", you
might want to use a "smoothed" estimate. To do this, you add a number of
"virtual" samples and assume that these samples are evenly distributed among
the values assumed by attribute \f$ i \f$ (that is, the set of all values
observed for attribute \f$ i \f$ for any class):

\f[
    P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i}
\f]
where
- \f$ \#i \f$ denotes the number of distinct values for attribute \f$ i \f$ (for all
  classes)
- \f$ s \geq 0 \f$ denotes the smoothing factor.

The case \f$ s = 1 \f$ is known as "Laplace smoothing". The case \f$ s = 0 \f$
trivially reduces to maximum-likelihood estimates.

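As an illustration (not part of the MADlib implementation), the smoothed
estimate can be sketched in Python; setting `s = 0` recovers the
maximum-likelihood estimate:

```python
def feature_prob(samples, i, a, c, s=1.0):
    """Smoothed estimate of P(A_i = a | C = c).

    samples is a list of (class, attributes) pairs; s is the smoothing factor.
    """
    # #(c,i,a): samples of class c whose i-th attribute equals a
    cia = sum(1 for cls, attrs in samples if cls == c and attrs[i] == a)
    # #c: samples of class c
    c_cnt = sum(1 for cls, _ in samples if cls == c)
    # #i: distinct values of attribute i across all classes
    i_cnt = len({attrs[i] for _, attrs in samples})
    return (cia + s) / (c_cnt + s * i_cnt)
```

With \f$ s = 1 \f$ (Laplace smoothing), an attribute value never observed
together with a class still receives the positive probability
\f$ s / (\#c + s \cdot \#i) \f$ instead of zero.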
\b Note:
(1) The probabilities computed on PostgreSQL and on Greenplum Database can
differ slightly due to the nature of floating-point arithmetic. Usually this
is not important. However, if a data point has
\f[
P(C=c_i \mid A) \approx P(C=c_j \mid A)
\f]
for two classes, this data point might be classified into different classes on
PostgreSQL and Greenplum. This leads to differences in classifications
on PostgreSQL and Greenplum for some data sets, but it should not
affect the quality of the results.

(2) When two classes have equal and highest probability among all classes,
the classification result is an array of these two classes, but the order
of the two classes is random.

(3) The current implementation of Naive Bayes classification is only suitable
for discrete (categorical) attributes.

For continuous data, a typical assumption, usually used for small datasets,
is that the continuous values associated with each class are distributed
according to a Gaussian distribution,
and then the probabilities \f$ P(A_i = a \mid C=c) \f$ can be estimated.
Another common technique for handling continuous values, which is better for
large data sets, is to use binning to discretize the values and convert the
continuous data into categorical bins. These approaches are not currently
implemented but are planned for future releases.

(4) One can still provide floating-point data to the naive Bayes
classification function. Floating-point numbers can be used as symbolic
substitutions for categorical data. The classification works best if
there are sufficient data points for each floating-point attribute value.
However, if floating-point numbers are used as continuous data, no warning is
raised and the result may not be as expected.

@input

The <b>training data</b> is expected to be of the following form:
<pre>{TABLE|VIEW} <em>trainingSource</em> (
    ...
    <em>trainingClassColumn</em> INTEGER,
    <em>trainingAttrColumn</em> INTEGER[],
    ...
)</pre>

The <b>data to classify</b> is expected to be of the following form:
<pre>{TABLE|VIEW} <em>classifySource</em> (
    ...
    <em>classifyKeyColumn</em> ANYTYPE,
    <em>classifyAttrColumn</em> INTEGER[],
    ...
)</pre>

@usage

- Precompute feature probabilities and class priors:
  <pre>SELECT \ref create_nb_prepared_data_tables(
    '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
    <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
    );</pre>
  This creates table <em>featureProbsName</em> for storing feature
  probabilities and table <em>classPriorsName</em> for storing the class priors.
- Perform Naive Bayes classification:
  <pre>SELECT \ref create_nb_classify_view(
    '<em>featureProbsName</em>', '<em>classPriorsName</em>',
    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
    <em>numAttrs</em>, '<em>destName</em>'
    );</pre>
  This creates the view <tt><em>destName</em></tt> mapping
  <em>classifyKeyColumn</em> to the Naive Bayes classification:
  <pre>key | nb_classification
----+------------------
...</pre>
- Compute Naive Bayes probabilities:
  <pre>SELECT \ref create_nb_probs_view(
    '<em>featureProbsName</em>', '<em>classPriorsName</em>',
    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
    <em>numAttrs</em>, '<em>destName</em>'
);</pre>
  This creates the view <tt><em>destName</em></tt> mapping
  <em>classifyKeyColumn</em> and every single class to the Naive Bayes
  probability:
  <pre>key | class | nb_prob
----+-------+--------
...</pre>
- Ad-hoc execution (no precomputation):
  Functions \ref create_nb_classify_view and
  \ref create_nb_probs_view can be used in an ad-hoc fashion without the above
  precomputation step. In this case, replace the function arguments
  <pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre>
  with
  <pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre>

@examp

The following is an extremely simplified example of the first usage option
described above, which can be verified by hand.

-#  The training and the classification data:
\verbatim
sql> SELECT * FROM training;
 id | class | attributes
----+-------+------------
  1 |     1 | {1,2,3}
  2 |     1 | {1,2,1}
  3 |     1 | {1,4,3}
  4 |     2 | {1,2,2}
  5 |     2 | {0,2,2}
  6 |     2 | {0,1,3}
(6 rows)

sql> SELECT * FROM toclassify;
 id | attributes
----+------------
  1 | {0,2,1}
  2 | {1,2,3}
(2 rows)
\endverbatim
-#  Precompute feature probabilities and class priors:
\verbatim
sql> SELECT madlib.create_nb_prepared_data_tables(
'training', 'class', 'attributes', 3, 'nb_feature_probs', 'nb_class_priors');
\endverbatim
-#  Optionally check the contents of the precomputed tables:
\verbatim
sql> SELECT * FROM nb_class_priors;
 class | class_cnt | all_cnt
-------+-----------+---------
     1 |         3 |       6
     2 |         3 |       6
(2 rows)

sql> SELECT * FROM nb_feature_probs;
 class | attr | value | cnt | attr_cnt
-------+------+-------+-----+----------
     1 |    1 |     0 |   0 |        2
     1 |    1 |     1 |   3 |        2
     1 |    2 |     1 |   0 |        3
     1 |    2 |     2 |   2 |        3
...
\endverbatim
-#  Create the view with the Naive Bayes classification and check the results:
\verbatim
sql> SELECT madlib.create_nb_classify_view (
'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_classify_view_fast');

sql> SELECT * FROM nb_classify_view_fast;
 key | nb_classification
-----+-------------------
   1 | {2}
   2 | {1}
(2 rows)
\endverbatim
-#  Look at the probabilities for each class (note that we use Laplace smoothing):
\verbatim
sql> SELECT madlib.create_nb_probs_view (
'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_probs_view_fast');

sql> SELECT * FROM nb_probs_view_fast;
 key | class | nb_prob
-----+-------+---------
   1 |     1 |     0.4
   1 |     2 |     0.6
   2 |     1 |    0.75
   2 |     2 |    0.25
(4 rows)
\endverbatim

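The hand computation behind this example can be reproduced with a short,
self-contained Python sketch (an illustration independent of the MADlib
implementation; it applies the smoothed estimate with \f$ s = 1 \f$ and
normalizes the scores over all classes):

```python
def classify_probs(training, attrs, s=1.0):
    """Posterior probability of each class for one attribute vector."""
    classes = sorted({c for c, _ in training})
    n = len(attrs)
    # distinct values per attribute position, over all classes (#i)
    distinct = [len({a[i] for _, a in training}) for i in range(n)]
    scores = {}
    for c in classes:
        c_cnt = sum(1 for cc, _ in training if cc == c)
        p = c_cnt / len(training)                       # class prior
        for i in range(n):
            cia = sum(1 for cc, a in training
                      if cc == c and a[i] == attrs[i])  # #(c,i,a)
            p *= (cia + s) / (c_cnt + s * distinct[i])  # smoothed estimate
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}

# The training relation from the example above:
training = [(1, [1, 2, 3]), (1, [1, 2, 1]), (1, [1, 4, 3]),
            (2, [1, 2, 2]), (2, [0, 2, 2]), (2, [0, 1, 3])]

probs_1 = classify_probs(training, [0, 2, 1])  # ≈ {1: 0.4, 2: 0.6}
probs_2 = classify_probs(training, [1, 2, 3])  # ≈ {1: 0.75, 2: 0.25}
```

The resulting values match the `nb_probs_view_fast` output above, and taking
the class(es) of maximal probability reproduces `nb_classify_view_fast`.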
@literature

[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter
    <em>Generative and Discriminative Classifiers: Naive Bayes and Logistic
    Regression</em>, available at: http://www.cs.cmu.edu/~tom/NewChapters.html

[2] Wikipedia, Naive Bayes classifier,
    http://en.wikipedia.org/wiki/Naive_Bayes_classifier

@sa File bayes.sql_in documenting the SQL functions.

@internal
@sa namespace bayes (documenting the implementation in Python)
@endinternal

*/

-- Begin of argmax definition

CREATE TYPE MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS (
    args INTEGER[],
    value DOUBLE PRECISION
);

CREATE FUNCTION MADLIB_SCHEMA.argmax_transition(
    oldmax MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
    newkey INTEGER,
    newvalue DOUBLE PRECISION)
RETURNS MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS
$$
    SELECT CASE WHEN $3 < $1.value OR $2 IS NULL OR ($3 IS NULL AND NOT $1.value IS NULL) THEN $1
                WHEN $3 = $1.value OR ($3 IS NULL AND $1.value IS NULL AND NOT $1.args IS NULL)
                    THEN ($1.args || $2, $3)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
                ELSE (array[$2], $3)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
           END
$$
LANGUAGE sql IMMUTABLE;

CREATE FUNCTION MADLIB_SCHEMA.argmax_combine(
    max1 MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
    max2 MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE)
RETURNS MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS
$$
    -- If SQL guaranteed short-circuit evaluation, the following could become
    -- shorter. Unfortunately, this is not the case.
    -- Section 6.3.3.3 of ISO/IEC 9075-1:2008 Framework (SQL/Framework):
    --
    --  "However, it is implementation-dependent whether expressions are
    --   actually evaluated left to right, particularly when operands or
    --   operators might cause conditions to be raised or if the results of the
    --   expressions can be determined without completely evaluating all parts
    --   of the expression."
    --
    -- Again, the optimizer hopefully does its job.
    SELECT CASE WHEN $1 IS NULL THEN $2
                WHEN $2 IS NULL THEN $1
                WHEN ($1.value = $2.value) OR ($1.value IS NULL AND $2.value IS NULL)
                    THEN ($1.args || $2.args, $1.value)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
                WHEN $1.value IS NULL OR $1.value < $2.value THEN $2
                ELSE $1
           END
$$
LANGUAGE sql IMMUTABLE;

CREATE FUNCTION MADLIB_SCHEMA.argmax_final(
    finalstate MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE)
RETURNS INTEGER[] AS
$$
    SELECT $1.args
$$
LANGUAGE sql IMMUTABLE;

/**
 * @internal
 * @brief Argmax: Return the key of the row for which value is maximal
 *
 * The "index set" of the argmax function is of type INTEGER and we range over
 * DOUBLE PRECISION values. It is not required that all keys are distinct.
 *
 * @note
 * argmax should only be used on unsorted data because it will not exploit
 * indices, and its running time is \f$ \Theta(n) \f$.
 *
 * @implementation
 * The implementation is in SQL, with a flavor of functional programming.
 * The hope is that the optimizer does a good job here.
 */
CREATE AGGREGATE MADLIB_SCHEMA.argmax(/*+ key */ INTEGER, /*+ value */ DOUBLE PRECISION) (
    SFUNC=MADLIB_SCHEMA.argmax_transition,
    STYPE=MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.argmax_combine,')
    FINALFUNC=MADLIB_SCHEMA.argmax_final
);

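The intended semantics of this aggregate — collect every key whose value
attains the maximum, ignoring NULL keys and treating NULL values as smaller
than any non-NULL value — can be sketched in Python (a behavioral illustration
only, not a model of the SQL evaluation order):

```python
def argmax(pairs):
    """Return the list of all keys achieving the maximal value."""
    best_keys, best_val = [], None
    for key, val in pairs:
        if key is None:
            continue                      # NULL keys are ignored
        if not best_keys:
            best_keys, best_val = [key], val
        elif val == best_val:
            best_keys.append(key)         # tie: collect every maximal key
        elif best_val is None or (val is not None and val > best_val):
            best_keys, best_val = [key], val
    return best_keys

# argmax([(1, 0.5), (2, 0.9), (3, 0.9)]) returns [2, 3]
```

Ties yield multiple keys, which is why the Naive Bayes classification result is
an array of classes rather than a single class.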

/**
 * @brief Precompute all class priors and feature probabilities
 *
 * Feature probabilities are stored in a table of format
 * <pre>TABLE <em>featureProbsDestName</em> (
 *    class INTEGER,
 *    attr INTEGER,
 *    value INTEGER,
 *    cnt INTEGER,
 *    attr_cnt INTEGER
 *)</pre>
 *
 * Class priors are stored in a table of format
 * <pre>TABLE <em>classPriorsDestName</em> (
 *    class INTEGER,
 *    class_cnt INTEGER,
 *    all_cnt INTEGER
 *)</pre>
 *
 * @param trainingSource Name of relation containing the training data
 * @param trainingClassColumn Name of class column in training data
 * @param trainingAttrColumn Name of attributes-array column in training data
 * @param numAttrs Number of attributes to use for classification
 * @param featureProbsDestName Name of feature-probabilities table to create
 * @param classPriorsDestName Name of class-priors table to create
 *
 * @usage
 * Precompute feature probabilities and class priors:
 * <pre>SELECT \ref create_nb_prepared_data_tables(
 *    '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
 *    <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
 *);</pre>
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_prepared_data().
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_prepared_data_tables(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "featureProbsDestName" VARCHAR,
    "classPriorsDestName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_prepared_data_table)$$
LANGUAGE plpythonu VOLATILE;

/**
 * @brief Create a view with columns <tt>(key, nb_classification)</tt>
 *
 * The created relation will be
 *
 * <tt>{TABLE|VIEW} <em>destName</em> (key, nb_classification)</tt>
 *
 * where \c nb_classification is an array containing the most likely
 * class(es) of the record in \em classifySource identified by \c key.
 *
 * @param featureProbsSource Name of table with precomputed feature
 *        probabilities, as created with create_nb_prepared_data_tables()
 * @param classPriorsSource Name of table with precomputed class priors, as
 *        created with create_nb_prepared_data_tables()
 * @param classifySource Name of the relation that contains data to be classified
 * @param classifyKeyColumn Name of column in \em classifySource that can
 *        serve as unique identifier (the key of the source relation)
 * @param classifyAttrColumn Name of attributes-array column in \em classifySource
 * @param numAttrs Number of attributes to use for classification
 * @param destName Name of the view to create
 *
 * @note \c create_nb_classify_view can be called in an ad-hoc fashion. See
 * \ref grp_bayes for instructions.
 *
 * @usage
 * -# Create Naive Bayes classifications view:
 *  <pre>SELECT \ref create_nb_classify_view(
 *    '<em>featureProbsName</em>', '<em>classPriorsName</em>',
 *    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
 *    <em>numAttrs</em>, '<em>destName</em>'
 *);</pre>
 * -# Show Naive Bayes classifications:
 *    <pre>SELECT * FROM <em>destName</em>;</pre>
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_classification(). See there
 *     for details.
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_view(
    "featureProbsSource" VARCHAR,
    "classPriorsSource" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_view)$$
LANGUAGE plpythonu VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_view(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_view)$$
LANGUAGE plpythonu VOLATILE;


/**
 * @brief Create a view with columns <tt>(key, class, nb_prob)</tt>
 *
 * The created view will be of the following form:
 *
 * <pre>VIEW <em>destName</em> (
 *    key ANYTYPE,
 *    class INTEGER,
 *    nb_prob FLOAT8
 *)</pre>
 *
 * where \c nb_prob is the Naive Bayes probability that \c class is the true
 * class of the record in \em classifySource identified by \c key.
 *
 * @param featureProbsSource Name of table with precomputed feature
 *        probabilities, as created with create_nb_prepared_data_tables()
 * @param classPriorsSource Name of table with precomputed class priors, as
 *        created with create_nb_prepared_data_tables()
 * @param classifySource Name of the relation that contains data to be classified
 * @param classifyKeyColumn Name of column in \em classifySource that can
 *        serve as unique identifier (the key of the source relation)
 * @param classifyAttrColumn Name of attributes-array column in \em classifySource
 * @param numAttrs Number of attributes to use for classification
 * @param destName Name of the view to create
 *
 * @note \c create_nb_probs_view can be called in an ad-hoc fashion. See
 * \ref grp_bayes for instructions.
 *
 * @usage
 * -# Create Naive Bayes probabilities view:
 *  <pre>SELECT \ref create_nb_probs_view(
 *    '<em>featureProbsName</em>', '<em>classPriorsName</em>',
 *    '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
 *    <em>numAttrs</em>, '<em>destName</em>'
 *);</pre>
 * -# Show Naive Bayes probabilities:
 *    <pre>SELECT * FROM <em>destName</em>;</pre>
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_bayes_probabilities().
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_probs_view(
    "featureProbsSource" VARCHAR,
    "classPriorsSource" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_bayes_probabilities_view)$$
LANGUAGE plpythonu VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.create_nb_probs_view(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "classifySource" VARCHAR,
    "classifyKeyColumn" VARCHAR,
    "classifyAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_bayes_probabilities_view)$$
LANGUAGE plpythonu VOLATILE;


/**
 * @brief Create a SQL function mapping arrays of attribute values to the Naive
 *        Bayes classification.
 *
 * The created SQL function is bound to the given feature probabilities and
 * class priors. Its declaration will be:
 *
 * <tt>
 * FUNCTION <em>destName</em> (attributes INTEGER[], smoothingFactor DOUBLE PRECISION)
 * RETURNS INTEGER[]</tt>
 *
 * The return type is \c INTEGER[] because the Naive Bayes classification might
 * be ambiguous (in which case all of the most likely candidates are returned).
 *
 * @param featureProbsSource Name of table with precomputed feature
 *        probabilities, as created with create_nb_prepared_data_tables()
 * @param classPriorsSource Name of table with precomputed class priors, as
 *        created with create_nb_prepared_data_tables()
 * @param numAttrs Number of attributes to use for classification
 * @param destName Name of the function to create
 *
 * @note
 * Just like \ref create_nb_classify_view and \ref create_nb_probs_view,
 * \c create_nb_classify_fn can also be called in an ad-hoc fashion. See
 * \ref grp_bayes for instructions.
 *
 * @usage
 * -# Create classification function:
 *    <pre>SELECT create_nb_classify_fn(
 *    '<em>featureProbsSource</em>', '<em>classPriorsSource</em>',
 *    <em>numAttrs</em>, '<em>destName</em>'
 *);</pre>
 * -# Run classification function:
 *    <pre>SELECT <em>destName</em>(<em>attributes</em>, <em>smoothingFactor</em>);</pre>
 *
 * @note
 * On Greenplum, the generated SQL function can only be called on the master.
 *
 * @internal
 * @sa This function is a wrapper for bayes::create_classification_function().
 */
CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_fn(
    "featureProbsSource" VARCHAR,
    "classPriorsSource" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_function)$$
LANGUAGE plpythonu VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_fn(
    "trainingSource" VARCHAR,
    "trainingClassColumn" VARCHAR,
    "trainingAttrColumn" VARCHAR,
    "numAttrs" INTEGER,
    "destName" VARCHAR)
RETURNS VOID
AS $$PythonFunction(bayes, bayes, create_classification_function)$$
LANGUAGE plpythonu VOLATILE;