1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file bayes.sql_in
4  *
5  * @brief SQL functions for naive Bayes
6  * @date January 2011
7  *
8  * @sa For a brief introduction to Naive Bayes Classification, see the module
9  * description \ref grp_bayes.
10  *
11  *//* ----------------------------------------------------------------------- */
12 
13 m4_include(`SQLCommon.m4')
14 
15 /**
16 @addtogroup grp_bayes
17 
18 \warning <em> This MADlib method is still in early-stage development. There may
19 be issues that will be addressed in a future version. The interface and
20 implementation are subject to change. </em>
21 
22 @about
23 
24 Naive Bayes refers to a stochastic model where all independent variables
25 \f$ a_1, \dots, a_n \f$ (often referred to as attributes in this context)
26 independently contribute to the probability that a data point belongs to a
27 certain class \f$ c \f$. In detail, \b Bayes' theorem states that
28 \f[
29  \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n)
30  = \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)}
31  {\Pr(A_1 = a_1, \dots, A_n = a_n)}
32  \,,
33 \f]
34 and the \b naive assumption is that
35 \f[
36  \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)
37  = \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
38  \,.
39 \f]
40 Naive Bayes classification estimates feature probabilities and class priors
41 using maximum likelihood or Laplace smoothing. These parameters are then
42 used to classify new data.
43 
44 A Naive Bayes classifier computes the following formula:
45 \f[
46  \text{classify}(a_1, ..., a_n)
47  = \arg\max_c \left\{
48  \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c)
49  \right\}
50 \f]
51 where \f$ c \f$ ranges over all classes in the training data and probabilities
52 are estimated as relative frequencies from the training set.
53 There are different ways to estimate the feature probabilities
54 \f$ P(A_i = a \mid C = c) \f$. The maximum likelihood estimate takes the
55 relative frequencies. That is:
56 \f[
57  P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c}
58 \f]
59 where
60 - \f$ \#(c,i,a) \f$ denotes the # of training samples where attribute \f$ i \f$
61  is \f$ a \f$ and class is \f$ c \f$
62 - \f$ \#c \f$ denotes the # of training samples where class is \f$ c \f$.
63 
64 Since maximum likelihood sometimes results in estimates of "0", you might
65 want to use a "smoothed" estimate. To do this, you add a number of "virtual"
66 samples and assume that these samples are evenly distributed among
67 the values assumed by attribute \f$ i \f$ (that is, the set of all values
68 observed for attribute \f$ i \f$ for any class):
69 
70 \f[
71  P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i}
72 \f]
73 where
74 - \f$ \#i \f$ denotes the # of distinct values for attribute \f$ i \f$ (for all
75  classes)
76 - \f$ s \geq 0 \f$ denotes the smoothing factor.
77 
78 The case \f$ s = 1 \f$ is known as "Laplace smoothing". The case \f$ s = 0 \f$
79 trivially reduces to maximum-likelihood estimates.
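As an illustration (in Python rather than SQL, and not part of MADlib), the smoothed estimator can be sketched as follows; `samples` is a hypothetical list of `(class, attributes)` pairs:

```python
def feature_prob(samples, attr_idx, value, cls, s=1.0):
    """Smoothed estimate of P(A_i = value | C = cls).

    samples: list of (class, attributes) pairs.
    s: smoothing factor; s=0 gives the maximum-likelihood estimate,
       s=1 gives Laplace smoothing.
    """
    # #(c,i,a): samples of class cls where attribute attr_idx equals value
    c_i_a = sum(1 for c, attrs in samples if c == cls and attrs[attr_idx] == value)
    # #c: samples of class cls
    c_cnt = sum(1 for c, _ in samples if c == cls)
    # #i: distinct values of attribute attr_idx across all classes
    i_cnt = len({attrs[attr_idx] for _, attrs in samples})
    return (c_i_a + s) / (c_cnt + s * i_cnt)
```

With `s=0` the function reduces to the relative frequency \f$ \#(c,i,a) / \#c \f$.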
80 
81 \b Note:
82 (1) The probabilities computed on PostgreSQL and on Greenplum Database
83 may differ slightly due to the nature of floating-point
84 computation. Usually this is not important. However, if a data point has
85 \f[
86 P(C=c_i \mid A) \approx P(C=c_j \mid A)
87 \f]
88 for two classes, this data point might be classified into different classes on
89 PostgreSQL and on Greenplum. This leads to differences in classification
90 between PostgreSQL and Greenplum for some data sets, but it should not
91 affect the quality of the results.
92 
93 (2) When two classes have equal and highest probability among all classes,
94 the classification result is an array of these two classes, but the order
95 of the two classes is random.
96 
97 (3) The current implementation of Naive Bayes classification is only suitable
98 for discrete (categorical) attributes.
99 
100 For continuous data, a common assumption, especially for small datasets,
101 is that the continuous values associated with each class are distributed
102 according to a Gaussian distribution,
103 and then the probabilities \f$ P(A_i = a \mid C=c) \f$ can be estimated.
104 Another common technique for handling continuous values, which is better for
105 large data sets, is to use binning to discretize the values, and convert the
106 continuous data into categorical bins. These approaches are not currently
107 implemented but are planned for future releases.
108 
109 (4) One can still provide floating point data to the naive Bayes
110 classification function. Floating point numbers can be used as symbolic
111 substitutions for categorical data. The classification works best if
112 there are sufficient data points for each floating-point attribute. However,
113 if floating point numbers are used as continuous data, no warning is raised and
114 the result may not be as expected.
115 
116 @input
117 
118 The <b>training data</b> is expected to be of the following form:
119 <pre>{TABLE|VIEW} <em>trainingSource</em> (
120  ...
121  <em>trainingClassColumn</em> INTEGER,
122  <em>trainingAttrColumn</em> INTEGER[],
123  ...
124 )</pre>
125 
126 The <b>data to classify</b> is expected to be of the following form:
127 <pre>{TABLE|VIEW} <em>classifySource</em> (
128  ...
129  <em>classifyKeyColumn</em> ANYTYPE,
130  <em>classifyAttrColumn</em> INTEGER[],
131  ...
132 )</pre>
133 
134 @usage
135 
136 - Precompute feature probabilities and class priors:
137  <pre>SELECT \ref create_nb_prepared_data_tables(
138  '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
139  <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
140  );</pre>
141  This creates table <em>featureProbsName</em> for storing feature
142  probabilities and table <em>classPriorsName</em> for storing the class priors.
143 - Perform Naive Bayes classification:
144  <pre>SELECT \ref create_nb_classify_view(
145  '<em>featureProbsName</em>', '<em>classPriorsName</em>',
146  '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
147  <em>numAttrs</em>, '<em>destName</em>'
148  );</pre>
149  This creates the view <tt><em>destName</em></tt> mapping
150  <em>classifyKeyColumn</em> to the Naive Bayes classification:
151  <pre>key | nb_classification
152 ----+------------------
153 ...</pre>
154 - Compute Naive Bayes probabilities:
155  <pre>SELECT \ref create_nb_probs_view(
156  '<em>featureProbsName</em>', '<em>classPriorsName</em>',
157  '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
158  <em>numAttrs</em>, '<em>destName</em>'
159 );</pre>
160  This creates the view <tt><em>destName</em></tt> mapping
161  <em>classifyKeyColumn</em> and every single class to the Naive Bayes
162  probability:
163  <pre>key | class | nb_prob
164 ----+-------+--------
165 ...</pre>
166 - Ad-hoc execution (no precomputation):
167  Functions \ref create_nb_classify_view and
168  \ref create_nb_probs_view can be used in an ad-hoc fashion without the above
169  precomputation step. In this case, replace the function arguments
170  <pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre>
171  with
172  <pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>'</pre>
173 
174 @examp
175 
176 The following is an extremely simplified example of the precomputation
177 workflow described under @usage above, which can be verified by hand.
178 
179 -# The training and the classification data:
180 \verbatim
181 sql> SELECT * FROM training;
182  id | class | attributes
183 ----+-------+------------
184  1 | 1 | {1,2,3}
185  2 | 1 | {1,2,1}
186  3 | 1 | {1,4,3}
187  4 | 2 | {1,2,2}
188  5 | 2 | {0,2,2}
189  6 | 2 | {0,1,3}
190 (6 rows)
191 
192 sql> SELECT * FROM toclassify;
193  id | attributes
194 ----+------------
195  1 | {0,2,1}
196  2 | {1,2,3}
197 (2 rows)
198 \endverbatim
199 -# Precompute feature probabilities and class priors
200 \verbatim
201 sql> SELECT madlib.create_nb_prepared_data_tables(
202 'training', 'class', 'attributes', 3, 'nb_feature_probs', 'nb_class_priors');
203 \endverbatim
204 -# Optionally check the contents of the precomputed tables:
205 \verbatim
206 sql> SELECT * FROM nb_class_priors;
207  class | class_cnt | all_cnt
208 -------+-----------+---------
209  1 | 3 | 6
210  2 | 3 | 6
211 (2 rows)
212 
213 sql> SELECT * FROM nb_feature_probs;
214  class | attr | value | cnt | attr_cnt
215 -------+------+-------+-----+----------
216  1 | 1 | 0 | 0 | 2
217  1 | 1 | 1 | 3 | 2
218  1 | 2 | 1 | 0 | 3
219  1 | 2 | 2 | 2 | 3
220 ...
221 \endverbatim
222 -# Create the view with Naive Bayes classification and check the results:
223 \verbatim
224 sql> SELECT madlib.create_nb_classify_view (
225 'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_classify_view_fast');
226 
227 sql> SELECT * FROM nb_classify_view_fast;
228  key | nb_classification
229 -----+-------------------
230  1 | {2}
231  2 | {1}
232 (2 rows)
233 \endverbatim
234 -# Look at the probabilities for each class (note that we use Laplace smoothing):
235 \verbatim
236 sql> SELECT madlib.create_nb_probs_view (
237 'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_probs_view_fast');
238 
239 sql> SELECT * FROM nb_probs_view_fast;
240  key | class | nb_prob
241 -----+-------+---------
242  1 | 1 | 0.4
243  1 | 2 | 0.6
244  2 | 1 | 0.75
245  2 | 2 | 0.25
246 (4 rows)
247 \endverbatim
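These figures can be checked by hand. A short Python sketch (an illustration, not part of MADlib) that recomputes the Laplace-smoothed, normalized probabilities from the example training set:

```python
# Example training set from above: (class, attributes) pairs.
TRAIN = [
    (1, (1, 2, 3)), (1, (1, 2, 1)), (1, (1, 4, 3)),
    (2, (1, 2, 2)), (2, (0, 2, 2)), (2, (0, 1, 3)),
]

def unnormalized(attrs, cls, s=1.0):
    """P(C=cls) times the product of smoothed P(A_i = attrs[i] | C=cls)."""
    c_cnt = sum(1 for c, _ in TRAIN if c == cls)
    p = c_cnt / len(TRAIN)  # class prior
    for i, a in enumerate(attrs):
        c_i_a = sum(1 for c, t in TRAIN if c == cls and t[i] == a)
        i_cnt = len({t[i] for _, t in TRAIN})  # distinct values of attribute i
        p *= (c_i_a + s) / (c_cnt + s * i_cnt)
    return p

def nb_prob(attrs, cls):
    """Normalized probability, as reported in nb_probs_view_fast."""
    classes = {c for c, _ in TRAIN}
    return unnormalized(attrs, cls) / sum(unnormalized(attrs, c) for c in classes)
```

For example, `nb_prob((0, 2, 1), 2)` evaluates to 0.6 and `nb_prob((1, 2, 3), 1)` to 0.75, matching the view output above.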
248 
249 @literature
250 
251 [1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter
252  <em>Generative and Discriminative Classifiers: Naive Bayes and Logistic
253  Regression</em>, available at: http://www.cs.cmu.edu/~tom/NewChapters.html
254 
255 [2] Wikipedia, Naive Bayes classifier,
256  http://en.wikipedia.org/wiki/Naive_Bayes_classifier
257 
258 @sa File bayes.sql_in documenting the SQL functions.
259 
260 @internal
261 @sa namespace bayes (documenting the implementation in Python)
262 @endinternal
263 
264 */
265 
266 -- Begin of argmax definition
267 
268 CREATE TYPE MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS (
269  args INTEGER[],
270  value DOUBLE PRECISION
271 );
272 
273 CREATE FUNCTION MADLIB_SCHEMA.argmax_transition(
274  oldmax MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
275  newkey INTEGER,
276  newvalue DOUBLE PRECISION)
277 RETURNS MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS
278 $$
279  SELECT CASE WHEN $3 < $1.value OR $2 IS NULL OR ($3 IS NULL AND NOT $1.value IS NULL) THEN $1
280  WHEN $3 = $1.value OR ($3 IS NULL AND $1.value IS NULL AND NOT $1.args IS NULL)
281  THEN ($1.args || $2, $3)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
282  ELSE (array[$2], $3)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
283  END
284 $$
285 LANGUAGE sql IMMUTABLE;
286 
287 CREATE FUNCTION MADLIB_SCHEMA.argmax_combine(
288  max1 MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
289  max2 MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE)
290 RETURNS MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE AS
291 $$
292  -- If SQL guaranteed short-circuit evaluation, the following could become
293  -- shorter. Unfortunately, this is not the case.
294  -- Section 6.3.3.3 of ISO/IEC 9075-1:2008 Framework (SQL/Framework):
295  --
296  -- "However, it is implementation-dependent whether expressions are
297  -- actually evaluated left to right, particularly when operands or
298  -- operators might cause conditions to be raised or if the results of the
299  -- expressions can be determined without completely evaluating all parts
300  -- of the expression."
301  --
302  -- Again, we rely on the optimizer to do its job.
303  SELECT CASE WHEN $1 IS NULL THEN $2
304  WHEN $2 IS NULL THEN $1
305  WHEN ($1.value = $2.value) OR ($1.value IS NULL AND $2.value IS NULL)
306  THEN ($1.args || $2.args, $1.value)::MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE
307  WHEN $1.value IS NULL OR $1.value < $2.value THEN $2
308  ELSE $1
309  END
310 $$
311 LANGUAGE sql IMMUTABLE;
312 
313 CREATE FUNCTION MADLIB_SCHEMA.argmax_final(
314  finalstate MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE)
315 RETURNS INTEGER[] AS
316 $$
317  SELECT $1.args
318 $$
319 LANGUAGE sql IMMUTABLE;
320 
321 /**
322  * @internal
323  * @brief Argmax: Return the key of the row for which value is maximal
324  *
325  * The "index set" of the argmax function is of type INTEGER and we range over
326  * DOUBLE PRECISION values. It is not required that all keys are distinct.
327  *
328  * @note
329  * argmax does not exploit indexes; its running time is always
330  * \f$ \Theta(n) \f$, so it should only be used where a full scan is acceptable.
331  *
332  * @implementation
333  * The implementation is in SQL, with a flavor of functional programming.
334  * The hope is that the optimizer does a good job here.
335  */
336 CREATE AGGREGATE MADLIB_SCHEMA.argmax(/*+ key */ INTEGER, /*+ value */ DOUBLE PRECISION) (
337  SFUNC=MADLIB_SCHEMA.argmax_transition,
338  STYPE=MADLIB_SCHEMA.ARGS_AND_VALUE_DOUBLE,
339  m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.argmax_combine,')
340  FINALFUNC=MADLIB_SCHEMA.argmax_final
341 );
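A minimal Python sketch of the aggregate's semantics (an illustration only; the SQL NULL cases handled by the transition function above are omitted):

```python
def argmax(pairs):
    """Collect every key whose value equals the maximum, mirroring the
    tie-collecting behavior of MADLIB_SCHEMA.argmax."""
    best_keys, best_val = [], None
    for key, value in pairs:
        if best_val is None or value > best_val:
            # New maximum: restart the key list
            best_keys, best_val = [key], value
        elif value == best_val:
            # Tie: append the key, preserving all maximizers
            best_keys.append(key)
    return best_keys
```

This is why the Naive Bayes classification column is an array: when two classes tie for the highest probability, both keys are returned.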
342 
343 
344 /**
345  * @brief Precompute all class priors and feature probabilities
346  *
347  * Feature probabilities are stored in a table of format
348  * <pre>TABLE <em>featureProbsDestName</em> (
349  * class INTEGER,
350  * attr INTEGER,
351  * value INTEGER,
352  * cnt INTEGER,
353  * attr_cnt INTEGER
354  *)</pre>
355  *
356  * Class priors are stored in a table of format
357  * <pre>TABLE <em>classPriorsDestName</em> (
358  * class INTEGER,
359  * class_cnt INTEGER,
360  * all_cnt INTEGER
361  *)</pre>
362  *
363  * @param trainingSource Name of relation containing the training data
364  * @param trainingClassColumn Name of class column in training data
365  * @param trainingAttrColumn Name of attributes-array column in training data
366  * @param numAttrs Number of attributes to use for classification
367  * @param featureProbsDestName Name of feature-probabilities table to create
368  * @param classPriorsDestName Name of class-priors table to create
369  *
370  * @usage
371  * Precompute feature probabilities and class priors:
372  * <pre>SELECT \ref create_nb_prepared_data_tables(
373  * '<em>trainingSource</em>', '<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>',
374  * <em>numAttrs</em>, '<em>featureProbsName</em>', '<em>classPriorsName</em>'
375  *);</pre>
376  *
377  * @internal
378  * @sa This function is a wrapper for bayes::create_prepared_data().
379  */
380 CREATE FUNCTION MADLIB_SCHEMA.create_nb_prepared_data_tables(
381  "trainingSource" VARCHAR,
382  "trainingClassColumn" VARCHAR,
383  "trainingAttrColumn" VARCHAR,
384  "numAttrs" INTEGER,
385  "featureProbsDestName" VARCHAR,
386  "classPriorsDestName" VARCHAR)
387 RETURNS VOID
388 AS $$PythonFunction(bayes, bayes, create_prepared_data_table)$$
389 LANGUAGE plpythonu VOLATILE;
390 
391 /**
392  * @brief Create a view with columns <tt>(key, nb_classification)</tt>
393  *
394  * The created relation will be
395  *
396  * <tt>{TABLE|VIEW} <em>destName</em> (key, nb_classification)</tt>
397  *
398  * where \c nb_classification is an array containing the most likely
399  * class(es) of the record in \em classifySource identified by \c key.
400  *
401  * @param featureProbsSource Name of table with precomputed feature
402  * probabilities, as created with create_nb_prepared_data_tables()
403  * @param classPriorsSource Name of table with precomputed class priors, as
404  * created with create_nb_prepared_data_tables()
405  * @param classifySource Name of the relation that contains data to be classified
406  * @param classifyKeyColumn Name of column in \em classifySource that can
407  * serve as unique identifier (the key of the source relation)
408  * @param classifyAttrColumn Name of attributes-array column in \em classifySource
409  * @param numAttrs Number of attributes to use for classification
410  * @param destName Name of the view to create
411  *
412  * @note \c create_nb_classify_view can be called in an ad-hoc fashion. See
413  * \ref grp_bayes for instructions.
414  *
415  * @usage
416  * -# Create Naive Bayes classifications view:
417  * <pre>SELECT \ref create_nb_classify_view(
418  * '<em>featureProbsName</em>', '<em>classPriorsName</em>',
419  * '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
420  * <em>numAttrs</em>, '<em>destName</em>'
421  *);</pre>
422  * -# Show Naive Bayes classifications:
423  * <pre>SELECT * FROM <em>destName</em>;</pre>
424  *
425  * @internal
426  * @sa This function is a wrapper for bayes::create_classification(). See there
427  * for details.
428  */
429 CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_view(
430  "featureProbsSource" VARCHAR,
431  "classPriorsSource" VARCHAR,
432  "classifySource" VARCHAR,
433  "classifyKeyColumn" VARCHAR,
434  "classifyAttrColumn" VARCHAR,
435  "numAttrs" INTEGER,
436  "destName" VARCHAR)
437 RETURNS VOID
438 AS $$PythonFunction(bayes, bayes, create_classification_view)$$
439 LANGUAGE plpythonu VOLATILE;
440 
441 CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_view(
442  "trainingSource" VARCHAR,
443  "trainingClassColumn" VARCHAR,
444  "trainingAttrColumn" VARCHAR,
445  "classifySource" VARCHAR,
446  "classifyKeyColumn" VARCHAR,
447  "classifyAttrColumn" VARCHAR,
448  "numAttrs" INTEGER,
449  "destName" VARCHAR)
450 RETURNS VOID
451 AS $$PythonFunction(bayes, bayes, create_classification_view)$$
452 LANGUAGE plpythonu VOLATILE;
453 
454 
455 /**
456  * @brief Create view with columns <tt>(key, class, nb_prob)</tt>
457  *
458  * The created view will be of the following form:
459  *
460  * <pre>VIEW <em>destName</em> (
461  * key ANYTYPE,
462  * class INTEGER,
463  * nb_prob FLOAT8
464  *)</pre>
465  *
466  * where \c nb_prob is the Naive-Bayes probability that \c class is the true
467  * class of the record in \em classifySource identified by \c key.
468  *
469  * @param featureProbsSource Name of table with precomputed feature
470  * probabilities, as created with create_nb_prepared_data_tables()
471  * @param classPriorsSource Name of table with precomputed class priors, as
472  * created with create_nb_prepared_data_tables()
473  * @param classifySource Name of the relation that contains data to be classified
474  * @param classifyKeyColumn Name of column in \em classifySource that can
475  * serve as unique identifier (the key of the source relation)
476  * @param classifyAttrColumn Name of attributes-array column in \em classifySource
477  * @param numAttrs Number of attributes to use for classification
478  * @param destName Name of the view to create
479  *
480  * @note \c create_nb_probs_view can be called in an ad-hoc fashion. See
481  * \ref grp_bayes for instructions.
482  *
483  * @usage
484  * -# Create Naive Bayes probabilities view:
485  * <pre>SELECT \ref create_nb_probs_view(
486  * '<em>featureProbsName</em>', '<em>classPriorsName</em>',
487  * '<em>classifySource</em>', '<em>classifyKeyColumn</em>', '<em>classifyAttrColumn</em>',
488  * <em>numAttrs</em>, '<em>destName</em>'
489  *);</pre>
490  * -# Show Naive Bayes probabilities:
491  * <pre>SELECT * FROM <em>destName</em>;</pre>
492  *
493  * @internal
494  * @sa This function is a wrapper for bayes::create_bayes_probabilities().
495  */
496 CREATE FUNCTION MADLIB_SCHEMA.create_nb_probs_view(
497  "featureProbsSource" VARCHAR,
498  "classPriorsSource" VARCHAR,
499  "classifySource" VARCHAR,
500  "classifyKeyColumn" VARCHAR,
501  "classifyAttrColumn" VARCHAR,
502  "numAttrs" INTEGER,
503  "destName" VARCHAR)
504 RETURNS VOID
505 AS $$PythonFunction(bayes, bayes, create_bayes_probabilities_view)$$
506 LANGUAGE plpythonu VOLATILE;
507 
508 CREATE FUNCTION MADLIB_SCHEMA.create_nb_probs_view(
509  "trainingSource" VARCHAR,
510  "trainingClassColumn" VARCHAR,
511  "trainingAttrColumn" VARCHAR,
512  "classifySource" VARCHAR,
513  "classifyKeyColumn" VARCHAR,
514  "classifyAttrColumn" VARCHAR,
515  "numAttrs" INTEGER,
516  "destName" VARCHAR)
517 RETURNS VOID
518 AS $$PythonFunction(bayes, bayes, create_bayes_probabilities_view)$$
519 LANGUAGE plpythonu VOLATILE;
520 
521 
522 /**
523  * @brief Create a SQL function mapping arrays of attribute values to the Naive
524  * Bayes classification.
525  *
526  * The created SQL function is bound to the given feature probabilities and
527  * class priors. Its declaration will be:
528  *
529  * <tt>
530  * FUNCTION <em>destName</em> (attributes INTEGER[], smoothingFactor DOUBLE PRECISION)
531  * RETURNS INTEGER[]</tt>
532  *
533  * The return type is \c INTEGER[] because the Naive Bayes classification might
534  * be ambiguous (in which case all of the most likely candidates are returned).
535  *
536  * @param featureProbsSource Name of table with precomputed feature
537  * probabilities, as created with create_nb_prepared_data_tables()
538  * @param classPriorsSource Name of table with precomputed class priors, as
539  * created with create_nb_prepared_data_tables()
540  * @param numAttrs Number of attributes to use for classification
541  * @param destName Name of the function to create
542  *
543  * @note
544  * Just like \ref create_nb_classify_view and \ref create_nb_probs_view,
545  * also \c create_nb_classify_fn can be called in an ad-hoc fashion. See
546  * \ref grp_bayes for instructions.
547  *
548  * @usage
549  * -# Create classification function:
550  * <pre>SELECT create_nb_classify_fn(
551  * '<em>featureProbsSource</em>', '<em>classPriorsSource</em>',
552  * <em>numAttrs</em>, '<em>destName</em>'
553  *);</pre>
554  * -# Run classification function:
555  * <pre>SELECT <em>destName</em>(<em>attributes</em>, <em>smoothingFactor</em>);</pre>
556  *
557  * @note
558  * On Greenplum, the generated SQL function can only be called on the master.
559  *
560  * @internal
561  * @sa This function is a wrapper for bayes::create_classification_function().
562  */
563 CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_fn(
564  "featureProbsSource" VARCHAR,
565  "classPriorsSource" VARCHAR,
566  "numAttrs" INTEGER,
567  "destName" VARCHAR)
568 RETURNS VOID
569 AS $$PythonFunction(bayes, bayes, create_classification_function)$$
570 LANGUAGE plpythonu VOLATILE;
571 
572 CREATE FUNCTION MADLIB_SCHEMA.create_nb_classify_fn(
573  "trainingSource" VARCHAR,
574  "trainingClassColumn" VARCHAR,
575  "trainingAttrColumn" VARCHAR,
576  "numAttrs" INTEGER,
577  "destName" VARCHAR)
578 RETURNS VOID
579 AS $$PythonFunction(bayes, bayes, create_classification_function)$$
580 LANGUAGE plpythonu VOLATILE;