Naive Bayes refers to a stochastic model where all independent variables \( a_1, \dots, a_n \) (often referred to as attributes in this context) independently contribute to the probability that a data point belongs to a certain class \( c \).
Naive Bayes classification estimates feature probabilities and class priors using maximum likelihood or Laplacian smoothing. For numeric attributes, Gaussian smoothing can be used to estimate the feature probabilities. These parameters are then used to classify new data.
For data with only categorical attributes, precompute feature probabilities and class priors using the following function:
create_nb_prepared_data_tables ( trainingSource, trainingClassColumn, trainingAttrColumn, numAttrs, featureProbsName, classPriorsName )
For data containing both categorical and numeric attributes, use the following form to precompute the Gaussian parameters (mean and variance) for numeric attributes alongside the feature probabilities for categorical attributes and class priors.
create_nb_prepared_data_tables ( trainingSource, trainingClassColumn, trainingAttrColumn, numericAttrsColumnIndices, numAttrs, featureProbsName, numericAttrParamsName, classPriorsName )
The trainingSource is expected to be of the following form:
{TABLE|VIEW} trainingSource ( ... trainingClassColumn INTEGER, trainingAttrColumn INTEGER[] OR NUMERIC[] OR FLOAT8[], ... )
numericAttrsColumnIndices should be of type TEXT, specified as an array of indices (starting from 1) in the trainingAttrColumn attributes-array that correspond to numeric attributes.
The two output tables are:
featureProbsName: stores the feature probabilities
classPriorsName: stores the class priors
In addition to the above, if the function specifying numeric attributes is used, an additional table numericAttrParamsName is created which stores the Gaussian parameters for the numeric attributes.
Perform Naive Bayes classification:
create_nb_classify_view ( featureProbsName, classPriorsName, classifySource, classifyKeyColumn, classifyAttrColumn, numAttrs, destName )
For data with numeric attributes, use the following version:
create_nb_classify_view ( featureProbsName, classPriorsName, classifySource, classifyKeyColumn, classifyAttrColumn, numAttrs, numericAttrParamsName, destName )
The data to classify is expected to be of the following form:
{TABLE|VIEW} classifySource ( ... classifyKeyColumn ANYTYPE, classifyAttrColumn INTEGER[], ... )
This function creates the view destName, mapping classifyKeyColumn to the Naive Bayes classification:
 key | nb_classification
-----+-------------------
 ...
Compute Naive Bayes probabilities.
create_nb_probs_view( featureProbsName, classPriorsName, classifySource, classifyKeyColumn, classifyAttrColumn, numAttrs, destName )
For data with numeric attributes, use the following version:
create_nb_probs_view( featureProbsName, classPriorsName, classifySource, classifyKeyColumn, classifyAttrColumn, numAttrs, numericAttrParamsName, destName )
This creates the view destName, mapping classifyKeyColumn and every single class to the Naive Bayes probability:
 key | class | nb_prob
-----+-------+---------
 ...
The functions create_nb_classify_view() and create_nb_probs_view() can also be used in an ad hoc fashion, without the precomputation step. In this case, replace the function arguments
'featureProbsName', 'classPriorsName'
with
'trainingSource', 'trainingClassColumn', 'trainingAttrColumn'
for data without any numeric attributes and with
'trainingSource', 'trainingClassColumn', 'trainingAttrColumn', 'numericAttrsColumnIndices'
for data containing numeric attributes as well.
If a data point has
\[ P(C=c_i \mid A) \approx P(C=c_j \mid A) \]
for two classes, this data point might be classified into different classes on PostgreSQL and Greenplum. This leads to differences in classifications between PostgreSQL and Greenplum for some data sets, but it should not affect the quality of the results.
For numeric attributes, the feature probability is estimated with a Gaussian density:
\[ P(A_i=a \mid C=c) = \frac{1}{\sqrt{2\pi\sigma^{2}_c}}\exp\left(-\frac{(a-\mu_c)^{2}}{2\sigma^{2}_c}\right) \]
where \(\mu_c\) and \(\sigma^{2}_c\) are the population mean and variance of the attribute for the class \(c\).
The following is an extremely simplified example of the above option #1, which can be verified by hand.
SELECT * FROM training;
Result:
 id | class | attributes
----+-------+------------
  1 |     1 | {1,2,3}
  2 |     1 | {1,2,1}
  3 |     1 | {1,4,3}
  4 |     2 | {1,2,2}
  5 |     2 | {0,2,2}
  6 |     2 | {0,1,3}
(6 rows)
SELECT * FROM toclassify;
Result:
 id | attributes
----+------------
  1 | {0,2,1}
  2 | {1,2,3}
(2 rows)
SELECT madlib.create_nb_prepared_data_tables( 'training', 'class', 'attributes', 3, 'nb_feature_probs', 'nb_class_priors' );
SELECT * FROM nb_class_priors;
Result:
 class | class_cnt | all_cnt
-------+-----------+---------
     1 |         3 |       6
     2 |         3 |       6
(2 rows)
SELECT * FROM nb_feature_probs;
Result:
 class | attr | value | cnt | attr_cnt
-------+------+-------+-----+----------
     1 |    1 |     0 |   0 |        2
     1 |    1 |     1 |   3 |        2
     1 |    2 |     1 |   0 |        3
     1 |    2 |     2 |   2 |        3
...
SELECT madlib.create_nb_classify_view( 'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_classify_view_fast' );
SELECT * FROM nb_classify_view_fast;
Result:
 key | nb_classification
-----+-------------------
   1 | {2}
   2 | {1}
(2 rows)
SELECT madlib.create_nb_probs_view( 'nb_feature_probs', 'nb_class_priors', 'toclassify', 'id', 'attributes', 3, 'nb_probs_view_fast' );
SELECT * FROM nb_probs_view_fast;
Result:
 key | class | nb_prob
-----+-------+---------
   1 |     1 |     0.4
   1 |     2 |     0.6
   2 |     1 |    0.75
   2 |     2 |    0.25
(4 rows)
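The probabilities above can be checked by hand. The following Python sketch is not part of MADlib; it recomputes them directly from the training table, under the assumption that Laplace smoothing with s = 1 is applied (this assumption reproduces the values shown). The helper name nb_probs is hypothetical.

```python
from fractions import Fraction

# Training rows from the example: (class, attributes)
training = [
    (1, [1, 2, 3]), (1, [1, 2, 1]), (1, [1, 4, 3]),
    (2, [1, 2, 2]), (2, [0, 2, 2]), (2, [0, 1, 3]),
]

def nb_probs(point, rows, s=1):
    """Normalized Naive Bayes probability per class (hypothetical helper).

    s is the smoothing factor; s = 1 (Laplace smoothing) is an assumption.
    """
    classes = sorted({c for c, _ in rows})
    scores = {}
    for c in classes:
        in_class = [attrs for cls, attrs in rows if cls == c]
        score = Fraction(len(in_class), len(rows))  # class prior
        for i, value in enumerate(point):
            # number of samples in class c whose attribute i equals value
            cnt = sum(1 for attrs in in_class if attrs[i] == value)
            # number of distinct values of attribute i over all classes
            distinct = len({attrs[i] for _, attrs in rows})
            score *= Fraction(cnt + s, len(in_class) + s * distinct)
        scores[c] = score
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}

probs_1 = nb_probs([0, 2, 1], training)
probs_2 = nb_probs([1, 2, 3], training)
print({c: float(p) for c, p in probs_1.items()})  # {1: 0.4, 2: 0.6}
print({c: float(p) for c, p in probs_2.items()})  # {1: 0.75, 2: 0.25}
```

Exact fractions are used so that the comparison with the view output does not depend on floating point rounding.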
The following is an example using a dataset with both numeric and categorical attributes:
SELECT * FROM gaussian_data;
Result:
 id | sex | attributes
----+-----+---------------
  1 |   1 | {6,180,12}
  2 |   1 | {5.92,190,12}
  3 |   1 | {5.58,170,11}
  4 |   1 | {5.92,165,11}
  5 |   2 | {5,100,6}
  6 |   2 | {5.5,150,6}
  7 |   2 | {5.42,130,7}
  8 |   2 | {5.75,150,8}
(8 rows)
SELECT * FROM gaussian_test;
Result:
 id | sex | attributes
----+-----+--------------
  9 |   1 | {5.8,180,11}
 10 |   2 | {5,160,6}
(2 rows)
SELECT madlib.create_nb_prepared_data_tables( 'gaussian_data', 'sex', 'attributes', 'ARRAY[1,2]', 3, 'categ_feature_probs', 'numeric_attr_params', 'class_priors' );
SELECT * FROM class_priors;
Result:
 class | class_cnt | all_cnt
-------+-----------+---------
     1 |         4 |       8
     2 |         4 |       8
(2 rows)
SELECT * FROM categ_feature_probs;
Result:
 class | attr | value | cnt | attr_cnt
-------+------+-------+-----+----------
     2 |    3 |     6 |   2 |        5
     1 |    3 |    12 |   2 |        5
     2 |    3 |     7 |   1 |        5
     1 |    3 |    11 |   2 |        5
     2 |    3 |     8 |   1 |        5
     2 |    3 |    12 |   0 |        5
     1 |    3 |     6 |   0 |        5
     2 |    3 |    11 |   0 |        5
     1 |    3 |     8 |   0 |        5
     1 |    3 |     7 |   0 |        5
(10 rows)
SELECT * FROM numeric_attr_params;
Result:
 class | attr |      attr_mean       |        attr_var
-------+------+----------------------+------------------------
     1 |    1 |   5.8550000000000000 | 0.03503333333333333333
     1 |    2 | 176.2500000000000000 |   122.9166666666666667
     2 |    1 |   5.4175000000000000 | 0.09722500000000000000
     2 |    2 | 132.5000000000000000 |   558.3333333333333333
(4 rows)
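The Gaussian parameters above can be recomputed outside the database. The following Python sketch (not MADlib code) derives the per-class mean and variance of the two numeric attributes from gaussian_data. It uses the sample variance, which divides by n - 1; that choice is an inference from the numbers in the table, which it reproduces, rather than something the text states.

```python
from statistics import mean, variance

# Numeric attributes 1 and 2 of gaussian_data, grouped by class (sex)
numeric = {
    1: [(6, 180), (5.92, 190), (5.58, 170), (5.92, 165)],
    2: [(5, 100), (5.5, 150), (5.42, 130), (5.75, 150)],
}

params = {}  # (class, attr) -> (mean, variance)
for cls, rows in numeric.items():
    # zip(*rows) turns the rows into one column per attribute
    for attr, column in enumerate(zip(*rows), start=1):
        # statistics.variance is the sample variance (divides by n - 1)
        params[(cls, attr)] = (mean(column), variance(column))

for key, (m, v) in sorted(params.items()):
    print(key, m, v)
```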
SELECT madlib.create_nb_classify_view( 'categ_feature_probs', 'class_priors', 'gaussian_test', 'id', 'attributes', 3, 'numeric_attr_params', 'classify_view' );
SELECT * FROM classify_view;
Result:
 key | nb_classification
-----+-------------------
   9 | {1}
  10 | {2}
(2 rows)
SELECT madlib.create_nb_probs_view( 'categ_feature_probs', 'class_priors', 'gaussian_test', 'id', 'attributes', 3, 'numeric_attr_params', 'probs_view' );
SELECT * FROM probs_view;
Result:
 key | class |       nb_prob
-----+-------+----------------------
   9 |     1 |    0.993556745948775
   9 |     2 |  0.00644325405122553
  10 |     1 | 5.74057538627122e-05
  10 |     2 |    0.999942594246137
(4 rows)
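To see how the numeric and categorical parts combine, the following Python sketch (not MADlib code) multiplies the class prior, a Gaussian density for attributes 1 and 2, and a smoothed relative frequency for attribute 3, then normalizes over the classes. It assumes Laplace smoothing with s = 1 for the categorical attribute. Under that assumption it closely reproduces the probabilities in probs_view. The helper names gaussian_pdf and nb_probs are hypothetical.

```python
import math

# Values copied from class_priors, numeric_attr_params and categ_feature_probs
priors = {1: 4 / 8, 2: 4 / 8}
gauss = {  # (class, attr) -> (mean, variance)
    (1, 1): (5.855, 0.0350333333333333),
    (1, 2): (176.25, 122.916666666667),
    (2, 1): (5.4175, 0.097225),
    (2, 2): (132.5, 558.333333333333),
}
# (class, value) -> cnt for categorical attribute 3 (only the values needed here)
categ_cnt = {(1, 11): 2, (2, 11): 0, (1, 6): 0, (2, 6): 2}
class_cnt = 4   # samples per class
attr_cnt = 5    # distinct values of attribute 3
s = 1           # smoothing factor; s = 1 is an assumption

def gaussian_pdf(a, mu, var):
    """Gaussian density with mean mu and variance var, evaluated at a."""
    return math.exp(-(a - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def nb_probs(point):
    """Normalized Naive Bayes probability per class for one test row."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr in (1, 2):  # numeric attributes
            score *= gaussian_pdf(point[attr - 1], *gauss[(c, attr)])
        # categorical attribute 3, smoothed relative frequency
        score *= (categ_cnt[(c, point[2])] + s) / (class_cnt + s * attr_cnt)
        scores[c] = score
    total = sum(scores.values())
    return {c: v / total for c, v in scores.items()}

print(nb_probs([5.8, 180, 11]))  # key 9
print(nb_probs([5, 160, 6]))     # key 10
```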
In detail, Bayes' theorem states that
\[ \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n) = \frac{\Pr(C = c) \cdot \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)} {\Pr(A_1 = a_1, \dots, A_n = a_n)} \,, \]
and the naive assumption is that
\[ \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c) = \prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \,. \]
Naive Bayes classification estimates feature probabilities and class priors using maximum likelihood or Laplacian smoothing. These parameters are then used to classify new data.
A Naive Bayes classifier computes the following formula:
\[ \text{classify}(a_1, ..., a_n) = \arg\max_c \left\{ \Pr(C = c) \cdot \prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \right\} \]
where \( c \) ranges over all classes in the training data and probabilities are estimated with relative frequencies from the training set. There are different ways to estimate the feature probabilities \( P(A_i = a \mid C = c) \). The maximum likelihood estimate takes the relative frequencies. That is:
\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c} \]
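where
\( \#(c,i,a) \) denotes the number of training samples in class \( c \) whose attribute \( i \) has the value \( a \), and
\( \#c \) denotes the number of training samples in class \( c \).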
Since the maximum likelihood sometimes results in estimates of "0", you might want to use a "smoothed" estimate. To do this, you add a number of "virtual" samples and make the assumption that these samples are evenly distributed among the values assumed by attribute \( i \) (that is, the set of all values observed for attribute \( i \) for any class):
\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i} \]
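where
\( \#i \) denotes the number of distinct values observed for attribute \( i \) (for any class), and
\( s \geq 0 \) denotes the smoothing factor.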
The case \( s = 1 \) is known as "Laplace smoothing". The case \( s = 0 \) trivially reduces to maximum-likelihood estimates.
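As a worked instance, take the training table from the categorical example: class 1 has three samples, none of which has attribute 1 equal to 0, and attribute 1 takes two distinct values (0 and 1) over all classes. With \( s = 0 \) and \( s = 1 \) respectively, the estimates are
\[ P(A_1 = 0 \mid C = 1) = \frac{0}{3} = 0 \qquad \text{and} \qquad P(A_1 = 0 \mid C = 1) = \frac{0 + 1}{3 + 1 \cdot 2} = \frac{1}{5} \,. \]
The smoothed estimate keeps a single unseen attribute value from forcing the whole product for that class to zero.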
[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression available at: http://www.cs.cmu.edu/~tom/NewChapters.html
[2] Wikipedia, Naive Bayes classifier, http://en.wikipedia.org/wiki/Naive_Bayes_classifier