User Documentation
assoc_rules.sql_in
Go to the documentation of this file.
00001 /* ----------------------------------------------------------------------- *//**
00002  *
00003  * @file assoc_rules.sql_in
00004  *
00005  * @brief The \ref assoc_rules function computes association rules for a given set of data. The data is assumed to have two dimensions; items (between which we are trying to discover associations), and a transaction id. This tranaction id groups the items by event and could also be a user id, date, etc. depending on the context of the data. This function assumes the data is stored in two columns with one transaction id and one item per row.
00006 
00007  * @date June 2011
00008  * @modified August 2012
00009  *
00010  * @sa For a brief introduction to the association rules implementation, see the module
00011  * description \ref grp_assoc_rules.
00012  *
00013  *//* ----------------------------------------------------------------------- */
00014 
00015 m4_include(`SQLCommon.m4')
00016 
00017 
00018 /**
00019 @addtogroup grp_assoc_rules
00020 
00021 @about
00022 This module implements the association rules data mining technique on a transactional data set. Given the names of a table and the columns, minimum support and confidence values, this function generates all single and multidimensional association rules that meet the minimum thresholds.
00023 
00024 Association rule mining is a widely used technique for discovering relationships between variables in a large data set (e.g items in a store that are commonly purchased together). The classic market basket analysis example using association rules is the "beer and diapers" rule. According to data mining urban legend, a study of customers' purchase behavior in a supermarket found that men often purchased beer and diapers together. After making this discovery, the managers strategically placed beer and diapers closer together on the shelves and saw a dramatic increase in sales. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and several other fields.
00025 
00026 This type of data mining algorithm uses transactional data. Every transaction event has a unique identification, and each transaction consists of a set of items (or itemset). Purchases are considered binary (either it was purchased or not), and this implementation does not take into consideration the quantity of each item. For the MADlib association rules function, it is assumed that the data is stored in two columns with one item and transaction id per row. Transactions with multiple items will span multiple rows with one row per item.
00027 
00028 <pre>
00029      tran_id | product
00030     ---------+---------
00031            1 | 1
00032            1 | 2
00033            1 | 3
00034            1 | 4
00035            2 | 3
00036            2 | 4
00037            2 | 5
00038            3 | 1
00039            3 | 4
00040            3 | 6
00041     ...
00042 </pre>
00043 
00044 \b Rules
00045 
00046 Association rules take the form "If X, then Y", where X and Y are non-empty itemsets. X and Y are called the antecedent and consequent, or the left-hand-side and right-hand-side, of the rule respectively. Using our previous example, the association rule may state "If {diapers}, then {beer}" with .2 support and .85 confidence.
00047 
00048 Given any association rule "If X, then Y", the association rules function will also calculate the following metrics:
00049 - Support: The ratio of transactions that contain X to all transactions, T
00050 \f[
00051 S (X) = \frac{Total X}{Total transactions}
00052 \f]
00053 
00054 - Confidence: The ratio of transactions that contain \f$ X,Y \f$ to transactions that contain \f$ X \f$. One could view this metric as the conditional probability of \f$ Y \f$ , given \f$ X \f$ . \f$ P(Y|X) \f$
00055 
00056 \f[
00057 C (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X)}
00058 \f]
00059 
00060 - Lift: The ratio of observed support of \f$ X,Y \f$ to the expected support of \f$ X,Y \f$ , assuming \f$ X \f$ and \f$ Y \f$ are independent.
00061 \f[
00062 L (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X) \cdot s(Y)}
00063 \f]
00064 
00065 - Conviction: The ratio of expected support of \f$ X \f$ occurring without\f$ Y \f$ assuming \f$ X \f$ and \f$ \neg Y \f$ are independent, to the observed support of \f$ X \f$ occuring without \f$ Y \f$. If conviction is greater than 1, then this metric shows that incorrect predictions ( \f$ X \Rightarrow Y \f$ ) occur less often than if these two actions were independent. This metric can be viewed as the ratio that the association rule would be incorrect if the actions were independent (i.e. a conviction of 1.5 indicates that if the variables were independent, this rule would be incorrect 50% more often.)
00066 
00067 \f[
00068 Conv (X \Rightarrow Y) = \frac{1 - S(Y)}{1 - C(X \Rightarrow Y)}
00069 \f]
00070 
00071 
00072 \b Apriori \b algorithm
00073 
00074 Although there are many algorithms that generate association rules, the classic algorithm used is called Apriori (which we implemented in this module). It is a breadth-first search, as opposed to depth-first searches like eclat. Frequent itemsets of order \f$ n \f$ are generated from sets of order \f$ n - 1 \f$. Using the downward closure property, all sets must have frequent subsets. There are two steps in this algorithm; generating frequent itemsets, and using these itemsets to construct the association rules. A simplified version of the algorithm is as follows, and assumes a minimum level of support and confidence is provided:
00075 
00076 \e Initial \e step
00077 -# Generate all itemsets of order 1
00078 -# Eliminate itemsets that have support is less than minimum support
00079 
00080 \e Main \e algorithm
00081 -# For \f$ n \ge 2 \f$, generate itemsets of order \f$ n \f$ by combining the itemsets of order \f$ n - 1 \f$. This is done by doing the union of two itemsets that have identical items except one.
00082 -# Eliminate itemsets that have (n-1) order subsets with insufficient support
00083 -# Eliminate itemsets with insufficient support
00084 -# Repeat until itemsets cannot be generated
00085 
00086 \e Association \e rule \e generation
00087 
00088 Given a frequent itemset \f$ A \f$ generated from the Apriori algorithm, and all subsets \f$ B \f$ , we generate rules such that \f$ B \Rightarrow (A - B) \f$ meets minimum confidence requirements.
00089 
00090 @input
00091 
00092 The input data is expected to be of the following form:
00093 <pre>{TABLE|VIEW} <em>input_table</em> (
00094     <em>trans_id</em> INTEGER,
00095     <em>product</em> TEXT
00096 )</pre>
00097 
00098 The algorithm will map the product names to consective integer ids starting at 1. If they are already structured this way, then the ids will not change.
00099 
00100 @usage
00101 - Association rules can be called by:
00102   <pre>SELECT \ref assoc_rules(
00103     <em>support</em>, <em>confidence</em>,'<em>tid_col</em>','<em>item_col</em>',
00104     '<em>input_table</em>','<em>output_schema</em>', <em> verbose </em>
00105     );</pre>
00106   This will generate all association rules that meet a minimum support of <em>support</em> and confidence of <em>confidence</em>.
00107 
00108 - The results containing the rules, support, confidence, lift, and conviction are stored in the table assoc_rules in the schema specified by <em>output_schema</em>.
00109 <pre>
00110     Table "output_schema.assoc_rules"
00111       Column   |       Type       | Modifiers 
00112    ------------+------------------+-----------
00113     ruleid     | integer          | 
00114     pre        | text[]           | 
00115     post       | text[]           | 
00116     support    | double precision | 
00117     confidence | double precision | 
00118     lift       | double precision | 
00119     conviction | double precision | 
00120     Distributed by: (ruleid)
00121 
00122 </pre>
00123     The \c pre and \c post are the itemsets of left and right hand sides of the association rule respectively. The \c support, \c confidence, \c lift, and \c conviction columns are calculated as mentioned in the about section.
00124 
00125 @implementation
00126 
00127 The association rules function will always create a table named assoc_rules. Please make a copy of this table before running the function again if you would like to keep multiple association rule tables.
00128 
00129 @examp
00130 
00131 Let us take a look at some sample transactional data and generate association rules:
00132 
00133 \code
00134 DROP TABLE IF EXISTS test_data;
00135 CREATE TABLE test_data (
00136     trans_id INT
00137     , product text
00138 );
00139 
00140 INSERT INTO test_data VALUES (1, 'beer');
00141 INSERT INTO test_data VALUES (1, 'diapers');
00142 INSERT INTO test_data VALUES (1, 'chips');
00143 INSERT INTO test_data VALUES (2, 'beer');
00144 INSERT INTO test_data VALUES (2, 'diapers');
00145 INSERT INTO test_data VALUES (3, 'beer');
00146 INSERT INTO test_data VALUES (3, 'diapers');
00147 INSERT INTO test_data VALUES (4, 'beer');
00148 INSERT INTO test_data VALUES (4, 'chips');
00149 INSERT INTO test_data VALUES (5, 'beer');
00150 INSERT INTO test_data VALUES (6, 'beer');
00151 INSERT INTO test_data VALUES (6, 'diapers');
00152 INSERT INTO test_data VALUES (6, 'chips');
00153 INSERT INTO test_data VALUES (7, 'beer');
00154 INSERT INTO test_data VALUES (7, 'diapers');
00155 
00156 \endcode
00157 
00158 Let \f$ min(support) = .25 \f$ and \f$ min(confidence) = .5 \f$, and the output schema be 'myschema'. For this example, we will set verbose to 'true' so that we have some insight into the progress of the function. We can now generate association rules as follows:
00159 
00160 \code
00161 
00162 SELECT * FROM MADLIB_SCHEMA.assoc_rules (.25, .5, 'trans_id', 'product', 'test_data','myschema', false);
00163 
00164 \endcode
00165 
00166 This should generate this output:
00167 
00168 \code
00169 
00170  output_schema | output_table | total_rules | total_time      
00171 ---------------+--------------+-------------+-----------------
00172  myschema      | assoc_rules  |           7 | 00:00:03.162094
00173 (1 row)
00174 
00175 
00176 \endcode
00177 
00178 The association rules are stored in the myschema.assoc_rules:
00179 
00180 \code
00181 select * from myschema.assoc_rules order by support desc;
00182  ruleid |       pre       |      post      |      support      |    confidence     |       lift        |    conviction     
00183 --------+-----------------+----------------+-------------------+-------------------+-------------------+-------------------
00184       4 | {diapers}       | {beer}         | 0.714285714285714 |                 1 |                 1 |                 0
00185       2 | {beer}          | {diapers}      | 0.714285714285714 | 0.714285714285714 |                 1 |                 1
00186       1 | {chips}         | {beer}         | 0.428571428571429 |                 1 |                 1 |                 0
00187       5 | {chips}         | {beer,diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
00188       6 | {chips,beer}    | {diapers}      | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
00189       7 | {chips,diapers} | {beer}         | 0.285714285714286 |                 1 |                 1 |                 0
00190       3 | {chips}         | {diapers}      | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
00191 (7 rows)
00192 
00193 \endcode
00194 
00195 @sa File assoc_rules.sql_in documenting the SQL function.
00196 
00197 */
00198 
00199 /*
00200  * @brief The result data type for the association rule API
00201  *
00202  * output_schema the name of the output schema.
00203  * output_table the name of the output table.
00204  * total_rules the number of total rules.
00205  * total_time the running time.
00206  */
00207 CREATE TYPE MADLIB_SCHEMA.assoc_rules_results AS
00208     (
00209     output_schema TEXT,
00210     output_table TEXT,
00211     total_rules INT,
00212     total_time INTERVAL
00213 );
00214 
00215 
00216 /*
00217  * @brief Given the text form of a closed frequent pattern (cfp), this function
00218  * generates the association rules for that pattern. We use text format
00219  * because text values are hash joinable. The output is a set of text
00220  * array. For example, assuming the input pattern is '1,2,3'.
00221  * The result rules:
00222  * array['1', '2,3']
00223  * array['2', '1,3']
00224  * array['3', '1,2']
00225  * array['1,2', '3']
00226  * array['1,3', '2']
00227  * array['2,3', '1']
00228  * Note that two meaningless rules will be excluded:
00229  * array['1,2,3', NULL]
00230  * array[NULL, '1,2,3']
00231  *
00232  * @param arg 1 The text form of a closed frequent pattern.
00233  * @param arg 2 The number of items in the pattern.
00234  *
00235  * @return A set of text array. Each array has two elements, corresponding to
00236  * the left and right parts of an association rule.
00237  *
00238  */
00239 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gen_rules_from_cfp
00240     (
00241     TEXT,
00242     INT
00243     )
00244 RETURNS SETOF TEXT[] AS 'MODULE_PATHNAME'
00245 LANGUAGE C STRICT IMMUTABLE;
00246 
00247 
00248 /**
00249  *
00250  * @param support minimum level of support needed for each itemset to
00251  * be included in result
00252  * @param confidence minimum level of confidence needed for each rule to
00253  * be included in result
00254  * @param tid_col name of the column storing the transaction ids
00255  * @param item_col name of the column storing the products
00256  * @param input_table name of the table where the data is stored
00257  * @param output_schema name of the schema where the final results will be stored
00258  * @param verbose determining if output contains comments
00259  *
00260  * @returns The schema and table name containing association rules,
00261  * and total number of rules found.
00262  *
00263  * This function computes the association rules between products in a data set.
00264  * It reads the name of the table, the column names of the product and ids, and
00265  * computes ssociation rules using the Apriori algorithm, and subject to the
00266  * support and confidence constraints as input by the user. This version of
00267  * association rules has verbose functionality. When verbose is true, output of
00268  * function includes iteration steps and comments on Apriori algorithm steps.
00269  *
00270  */
00271 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.assoc_rules
00272     (
00273     support FLOAT8,
00274     confidence FLOAT8,
00275     tid_col TEXT,
00276     item_col TEXT,
00277     input_table TEXT,
00278     output_schema TEXT,
00279     verbose BOOLEAN
00280    )
00281 RETURNS MADLIB_SCHEMA.assoc_rules_results
00282 AS $$
00283 
00284     PythonFunctionBodyOnly(`assoc_rules', `assoc_rules')
00285 
00286     plpy.execute("SET client_min_messages = error;")
00287 
00288     # schema_madlib comes from PythonFunctionBodyOnly
00289     return assoc_rules.assoc_rules(
00290         schema_madlib,
00291         support,
00292         confidence,
00293         tid_col,
00294         item_col,
00295         input_table,
00296         output_schema,
00297         verbose
00298         );
00299 
00300 $$ LANGUAGE plpythonu;
00301 
00302 
00303 /**
00304  *
00305  * @brief The short form of the above function with vobose removed.
00306  *
00307  */
00308 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.assoc_rules
00309     (
00310     support FLOAT8,
00311     confidence FLOAT8,
00312     tid_col TEXT,
00313     item_col TEXT,
00314     input_table TEXT,
00315     output_schema TEXT
00316     )
00317 RETURNS MADLIB_SCHEMA.assoc_rules_results
00318 AS $$
00319 
00320     PythonFunctionBodyOnly(`assoc_rules', `assoc_rules')
00321 
00322     plpy.execute("SET client_min_messages = error;")
00323 
00324     # schema_madlib comes from PythonFunctionBodyOnly
00325     return assoc_rules.assoc_rules(
00326         schema_madlib,
00327         support,
00328         confidence,
00329         tid_col,
00330         item_col,
00331         input_table,
00332         output_schema,
00333         False
00334         );
00335 
00336 $$ LANGUAGE plpythonu;