MADlib
0.7 A newer version is available
User Documentation
|
00001 /* ----------------------------------------------------------------------- *//** 00002 * 00003 * @file assoc_rules.sql_in 00004 * 00005 * @brief The \ref assoc_rules function computes association rules for a given set of data. The data is assumed to have two dimensions; items (between which we are trying to discover associations), and a transaction id. This tranaction id groups the items by event and could also be a user id, date, etc. depending on the context of the data. This function assumes the data is stored in two columns with one transaction id and one item per row. 00006 00007 * @date June 2011 00008 * @modified August 2012 00009 * 00010 * @sa For a brief introduction to the association rules implementation, see the module 00011 * description \ref grp_assoc_rules. 00012 * 00013 *//* ----------------------------------------------------------------------- */ 00014 00015 m4_include(`SQLCommon.m4') 00016 00017 00018 /** 00019 @addtogroup grp_assoc_rules 00020 00021 @about 00022 This module implements the association rules data mining technique on a transactional data set. Given the names of a table and the columns, minimum support and confidence values, this function generates all single and multidimensional association rules that meet the minimum thresholds. 00023 00024 Association rule mining is a widely used technique for discovering relationships between variables in a large data set (e.g items in a store that are commonly purchased together). The classic market basket analysis example using association rules is the "beer and diapers" rule. According to data mining urban legend, a study of customers' purchase behavior in a supermarket found that men often purchased beer and diapers together. After making this discovery, the managers strategically placed beer and diapers closer together on the shelves and saw a dramatic increase in sales. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and several other fields. 00025 00026 This type of data mining algorithm uses transactional data. Every transaction event has a unique identification, and each transaction consists of a set of items (or itemset). Purchases are considered binary (either it was purchased or not), and this implementation does not take into consideration the quantity of each item. For the MADlib association rules function, it is assumed that the data is stored in two columns with one item and transaction id per row. Transactions with multiple items will span multiple rows with one row per item. 00027 00028 <pre> 00029 tran_id | product 00030 ---------+--------- 00031 1 | 1 00032 1 | 2 00033 1 | 3 00034 1 | 4 00035 2 | 3 00036 2 | 4 00037 2 | 5 00038 3 | 1 00039 3 | 4 00040 3 | 6 00041 ... 00042 </pre> 00043 00044 \b Rules 00045 00046 Association rules take the form "If X, then Y", where X and Y are non-empty itemsets. X and Y are called the antecedent and consequent, or the left-hand-side and right-hand-side, of the rule respectively. Using our previous example, the association rule may state "If {diapers}, then {beer}" with .2 support and .85 confidence. 00047 00048 Given any association rule "If X, then Y", the association rules function will also calculate the following metrics: 00049 - Support: The ratio of transactions that contain X to all transactions, T 00050 \f[ 00051 S (X) = \frac{Total X}{Total transactions} 00052 \f] 00053 00054 - Confidence: The ratio of transactions that contain \f$ X,Y \f$ to transactions that contain \f$ X \f$. One could view this metric as the conditional probability of \f$ Y \f$ , given \f$ X \f$ . \f$ P(Y|X) \f$ 00055 00056 \f[ 00057 C (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X)} 00058 \f] 00059 00060 - Lift: The ratio of observed support of \f$ X,Y \f$ to the expected support of \f$ X,Y \f$ , assuming \f$ X \f$ and \f$ Y \f$ are independent. 00061 \f[ 00062 L (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X) \cdot s(Y)} 00063 \f] 00064 00065 - Conviction: The ratio of expected support of \f$ X \f$ occurring without\f$ Y \f$ assuming \f$ X \f$ and \f$ \neg Y \f$ are independent, to the observed support of \f$ X \f$ occuring without \f$ Y \f$. If conviction is greater than 1, then this metric shows that incorrect predictions ( \f$ X \Rightarrow Y \f$ ) occur less often than if these two actions were independent. This metric can be viewed as the ratio that the association rule would be incorrect if the actions were independent (i.e. a conviction of 1.5 indicates that if the variables were independent, this rule would be incorrect 50% more often.) 00066 00067 \f[ 00068 Conv (X \Rightarrow Y) = \frac{1 - S(Y)}{1 - C(X \Rightarrow Y)} 00069 \f] 00070 00071 00072 \b Apriori \b algorithm 00073 00074 Although there are many algorithms that generate association rules, the classic algorithm used is called Apriori (which we implemented in this module). It is a breadth-first search, as opposed to depth-first searches like eclat. Frequent itemsets of order \f$ n \f$ are generated from sets of order \f$ n - 1 \f$. Using the downward closure property, all sets must have frequent subsets. There are two steps in this algorithm; generating frequent itemsets, and using these itemsets to construct the association rules. A simplified version of the algorithm is as follows, and assumes a minimum level of support and confidence is provided: 00075 00076 \e Initial \e step 00077 -# Generate all itemsets of order 1 00078 -# Eliminate itemsets that have support is less than minimum support 00079 00080 \e Main \e algorithm 00081 -# For \f$ n \ge 2 \f$, generate itemsets of order \f$ n \f$ by combining the itemsets of order \f$ n - 1 \f$. This is done by doing the union of two itemsets that have identical items except one. 00082 -# Eliminate itemsets that have (n-1) order subsets with insufficient support 00083 -# Eliminate itemsets with insufficient support 00084 -# Repeat until itemsets cannot be generated 00085 00086 \e Association \e rule \e generation 00087 00088 Given a frequent itemset \f$ A \f$ generated from the Apriori algorithm, and all subsets \f$ B \f$ , we generate rules such that \f$ B \Rightarrow (A - B) \f$ meets minimum confidence requirements. 00089 00090 @input 00091 00092 The input data is expected to be of the following form: 00093 <pre>{TABLE|VIEW} <em>input_table</em> ( 00094 <em>trans_id</em> INTEGER, 00095 <em>product</em> TEXT 00096 )</pre> 00097 00098 The algorithm will map the product names to consective integer ids starting at 1. If they are already structured this way, then the ids will not change. 00099 00100 @usage 00101 - Association rules can be called by: 00102 <pre>SELECT \ref assoc_rules( 00103 <em>support</em>, <em>confidence</em>,'<em>tid_col</em>','<em>item_col</em>', 00104 '<em>input_table</em>','<em>output_schema</em>', <em> verbose </em> 00105 );</pre> 00106 This will generate all association rules that meet a minimum support of <em>support</em> and confidence of <em>confidence</em>. 00107 00108 - The results containing the rules, support, confidence, lift, and conviction are stored in the table assoc_rules in the schema specified by <em>output_schema</em>. 00109 <pre> 00110 Table "output_schema.assoc_rules" 00111 Column | Type | Modifiers 00112 ------------+------------------+----------- 00113 ruleid | integer | 00114 pre | text[] | 00115 post | text[] | 00116 support | double precision | 00117 confidence | double precision | 00118 lift | double precision | 00119 conviction | double precision | 00120 Distributed by: (ruleid) 00121 00122 </pre> 00123 The \c pre and \c post are the itemsets of left and right hand sides of the association rule respectively. The \c support, \c confidence, \c lift, and \c conviction columns are calculated as mentioned in the about section. 00124 00125 @implementation 00126 00127 The association rules function will always create a table named assoc_rules. Please make a copy of this table before running the function again if you would like to keep multiple association rule tables. 00128 00129 @examp 00130 00131 Let us take a look at some sample transactional data and generate association rules: 00132 00133 \code 00134 DROP TABLE IF EXISTS test_data; 00135 CREATE TABLE test_data ( 00136 trans_id INT 00137 , product text 00138 ); 00139 00140 INSERT INTO test_data VALUES (1, 'beer'); 00141 INSERT INTO test_data VALUES (1, 'diapers'); 00142 INSERT INTO test_data VALUES (1, 'chips'); 00143 INSERT INTO test_data VALUES (2, 'beer'); 00144 INSERT INTO test_data VALUES (2, 'diapers'); 00145 INSERT INTO test_data VALUES (3, 'beer'); 00146 INSERT INTO test_data VALUES (3, 'diapers'); 00147 INSERT INTO test_data VALUES (4, 'beer'); 00148 INSERT INTO test_data VALUES (4, 'chips'); 00149 INSERT INTO test_data VALUES (5, 'beer'); 00150 INSERT INTO test_data VALUES (6, 'beer'); 00151 INSERT INTO test_data VALUES (6, 'diapers'); 00152 INSERT INTO test_data VALUES (6, 'chips'); 00153 INSERT INTO test_data VALUES (7, 'beer'); 00154 INSERT INTO test_data VALUES (7, 'diapers'); 00155 00156 \endcode 00157 00158 Let \f$ min(support) = .25 \f$ and \f$ min(confidence) = .5 \f$, and the output schema be 'myschema'. For this example, we will set verbose to 'true' so that we have some insight into the progress of the function. We can now generate association rules as follows: 00159 00160 \code 00161 00162 SELECT * FROM MADLIB_SCHEMA.assoc_rules (.25, .5, 'trans_id', 'product', 'test_data','myschema', false); 00163 00164 \endcode 00165 00166 This should generate this output: 00167 00168 \code 00169 00170 output_schema | output_table | total_rules | total_time 00171 ---------------+--------------+-------------+----------------- 00172 myschema | assoc_rules | 7 | 00:00:03.162094 00173 (1 row) 00174 00175 00176 \endcode 00177 00178 The association rules are stored in the myschema.assoc_rules: 00179 00180 \code 00181 select * from myschema.assoc_rules order by support desc; 00182 ruleid | pre | post | support | confidence | lift | conviction 00183 --------+-----------------+----------------+-------------------+-------------------+-------------------+------------------- 00184 4 | {diapers} | {beer} | 0.714285714285714 | 1 | 1 | 0 00185 2 | {beer} | {diapers} | 0.714285714285714 | 0.714285714285714 | 1 | 1 00186 1 | {chips} | {beer} | 0.428571428571429 | 1 | 1 | 0 00187 5 | {chips} | {beer,diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857 00188 6 | {chips,beer} | {diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857 00189 7 | {chips,diapers} | {beer} | 0.285714285714286 | 1 | 1 | 0 00190 3 | {chips} | {diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857 00191 (7 rows) 00192 00193 \endcode 00194 00195 @sa File assoc_rules.sql_in documenting the SQL function. 00196 00197 */ 00198 00199 /* 00200 * @brief The result data type for the association rule API 00201 * 00202 * output_schema the name of the output schema. 00203 * output_table the name of the output table. 00204 * total_rules the number of total rules. 00205 * total_time the running time. 00206 */ 00207 CREATE TYPE MADLIB_SCHEMA.assoc_rules_results AS 00208 ( 00209 output_schema TEXT, 00210 output_table TEXT, 00211 total_rules INT, 00212 total_time INTERVAL 00213 ); 00214 00215 00216 /* 00217 * @brief Given the text form of a closed frequent pattern (cfp), this function 00218 * generates the association rules for that pattern. We use text format 00219 * because text values are hash joinable. The output is a set of text 00220 * array. For example, assuming the input pattern is '1,2,3'. 00221 * The result rules: 00222 * array['1', '2,3'] 00223 * array['2', '1,3'] 00224 * array['3', '1,2'] 00225 * array['1,2', '3'] 00226 * array['1,3', '2'] 00227 * array['2,3', '1'] 00228 * Note that two meaningless rules will be excluded: 00229 * array['1,2,3', NULL] 00230 * array[NULL, '1,2,3'] 00231 * 00232 * @param arg 1 The text form of a closed frequent pattern. 00233 * @param arg 2 The number of items in the pattern. 00234 * 00235 * @return A set of text array. Each array has two elements, corresponding to 00236 * the left and right parts of an association rule. 00237 * 00238 */ 00239 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gen_rules_from_cfp 00240 ( 00241 TEXT, 00242 INT 00243 ) 00244 RETURNS SETOF TEXT[] AS 'MODULE_PATHNAME' 00245 LANGUAGE C STRICT IMMUTABLE; 00246 00247 00248 /** 00249 * 00250 * @param support minimum level of support needed for each itemset to 00251 * be included in result 00252 * @param confidence minimum level of confidence needed for each rule to 00253 * be included in result 00254 * @param tid_col name of the column storing the transaction ids 00255 * @param item_col name of the column storing the products 00256 * @param input_table name of the table where the data is stored 00257 * @param output_schema name of the schema where the final results will be stored 00258 * @param verbose determining if output contains comments 00259 * 00260 * @returns The schema and table name containing association rules, 00261 * and total number of rules found. 00262 * 00263 * This function computes the association rules between products in a data set. 00264 * It reads the name of the table, the column names of the product and ids, and 00265 * computes ssociation rules using the Apriori algorithm, and subject to the 00266 * support and confidence constraints as input by the user. This version of 00267 * association rules has verbose functionality. When verbose is true, output of 00268 * function includes iteration steps and comments on Apriori algorithm steps. 00269 * 00270 */ 00271 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.assoc_rules 00272 ( 00273 support FLOAT8, 00274 confidence FLOAT8, 00275 tid_col TEXT, 00276 item_col TEXT, 00277 input_table TEXT, 00278 output_schema TEXT, 00279 verbose BOOLEAN 00280 ) 00281 RETURNS MADLIB_SCHEMA.assoc_rules_results 00282 AS $$ 00283 00284 PythonFunctionBodyOnly(`assoc_rules', `assoc_rules') 00285 00286 plpy.execute("SET client_min_messages = error;") 00287 00288 # schema_madlib comes from PythonFunctionBodyOnly 00289 return assoc_rules.assoc_rules( 00290 schema_madlib, 00291 support, 00292 confidence, 00293 tid_col, 00294 item_col, 00295 input_table, 00296 output_schema, 00297 verbose 00298 ); 00299 00300 $$ LANGUAGE plpythonu; 00301 00302 00303 /** 00304 * 00305 * @brief The short form of the above function with vobose removed. 00306 * 00307 */ 00308 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.assoc_rules 00309 ( 00310 support FLOAT8, 00311 confidence FLOAT8, 00312 tid_col TEXT, 00313 item_col TEXT, 00314 input_table TEXT, 00315 output_schema TEXT 00316 ) 00317 RETURNS MADLIB_SCHEMA.assoc_rules_results 00318 AS $$ 00319 00320 PythonFunctionBodyOnly(`assoc_rules', `assoc_rules') 00321 00322 plpy.execute("SET client_min_messages = error;") 00323 00324 # schema_madlib comes from PythonFunctionBodyOnly 00325 return assoc_rules.assoc_rules( 00326 schema_madlib, 00327 support, 00328 confidence, 00329 tid_col, 00330 item_col, 00331 input_table, 00332 output_schema, 00333 False 00334 ); 00335 00336 $$ LANGUAGE plpythonu;