User Documentation
 All Files Functions Groups
assoc_rules.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file assoc_rules.sql_in
4  *
5  * @brief The \ref assoc_rules function computes association rules for a given set of data. The data is assumed to have two dimensions; items (between which we are trying to discover associations), and a transaction id. This tranaction id groups the items by event and could also be a user id, date, etc. depending on the context of the data. This function assumes the data is stored in two columns with one transaction id and one item per row.
6 
7  * @date June 2011
8  * @modified August 2012
9  *
10  * @sa For a brief introduction to the association rules implementation, see the module
11  * description \ref grp_assoc_rules.
12  *
13  *//* ----------------------------------------------------------------------- */
14 
15 m4_include(`SQLCommon.m4')
16 
17 
18 /**
19 @addtogroup grp_assoc_rules
20 
21 @about
22 This module implements the association rules data mining technique on a transactional data set. Given the names of a table and the columns, minimum support and confidence values, this function generates all single and multidimensional association rules that meet the minimum thresholds.
23 
24 Association rule mining is a widely used technique for discovering relationships between variables in a large data set (e.g items in a store that are commonly purchased together). The classic market basket analysis example using association rules is the "beer and diapers" rule. According to data mining urban legend, a study of customers' purchase behavior in a supermarket found that men often purchased beer and diapers together. After making this discovery, the managers strategically placed beer and diapers closer together on the shelves and saw a dramatic increase in sales. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and several other fields.
25 
26 This type of data mining algorithm uses transactional data. Every transaction event has a unique identification, and each transaction consists of a set of items (or itemset). Purchases are considered binary (either it was purchased or not), and this implementation does not take into consideration the quantity of each item. For the MADlib association rules function, it is assumed that the data is stored in two columns with one item and transaction id per row. Transactions with multiple items will span multiple rows with one row per item.
27 
28 <pre>
29  tran_id | product
30  ---------+---------
31  1 | 1
32  1 | 2
33  1 | 3
34  1 | 4
35  2 | 3
36  2 | 4
37  2 | 5
38  3 | 1
39  3 | 4
40  3 | 6
41  ...
42 </pre>
43 
44 \b Rules
45 
46 Association rules take the form "If X, then Y", where X and Y are non-empty itemsets. X and Y are called the antecedent and consequent, or the left-hand-side and right-hand-side, of the rule respectively. Using our previous example, the association rule may state "If {diapers}, then {beer}" with .2 support and .85 confidence.
47 
48 Given any association rule "If X, then Y", the association rules function will also calculate the following metrics:
49 - Support: The ratio of transactions that contain X to all transactions, T
50 \f[
51 S (X) = \frac{Total X}{Total transactions}
52 \f]
53 
54 - Confidence: The ratio of transactions that contain \f$ X,Y \f$ to transactions that contain \f$ X \f$. One could view this metric as the conditional probability of \f$ Y \f$ , given \f$ X \f$ . \f$ P(Y|X) \f$
55 
56 \f[
57 C (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X)}
58 \f]
59 
60 - Lift: The ratio of observed support of \f$ X,Y \f$ to the expected support of \f$ X,Y \f$ , assuming \f$ X \f$ and \f$ Y \f$ are independent.
61 \f[
62 L (X \Rightarrow Y) = \frac{s(X \cap Y )}{s(X) \cdot s(Y)}
63 \f]
64 
65 - Conviction: The ratio of expected support of \f$ X \f$ occurring without\f$ Y \f$ assuming \f$ X \f$ and \f$ \neg Y \f$ are independent, to the observed support of \f$ X \f$ occuring without \f$ Y \f$. If conviction is greater than 1, then this metric shows that incorrect predictions ( \f$ X \Rightarrow Y \f$ ) occur less often than if these two actions were independent. This metric can be viewed as the ratio that the association rule would be incorrect if the actions were independent (i.e. a conviction of 1.5 indicates that if the variables were independent, this rule would be incorrect 50% more often.)
66 
67 \f[
68 Conv (X \Rightarrow Y) = \frac{1 - S(Y)}{1 - C(X \Rightarrow Y)}
69 \f]
70 
71 
72 \b Apriori \b algorithm
73 
74 Although there are many algorithms that generate association rules, the classic algorithm used is called Apriori (which we implemented in this module). It is a breadth-first search, as opposed to depth-first searches like eclat. Frequent itemsets of order \f$ n \f$ are generated from sets of order \f$ n - 1 \f$. Using the downward closure property, all sets must have frequent subsets. There are two steps in this algorithm; generating frequent itemsets, and using these itemsets to construct the association rules. A simplified version of the algorithm is as follows, and assumes a minimum level of support and confidence is provided:
75 
76 \e Initial \e step
77 -# Generate all itemsets of order 1
78 -# Eliminate itemsets that have support is less than minimum support
79 
80 \e Main \e algorithm
81 -# For \f$ n \ge 2 \f$, generate itemsets of order \f$ n \f$ by combining the itemsets of order \f$ n - 1 \f$. This is done by doing the union of two itemsets that have identical items except one.
82 -# Eliminate itemsets that have (n-1) order subsets with insufficient support
83 -# Eliminate itemsets with insufficient support
84 -# Repeat until itemsets cannot be generated
85 
86 \e Association \e rule \e generation
87 
88 Given a frequent itemset \f$ A \f$ generated from the Apriori algorithm, and all subsets \f$ B \f$ , we generate rules such that \f$ B \Rightarrow (A - B) \f$ meets minimum confidence requirements.
89 
90 @input
91 
92 The input data is expected to be of the following form:
93 <pre>{TABLE|VIEW} <em>input_table</em> (
94  <em>trans_id</em> INTEGER,
95  <em>product</em> TEXT
96 )</pre>
97 
98 The algorithm will map the product names to consective integer ids starting at 1. If they are already structured this way, then the ids will not change.
99 
100 @usage
101 - Association rules can be called by:
102  <pre>SELECT \ref assoc_rules(
103  <em>support</em>, <em>confidence</em>,'<em>tid_col</em>','<em>item_col</em>',
104  '<em>input_table</em>','<em>output_schema</em>', <em> verbose </em>
105  );</pre>
106  This will generate all association rules that meet a minimum support of <em>support</em> and confidence of <em>confidence</em>.
107 
108 - The results containing the rules, support, confidence, lift, and conviction are stored in the table assoc_rules in the schema specified by <em>output_schema</em>.
109 <pre>
110  Table "output_schema.assoc_rules"
111  Column | Type | Modifiers
112  ------------+------------------+-----------
113  ruleid | integer |
114  pre | text[] |
115  post | text[] |
116  support | double precision |
117  confidence | double precision |
118  lift | double precision |
119  conviction | double precision |
120  Distributed by: (ruleid)
121 
122 </pre>
123  The \c pre and \c post are the itemsets of left and right hand sides of the association rule respectively. The \c support, \c confidence, \c lift, and \c conviction columns are calculated as mentioned in the about section.
124 
125 @implementation
126 
127 The association rules function will always create a table named assoc_rules. Please make a copy of this table before running the function again if you would like to keep multiple association rule tables.
128 
129 @examp
130 
131 Let us take a look at some sample transactional data and generate association rules:
132 
133 \code
134 DROP TABLE IF EXISTS test_data;
135 CREATE TABLE test_data (
136  trans_id INT
137  , product text
138 );
139 
140 INSERT INTO test_data VALUES (1, 'beer');
141 INSERT INTO test_data VALUES (1, 'diapers');
142 INSERT INTO test_data VALUES (1, 'chips');
143 INSERT INTO test_data VALUES (2, 'beer');
144 INSERT INTO test_data VALUES (2, 'diapers');
145 INSERT INTO test_data VALUES (3, 'beer');
146 INSERT INTO test_data VALUES (3, 'diapers');
147 INSERT INTO test_data VALUES (4, 'beer');
148 INSERT INTO test_data VALUES (4, 'chips');
149 INSERT INTO test_data VALUES (5, 'beer');
150 INSERT INTO test_data VALUES (6, 'beer');
151 INSERT INTO test_data VALUES (6, 'diapers');
152 INSERT INTO test_data VALUES (6, 'chips');
153 INSERT INTO test_data VALUES (7, 'beer');
154 INSERT INTO test_data VALUES (7, 'diapers');
155 
156 \endcode
157 
158 Let \f$ min(support) = .25 \f$ and \f$ min(confidence) = .5 \f$, and the output schema be 'myschema'. For this example, we will set verbose to 'true' so that we have some insight into the progress of the function. We can now generate association rules as follows:
159 
160 \code
161 
162 SELECT * FROM MADLIB_SCHEMA.assoc_rules (.25, .5, 'trans_id', 'product', 'test_data','myschema', false);
163 
164 \endcode
165 
166 This should generate this output:
167 
168 \code
169 
170  output_schema | output_table | total_rules | total_time
171 ---------------+--------------+-------------+-----------------
172  myschema | assoc_rules | 7 | 00:00:03.162094
173 (1 row)
174 
175 
176 \endcode
177 
178 The association rules are stored in the myschema.assoc_rules:
179 
180 \code
181 select * from myschema.assoc_rules order by support desc;
182  ruleid | pre | post | support | confidence | lift | conviction
183 --------+-----------------+----------------+-------------------+-------------------+-------------------+-------------------
184  4 | {diapers} | {beer} | 0.714285714285714 | 1 | 1 | 0
185  2 | {beer} | {diapers} | 0.714285714285714 | 0.714285714285714 | 1 | 1
186  1 | {chips} | {beer} | 0.428571428571429 | 1 | 1 | 0
187  5 | {chips} | {beer,diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
188  6 | {chips,beer} | {diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
189  7 | {chips,diapers} | {beer} | 0.285714285714286 | 1 | 1 | 0
190  3 | {chips} | {diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
191 (7 rows)
192 
193 \endcode
194 
195 @sa File assoc_rules.sql_in documenting the SQL function.
196 
197 */
198 
199 /*
200  * @brief The result data type for the association rule API
201  *
202  * output_schema the name of the output schema.
203  * output_table the name of the output table.
204  * total_rules the number of total rules.
205  * total_time the running time.
206  */
207 CREATE TYPE MADLIB_SCHEMA.assoc_rules_results AS
208  (
209  output_schema TEXT,
210  output_table TEXT,
211  total_rules INT,
212  total_time INTERVAL
213 );
214 
215 
216 /*
217  * @brief Given the text form of a closed frequent pattern (cfp), this function
218  * generates the association rules for that pattern. We use text format
219  * because text values are hash joinable. The output is a set of text
220  * array. For example, assuming the input pattern is '1,2,3'.
221  * The result rules:
222  * array['1', '2,3']
223  * array['2', '1,3']
224  * array['3', '1,2']
225  * array['1,2', '3']
226  * array['1,3', '2']
227  * array['2,3', '1']
228  * Note that two meaningless rules will be excluded:
229  * array['1,2,3', NULL]
230  * array[NULL, '1,2,3']
231  *
232  * @param arg 1 The text form of a closed frequent pattern.
233  * @param arg 2 The number of items in the pattern.
234  *
235  * @return A set of text array. Each array has two elements, corresponding to
236  * the left and right parts of an association rule.
237  *
238  */
239 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gen_rules_from_cfp
240  (
241  TEXT,
242  INT
243  )
244 RETURNS SETOF TEXT[] AS 'MODULE_PATHNAME'
245 LANGUAGE C STRICT IMMUTABLE;
246 
247 
248 /**
249  *
250  * @param support minimum level of support needed for each itemset to
251  * be included in result
252  * @param confidence minimum level of confidence needed for each rule to
253  * be included in result
254  * @param tid_col name of the column storing the transaction ids
255  * @param item_col name of the column storing the products
256  * @param input_table name of the table where the data is stored
257  * @param output_schema name of the schema where the final results will be stored
258  * @param verbose determining if output contains comments
259  *
260  * @returns The schema and table name containing association rules,
261  * and total number of rules found.
262  *
263  * This function computes the association rules between products in a data set.
264  * It reads the name of the table, the column names of the product and ids, and
265  * computes ssociation rules using the Apriori algorithm, and subject to the
266  * support and confidence constraints as input by the user. This version of
267  * association rules has verbose functionality. When verbose is true, output of
268  * function includes iteration steps and comments on Apriori algorithm steps.
269  *
270  */
271 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.assoc_rules
272  (
273  support FLOAT8,
274  confidence FLOAT8,
275  tid_col TEXT,
276  item_col TEXT,
277  input_table TEXT,
278  output_schema TEXT,
279  verbose BOOLEAN
280  )
281 RETURNS MADLIB_SCHEMA.assoc_rules_results
282 AS $$
283 
284  PythonFunctionBodyOnly(`assoc_rules', `assoc_rules')
285 
286  plpy.execute("SET client_min_messages = error;")
287 
288  # schema_madlib comes from PythonFunctionBodyOnly
289  return assoc_rules.assoc_rules(
290  schema_madlib,
291  support,
292  confidence,
293  tid_col,
294  item_col,
295  input_table,
296  output_schema,
297  verbose
298  );
299 
300 $$ LANGUAGE plpythonu;
301 
302 
303 /**
304  *
305  * @brief The short form of the above function with vobose removed.
306  *
307  */
308 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.assoc_rules
309  (
310  support FLOAT8,
311  confidence FLOAT8,
312  tid_col TEXT,
313  item_col TEXT,
314  input_table TEXT,
315  output_schema TEXT
316  )
317 RETURNS MADLIB_SCHEMA.assoc_rules_results
318 AS $$
319 
320  PythonFunctionBodyOnly(`assoc_rules', `assoc_rules')
321 
322  plpy.execute("SET client_min_messages = error;")
323 
324  # schema_madlib comes from PythonFunctionBodyOnly
325  return assoc_rules.assoc_rules(
326  schema_madlib,
327  support,
328  confidence,
329  tid_col,
330  item_col,
331  input_table,
332  output_schema,
333  False
334  );
335 
336 $$ LANGUAGE plpythonu;