/* ----------------------------------------------------------------------- *//**
 *
 * @file lda.sql_in
 *
 * @brief SQL functions for Latent Dirichlet Allocation
 * @date Dec 2012
 *
 * @sa For an introduction to Latent Dirichlet Allocation models, see the
 *     module description \ref grp_lda.
 *
 *//* ------------------------------------------------------------------------*/

m4_include(`SQLCommon.m4')

/**

@addtogroup grp_lda

@about

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for
natural-language text that has received considerable attention in recent
years. The model is quite versatile, having found uses in problems like
automated topic discovery, collaborative filtering, and document
classification.

The LDA model posits that each document is associated with a mixture of
various topics (e.g. a document is related to Topic 1 with probability 0.7 and
to Topic 2 with probability 0.3), and that each word in the document is
attributable to one of the document's topics. There is a (symmetric) Dirichlet
prior with parameter \f$ \alpha \f$ on each document's topic mixture, and
another (symmetric) Dirichlet prior with parameter \f$ \beta \f$ on each
topic's word distribution.

The following generative process then defines a distribution over a corpus of
documents.

- For each topic \f$ i \f$, sample a per-topic word distribution
  \f$ \phi_i \f$ from the Dirichlet(\f$\beta\f$) prior.

- For each document:
  - Sample a document length N from a suitable distribution, say, Poisson.
  - Sample a topic mixture \f$ \theta \f$ for the document from the
    Dirichlet(\f$\alpha\f$) distribution.
  - For each of the N words:
    - Sample a topic \f$ z_n \f$ from the multinomial topic distribution
      \f$ \theta \f$.
    - Sample a word \f$ w_n \f$ from the multinomial word distribution
      \f$ \phi_{z_n} \f$ associated with topic \f$ z_n \f$.

In practice, only the words in each document are observable. The topic mixture
of each document and the topic of each word are latent variables that need to
be inferred from the observables; this is the inference problem for LDA. Exact
inference is intractable, but several approximate inference algorithms for LDA
have been developed. The simple and effective Gibbs sampling algorithm
described in Griffiths and Steyvers [2] appears to be the current algorithm of
choice.
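
In the collapsed Gibbs sampler of [2], \f$ \theta \f$ and \f$ \phi \f$ are
integrated out, and the topic of each word occurrence is resampled from its
conditional distribution given all other assignments:

\f[ P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto
    \frac{n^{(w_i)}_{-i,k} + \beta}{n^{(\cdot)}_{-i,k} + W\beta} \,
    \frac{n^{(d_i)}_{-i,k} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha} \f]

where \f$ n^{(w_i)}_{-i,k} \f$ is the number of times word \f$ w_i \f$ is
assigned to topic \f$ k \f$ (excluding position \f$ i \f$),
\f$ n^{(\cdot)}_{-i,k} \f$ is the total count for topic \f$ k \f$,
\f$ n^{(d_i)}_{-i,k} \f$ is the count of topic \f$ k \f$ in document
\f$ d_i \f$, \f$ W \f$ is the vocabulary size (\c voc_size), and \f$ T \f$ is
the number of topics (\c topic_num).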

This implementation provides a parallel and scalable in-database solution for
LDA based on Gibbs sampling. Unlike implementations based on MPI or Hadoop
Map/Reduce, this implementation builds upon shared-nothing MPP databases and
enables high-performance in-database analytics.

@input
The \b corpus/dataset to be analyzed is expected to be of the following form:
<pre>{TABLE|VIEW} <em>data_table</em> (
    <em>docid</em> INTEGER,
    <em>wordid</em> INTEGER,
    <em>count</em> INTEGER
)</pre>
where \c docid refers to the document ID, \c wordid is the word ID (the index
of a word in the vocabulary), and \c count is the number of occurrences of the
word in the document.

The \b vocabulary/dictionary that indexes all the words found in the corpus is
of the following form:
<pre>{TABLE|VIEW} <em>vocab_table</em> (
    <em>wordid</em> INTEGER,
    <em>word</em> TEXT
)</pre>
where \c wordid refers to the word ID (the index of a word in the vocabulary)
and \c word is the actual word.

@usage
- The training (i.e. topic inference) can be done with the following function:
    <pre>
    SELECT \ref lda_train(
        <em>'data_table'</em>,
        <em>'model_table'</em>,
        <em>'output_data_table'</em>,
        <em>voc_size</em>,
        <em>topic_num</em>,
        <em>iter_num</em>,
        <em>alpha</em>,
        <em>beta</em>)
    </pre>

  This function stores the resulting model in <tt><em>model_table</em></tt>.
  The table has only 1 row and is in the following form:
    <pre>{TABLE} <em>model_table</em> (
        <em>voc_size</em> INTEGER,
        <em>topic_num</em> INTEGER,
        <em>alpha</em> FLOAT,
        <em>beta</em> FLOAT,
        <em>model</em> INTEGER[][])
    </pre>

  This function also stores the topic counts and the topic assignments in
  each document in <tt><em>output_data_table</em></tt>. The table is in the
  following form:
    <pre>{TABLE} <em>output_data_table</em> (
        <em>docid</em> INTEGER,
        <em>wordcount</em> INTEGER,
        <em>words</em> INTEGER[],
        <em>counts</em> INTEGER[],
        <em>topic_count</em> INTEGER[],
        <em>topic_assignment</em> INTEGER[])
    </pre>

- The prediction (i.e. labelling of test documents using a learned LDA model)
  can be done with the following function:
    <pre>
    SELECT \ref lda_predict(
        <em>'data_table'</em>,
        <em>'model_table'</em>,
        <em>'output_table'</em>);
    </pre>

  This function stores the prediction results in
  <tt><em>output_table</em></tt>. Each row in the table stores the topic
  distribution and the topic assignments for a document in the dataset. The
  table is in the following form:
    <pre>{TABLE} <em>output_table</em> (
        <em>docid</em> INTEGER,
        <em>wordcount</em> INTEGER,
        <em>words</em> INTEGER[],
        <em>counts</em> INTEGER[],
        <em>topic_count</em> INTEGER[],
        <em>topic_assignment</em> INTEGER[])
    </pre>

- This module also provides a function for computing the perplexity:
    <pre>
    SELECT \ref lda_get_perplexity(
        <em>'model_table'</em>,
        <em>'output_data_table'</em>);
    </pre>
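
  Perplexity is the standard goodness-of-fit measure for LDA (see [1]); lower
  is better. For a set of \f$ M \f$ held-out documents it is defined as

  \f[ \mbox{perplexity} = \exp\left\{ -\frac{\sum_{d=1}^{M} \log
      p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\} \f]

  where \f$ \mathbf{w}_d \f$ are the words of document \f$ d \f$ and
  \f$ N_d \f$ is its length.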

@implementation
The input format for this module is very common in machine learning packages
written in various languages, which allows users to generate datasets with any
existing document preprocessing tools or to import existing datasets
conveniently. Internally, the input data will be validated and then converted
to the following format for efficiency:
    <pre>{TABLE} <em>__internal_data_table__</em> (
        <em>docid</em> INTEGER,
        <em>wordcount</em> INTEGER,
        <em>words</em> INTEGER[],
        <em>counts</em> INTEGER[])
    </pre>
where \c docid is the document ID, \c wordcount is the count of words in the
document, \c words is the list of unique words in the document, and \c counts
is the list of numbers of occurrences of each unique word in the document. The
conversion can be done easily with the help of aggregation functions, as
sketched below.
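
For illustration only (this is not the module's internal code), the conversion
could be written with standard aggregates. The sketch below assumes PostgreSQL
9.0+ syntax for ordered aggregates and uses the table and column names of the
input format described above:

\code
-- Reshape (docid, wordid, count) rows into one row per document;
-- ORDER BY wordid keeps the words and counts arrays aligned.
CREATE TABLE __internal_data_table__ AS
SELECT docid,
       SUM(count)::INT4                  AS wordcount,
       array_agg(wordid ORDER BY wordid) AS words,
       array_agg(count ORDER BY wordid)  AS counts
FROM data_table
GROUP BY docid;
\endcode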

@examp

We now give a usage example.

- As a first step, we need to prepare a dataset and vocabulary in the
  appropriate structure.
    \code
    CREATE TABLE my_vocab(wordid INT4, word TEXT)
    m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (wordid)');

    INSERT INTO my_vocab VALUES
    (0, 'code'), (1, 'data'), (2, 'graph'), (3, 'image'), (4, 'input'),
    (5, 'layer'), (6, 'learner'), (7, 'loss'), (8, 'model'), (9, 'network'),
    (10, 'neuron'), (11, 'object'), (12, 'output'), (13, 'rate'), (14, 'set'),
    (15, 'signal'), (16, 'sparse'), (17, 'spatial'), (18, 'system'),
    (19, 'training');

    CREATE TABLE my_training
    (
        docid INT4,
        wordid INT4,
        count INT4
    )
    m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (docid)');

    INSERT INTO my_training VALUES
    (0, 0, 2), (0, 3, 2), (0, 5, 1), (0, 7, 1), (0, 8, 1), (0, 9, 1),
    (0, 11, 1), (0, 13, 1),
    (1, 0, 1), (1, 3, 1), (1, 4, 1), (1, 5, 1), (1, 6, 1), (1, 7, 1),
    (1, 10, 1), (1, 14, 1), (1, 17, 1), (1, 18, 1),
    (2, 4, 2), (2, 5, 1), (2, 6, 2), (2, 12, 1), (2, 13, 1), (2, 15, 1),
    (2, 18, 2),
    (3, 0, 1), (3, 1, 2), (3, 12, 3), (3, 16, 1), (3, 17, 2), (3, 19, 1),
    (4, 1, 1), (4, 2, 1), (4, 3, 1), (4, 5, 1), (4, 6, 1), (4, 10, 1),
    (4, 11, 1), (4, 14, 1), (4, 18, 1), (4, 19, 1),
    (5, 0, 1), (5, 2, 1), (5, 5, 1), (5, 7, 1), (5, 10, 1), (5, 12, 1),
    (5, 16, 1), (5, 18, 1), (5, 19, 2),
    (6, 1, 1), (6, 3, 1), (6, 12, 2), (6, 13, 1), (6, 14, 2), (6, 15, 1),
    (6, 16, 1), (6, 17, 1),
    (7, 0, 1), (7, 2, 1), (7, 4, 1), (7, 5, 1), (7, 7, 2), (7, 8, 1),
    (7, 11, 1), (7, 14, 1), (7, 16, 1),
    (8, 2, 1), (8, 4, 4), (8, 6, 2), (8, 11, 1), (8, 15, 1), (8, 18, 1),
    (9, 0, 1), (9, 1, 1), (9, 4, 1), (9, 9, 2), (9, 12, 2), (9, 15, 1),
    (9, 18, 1), (9, 19, 1);

    CREATE TABLE my_testing
    (
        docid INT4,
        wordid INT4,
        count INT4
    )
    m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (docid)');

    INSERT INTO my_testing VALUES
    (0, 0, 2), (0, 8, 1), (0, 9, 1), (0, 10, 1), (0, 12, 1), (0, 15, 2),
    (0, 18, 1), (0, 19, 1),
    (1, 0, 1), (1, 2, 1), (1, 5, 1), (1, 7, 1), (1, 12, 2), (1, 13, 1),
    (1, 16, 1), (1, 17, 1), (1, 18, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1),
    (2, 6, 1), (2, 12, 1), (2, 14, 1), (2, 18, 1),
    (3, 2, 2), (3, 6, 2), (3, 7, 1), (3, 9, 1), (3, 11, 2), (3, 14, 1),
    (3, 15, 1),
    (4, 1, 1), (4, 2, 2), (4, 3, 1), (4, 5, 2), (4, 6, 1), (4, 11, 1),
    (4, 18, 2);
    \endcode

- To perform training, we call the lda_train() function with the appropriate
  parameters. Here is an example.
    \code
    SELECT MADLib.lda_train(
        'my_training', 'my_model', 'my_outdata', 20, 5, 10, 5, 0.01);
    \endcode

  After a successful run of the lda_train() function, two tables will be
  generated, one storing the learned model and the other storing the output
  data.

  To get detailed information about the learned model, we can run the
  following commands:

  - The topic description by top-k words (a sample query over the result
    appears after this list)
    \code
    SELECT * FROM MADLib.lda_get_topic_desc(
        'my_model', 'my_vocab', 'my_topic_desc', 15);
    \endcode

  - The per-topic word counts
    \code
    SELECT MADLib.lda_get_topic_word_count(
        'my_model', 'my_topic_word_count');
    \endcode

  - The per-word topic counts
    \code
    SELECT MADLib.lda_get_word_topic_count(
        'my_model', 'my_word_topic_count');
    \endcode
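
  For example, the topic descriptions generated above can be inspected with an
  ordinary query; the \c topicid and \c prob columns follow the description
  returned by lda_get_topic_desc():
    \code
    SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;
    \endcode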

  To get the topic counts and the topic assignments for each document, we can
  run the following commands:

  - The per-document topic counts:
    \code
    SELECT
        docid, topic_count
    FROM my_outdata;
    \endcode

  - The per-document topic assignments:
    \code
    SELECT
        docid, words, counts, topic_assignment
    FROM my_outdata;
    \endcode
  By scanning \c words, \c counts, and \c topic_assignment together, we can
  get the topic assignment for each word in a document; a sketch of such a
  scan follows.
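
  A minimal sketch of such a scan, assuming PostgreSQL's generate_subscripts()
  and window functions are available, and that \c topic_assignment lists one
  topic per word occurrence, grouped in the order of \c words:
    \code
    SELECT docid,
           words[i]  AS wordid,
           counts[i] AS count,
           -- positions off+1 .. off+counts[i] of topic_assignment hold the
           -- topics of this word's occurrences, where off is the total count
           -- of the preceding unique words; here we show the first occurrence
           topic_assignment[SUM(counts[i]) OVER (PARTITION BY docid ORDER BY i)
                            - counts[i] + 1] AS first_occurrence_topic
    FROM (
        SELECT docid, words, counts, topic_assignment,
               generate_subscripts(words, 1) AS i
        FROM my_outdata
    ) t
    ORDER BY docid, i;
    \endcode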

- To use a learned LDA model for prediction (i.e. to label new documents), we
  can use the following command:
    \code
    SELECT MADLib.lda_predict(
        'my_testing', 'my_model', 'my_pred');
    \endcode

  After a successful run of the lda_predict() function, the prediction
  results will be generated and stored in <em>my_pred</em>. This table has
  the same schema as the <em>my_outdata</em> table generated by the
  lda_train() function.

  To get the topic counts and the topic assignments for each document, we can
  run the following commands:

  - The per-document topic counts:
    \code
    SELECT
        docid, topic_count
    FROM my_pred;
    \endcode

  - The per-document topic assignments:
    \code
    SELECT
        docid, words, counts, topic_assignment
    FROM my_pred;
    \endcode
  By scanning \c words, \c counts, and \c topic_assignment together, we can
  get the topic assignment for each word in a document.

- To compute the perplexity, we can use the following command:
    \code
    SELECT MADLib.lda_get_perplexity(
        'my_model', 'my_pred');
    \endcode

@literature

[1] D.M. Blei, A.Y. Ng, and M.I. Jordan, <em>Latent Dirichlet Allocation</em>,
    Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[2] T. Griffiths and M. Steyvers, <em>Finding scientific topics</em>, PNAS,
    vol. 101, pp. 5228-5235, 2004.

[3] Y. Wang, H. Bai, M. Stanton, W-Y. Chen, and E.Y. Chang, <em>PLDA: Parallel
    Latent Dirichlet Allocation for Large-scale Applications</em>, AAIM, 2009.

[4] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

[5] J. Chang, <em>Collapsed Gibbs sampling methods for topic models</em>, R
    manual, 2010.

@sa File lda.sql_in documenting the SQL functions.
*/

-- UDT for summarizing a UDF call
DROP TYPE IF EXISTS MADLIB_SCHEMA.lda_result;
CREATE TYPE MADLIB_SCHEMA.lda_result AS
(
    output_table    TEXT,
    description     TEXT
);

/**
 * @brief This UDF provides an entry for the LDA training process.
 * @param data_table        Table storing the training dataset, each row is
 *                          in the form of <docid, wordid, count> where docid,
 *                          wordid, and count are all non-negative integers.
 * @param model_table       Table storing the learned models (voc_size,
 *                          topic_num, alpha, beta, per-word topic counts, and
 *                          corpus-level topic counts)
 * @param output_data_table Table storing the output data table in the form of
 *                          <docid, wordcount, words, counts, topic_count,
 *                          topic_assignment>
 * @param voc_size          Size of the vocabulary (note that the wordids
 *                          should be continuous integers ranging from 0 to
 *                          voc_size - 1; a data validation routine will be
 *                          called to validate the dataset)
 * @param topic_num         Number of topics (e.g. 100)
 * @param iter_num          Number of iterations (e.g. 60)
 * @param alpha             Dirichlet parameter for the per-doc topic
 *                          multinomial (e.g. 50/topic_num)
 * @param beta              Dirichlet parameter for the per-topic word
 *                          multinomial (e.g. 0.01)
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_train
(
    data_table          TEXT,
    model_table         TEXT,
    output_data_table   TEXT,
    voc_size            INT4,
    topic_num           INT4,
    iter_num            INT4,
    alpha               FLOAT8,
    beta                FLOAT8
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.lda_train(
        schema_madlib, data_table, model_table, output_data_table, voc_size,
        topic_num, iter_num, alpha, beta
    )
    return [[model_table, 'model table'],
            [output_data_table, 'output data table']]
$$ LANGUAGE PLPYTHONU STRICT;

/**
 * @brief This UDF provides an entry for the LDA prediction process.
 * @param data_table    Table storing the testing dataset, each row is in the
 *                      form of <docid, wordid, count> where docid, wordid,
 *                      and count are all non-negative integers.
 * @param model_table   Table storing the learned models
 * @param output_table  Table storing per-document topic counts and topic
 *                      assignments
 * @note default iter_num = 20
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_predict
(
    data_table      TEXT,
    model_table     TEXT,
    output_table    TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.lda_predict(schema_madlib, data_table, model_table, output_table)
    return [[
        output_table,
        'per-doc topic distribution and per-word topic assignments']]
$$ LANGUAGE PLPYTHONU STRICT;

/**
 * @brief An overloaded version which allows users to specify iter_num.
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_predict
(
    data_table      TEXT,
    model_table     TEXT,
    output_table    TEXT,
    iter_num        INT4
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.lda_predict(
        schema_madlib, data_table, model_table, output_table, iter_num)
    return [[
        output_table,
        'per-doc topic distribution and per-word topic assignments']]
$$ LANGUAGE PLPYTHONU STRICT;

/**
 * @brief This UDF computes the per-topic word counts.
 * @param model_table   The model table generated by the training process
 * @param output_table  The output table storing the per-topic word counts
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_topic_word_count
(
    model_table     TEXT,
    output_table    TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.get_topic_word_count(schema_madlib, model_table, output_table)
    return [[output_table, 'per-topic word counts']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF computes the per-word topic counts.
 * @param model_table   The model table generated by the training process
 * @param output_table  The output table storing the per-word topic counts
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_word_topic_count
(
    model_table     TEXT,
    output_table    TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.get_word_topic_count(schema_madlib, model_table, output_table)
    return [[output_table, 'per-word topic counts']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF gets the description of each topic (top-k words)
 * @param model_table   The model table generated by the training process
 * @param vocab_table   The vocabulary table (<wordid, word>)
 * @param desc_table    The output table for storing the per-topic description
 * @param top_k         The number of top words for each topic description
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_topic_desc
(
    model_table TEXT,
    vocab_table TEXT,
    desc_table  TEXT,
    top_k       INT4
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.get_topic_desc(
        schema_madlib, model_table, vocab_table, desc_table, top_k)
    return [[
        desc_table,
        """topic description, use "ORDER BY topicid, prob DESC" to check the
        results"""]]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF assigns topics to words in a document randomly.
 * @param word_count    The number of words in the document
 * @param topic_num     The number of topics (specified by the user)
 * @return              The topic counts and topic assignments
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_random_assign
(
    word_count  INT4,
    topic_num   INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_random_assign'
LANGUAGE C STRICT;
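
-- Illustration only, not part of the module: a hypothetical call such as
--   SELECT MADLIB_SCHEMA.__lda_random_assign(8, 5);
-- returns a single INT4 array which, per the description above, presumably
-- packs the 5 per-topic counts together with the 8 per-word topic
-- assignments.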

/**
 * @brief This UDF learns the topics of words in a document and is the main
 * step of a Gibbs sampling iteration. The model parameter (including the
 * per-word topic counts and corpus-level topic counts) is passed to this
 * function in the first call and then transferred to subsequent calls through
 * fcinfo->flinfo->fn_extra to allow immediate updates.
 * @param words     The set of unique words in the document
 * @param counts    The counts of each unique word in the document
 *                  (sum(counts) = word_count)
 * @param doc_topic The current per-doc topic counts and topic assignments
 * @param model     The current model (including the per-word topic counts
 *                  and the corpus-level topic counts)
 * @param alpha     The Dirichlet parameter for the per-document topic
 *                  multinomial
 * @param beta      The Dirichlet parameter for the per-topic word multinomial
 * @param voc_size  The size of vocabulary
 * @param topic_num The number of topics
 * @param iter_num  The number of iterations
 * @return          The learned topic counts and topic assignments
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_gibbs_sample
(
    words       INT4[],
    counts      INT4[],
    doc_topic   INT4[],
    model       INT4[],
    alpha       FLOAT8,
    beta        FLOAT8,
    voc_size    INT4,
    topic_num   INT4,
    iter_num    INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_gibbs_sample'
LANGUAGE C;

/**
 * @brief This UDF is the sfunc for the aggregator computing the topic counts
 * for each word and the topic count in the whole corpus. It scans the topic
 * assignments in a document and updates the topic counts.
 * @param state             The topic counts
 * @param words             The unique words in the document
 * @param counts            The counts of each unique word in the document
 *                          (sum(counts) = word_count)
 * @param topic_assignment  The topic assignments in the document
 * @param voc_size          The size of vocabulary
 * @param topic_num         The number of topics
 * @return                  The updated state
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_count_topic_sfunc
(
    state               INT4[],
    words               INT4[],
    counts              INT4[],
    topic_assignment    INT4[],
    voc_size            INT4,
    topic_num           INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_count_topic_sfunc'
LANGUAGE C;

/**
 * @brief This UDF is the prefunc for the aggregator computing the per-word
 * topic counts.
 * @param state1    The local word topic counts
 * @param state2    The local word topic counts
 * @return          The element-wise sum of two local states
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_count_topic_prefunc
(
    state1  INT4[],
    state2  INT4[]
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_count_topic_prefunc'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief This UDA computes the word topic counts by scanning and summing
 * up topic assignments in each document.
 * @param words             The unique words in the document
 * @param counts            The counts of each unique word in the document
 * @param topic_assignment  The topic assignments in the document
 * @param voc_size          The size of vocabulary
 * @param topic_num         The number of topics
 * @return                  The word topic counts (a 1-D array embedding a
 *                          2-D array)
 **/
DROP AGGREGATE IF EXISTS
MADLIB_SCHEMA.__lda_count_topic_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4,
    INT4
);
CREATE AGGREGATE
MADLIB_SCHEMA.__lda_count_topic_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4,
    INT4
)
(
    stype = INT4[],
    sfunc = MADLIB_SCHEMA.__lda_count_topic_sfunc
    m4_ifdef(
        `GREENPLUM',
        `, prefunc = MADLIB_SCHEMA.__lda_count_topic_prefunc'
    )
);
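
-- Illustration only, not part of the module: a hypothetical invocation of
-- the aggregate over the internal data format, with voc_size = 20 and
-- topic_num = 5 as in the usage example:
--   SELECT MADLIB_SCHEMA.__lda_count_topic_agg(
--       words, counts, topic_assignment, 20, 5)
--   FROM my_outdata;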

/**
 * @brief This UDF computes the perplexity given the output data table and the
 * model table.
 * @param model_table       The model table generated by lda_train
 * @param output_data_table The output data table generated by lda_predict
 * @return                  The perplexity
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_perplexity
(
    model_table         TEXT,
    output_data_table   TEXT
)
RETURNS FLOAT8 AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    return lda.get_perplexity(
        schema_madlib, model_table, output_data_table)
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF is the sfunc for the aggregator computing the perplexity.
 * @param state     The cached model plus perplexity
 * @param words     The unique words in the document
 * @param counts    The counts of each unique word in the document
 * @param doc_topic The topic counts in the document
 * @param model     The learned model
 * @param alpha     The Dirichlet parameter for the per-document topic
 *                  multinomial
 * @param beta      The Dirichlet parameter for the per-topic word multinomial
 * @param voc_size  The size of vocabulary
 * @param topic_num The number of topics
 * @return          The updated state
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_perplexity_sfunc
(
    state       INT4[],
    words       INT4[],
    counts      INT4[],
    doc_topic   INT4[],
    model       INT4[][],
    alpha       FLOAT8,
    beta        FLOAT8,
    voc_size    INT4,
    topic_num   INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_perplexity_sfunc'
LANGUAGE C IMMUTABLE;

/**
 * @brief This UDF is the prefunc for the aggregator computing the perplexity.
 * @param state1    The local state
 * @param state2    The local state
 * @return          The merged state
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_perplexity_prefunc
(
    state1  INT4[],
    state2  INT4[]
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_perplexity_prefunc'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief This UDF is the finalfunc for the aggregator computing the
 * perplexity.
 * @param state The merged state
 * @return      The perplexity
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_perplexity_ffunc
(
    state INT4[]
)
RETURNS FLOAT8
AS 'MODULE_PATHNAME', 'lda_perplexity_ffunc'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief This aggregator computes the perplexity.
 * @param words     The unique words in the document
 * @param counts    The counts of each unique word in the document
 * @param doc_topic The topic counts in the document
 * @param model     The learned model
 * @param alpha     The Dirichlet parameter for the per-document topic
 *                  multinomial
 * @param beta      The Dirichlet parameter for the per-topic word multinomial
 * @param voc_size  The size of vocabulary
 * @param topic_num The number of topics
 * @return          The perplexity
 **/
DROP AGGREGATE IF EXISTS
MADLIB_SCHEMA.__lda_perplexity_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4[],
    FLOAT8,
    FLOAT8,
    INT4,
    INT4
);
CREATE AGGREGATE
MADLIB_SCHEMA.__lda_perplexity_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4[],
    FLOAT8,
    FLOAT8,
    INT4,
    INT4
)
(
    stype = INT4[],
    sfunc = MADLIB_SCHEMA.__lda_perplexity_sfunc,
    finalfunc = MADLIB_SCHEMA.__lda_perplexity_ffunc
    m4_ifdef(
        `GREENPLUM',
        `, prefunc = MADLIB_SCHEMA.__lda_perplexity_prefunc'
    )
);

/**
 * @brief Unnest a 2-D array into a set of 1-D arrays
 * @param arr   The 2-D array to be unnested
 * @return      The unnested 1-D arrays
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_unnest
(
    arr INT4[][]
)
RETURNS SETOF INT4[]
AS 'MODULE_PATHNAME', 'lda_unnest'
LANGUAGE C IMMUTABLE STRICT;
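
-- Illustration only, not part of the module: a hypothetical call such as
--   SELECT MADLIB_SCHEMA.__lda_util_unnest(ARRAY[[1,2,3],[4,5,6]]);
-- returns two rows, {1,2,3} and {4,5,6}.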

/**
 * @brief Transpose a 2-D array
 * @param matrix    The input 2-D array
 * @return          The transposed array
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_transpose
(
    matrix INT4[][]
)
RETURNS INT4[][]
AS 'MODULE_PATHNAME', 'lda_transpose'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief L1 normalization with smoothing
 * @param arr       The array to be normalized
 * @param smooth    The smoothing parameter
 * @return          The normalized vector
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_norm_with_smoothing
(
    arr     FLOAT8[],
    smooth  FLOAT8
)
RETURNS FLOAT8[] AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    return lda.l1_norm_with_smoothing(arr, smooth)
$$ LANGUAGE PLPYTHONU STRICT;
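
-- Illustration only: a common definition of smoothed L1 normalization, which
-- the Python helper l1_norm_with_smoothing plausibly follows (the exact
-- semantics are defined there):
--   norm(arr)[i] = (arr[i] + smooth) / (sum(arr) + smooth * len(arr))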

/**
 * @brief This UDF returns the index of elements in a sorted order
 * @param arr   The array to be sorted
 * @return      The index of elements
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_index_sort
(
    arr FLOAT8[]
)
RETURNS INT4[] AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    return lda.index_sort(arr)
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF checks the vocabulary and converts non-continuous wordids
 * into continuous integers ranging from 0 to voc_size - 1.
 * @param vocab_table           The vocabulary table in the form of
 *                              <wordid::int4, word::text>
 * @param output_vocab_table    The regularized vocabulary table
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_norm_vocab
(
    vocab_table         TEXT,
    output_vocab_table  TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.norm_vocab(vocab_table, output_vocab_table)
    return [[output_vocab_table, 'normalized vocabulary table']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF converts the data table according to the normalized
 * vocabulary; all rows with non-positive count values will be removed
 * @param data_table        The data table to be normalized
 * @param norm_vocab_table  The normalized vocabulary table
 * @param output_data_table The normalized data table
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_norm_dataset
(
    data_table          TEXT,
    norm_vocab_table    TEXT,
    output_data_table   TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.norm_dataset(data_table, norm_vocab_table, output_data_table)
    return [[output_data_table, 'normalized data table']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF extracts the list of wordids from the data table and joins
 * it with the vocabulary table to get the list of common wordids; it then
 * normalizes the vocabulary based on the common wordids and normalizes the
 * data table based on the normalized vocabulary.
 * @param data_table            The data table to be normalized
 * @param vocab_table           The vocabulary table to be normalized
 * @param output_data_table     The normalized data table
 * @param output_vocab_table    The normalized vocabulary table
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_conorm_data
(
    data_table          TEXT,
    vocab_table         TEXT,
    output_data_table   TEXT,
    output_vocab_table  TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.conorm_data(
        data_table, vocab_table, output_data_table, output_vocab_table)
    return [[output_data_table, 'normalized data table'],
            [output_vocab_table, 'normalized vocab table']]
$$ LANGUAGE plpythonu STRICT;
865