User Documentation
lda.sql_in
00001 /* ----------------------------------------------------------------------- *//** 
00002  *
00003  * @file lda.sql_in
00004  *
00005  * @brief SQL functions for Latent Dirichlet Allocation
00006  * @date Dec 2012
00007  *
00008  * @sa For an introduction to Latent Dirichlet Allocation models, see the 
00009        module description \ref grp_lda.
00010  *
00011  *//* ------------------------------------------------------------------------*/
00012 
00013 m4_include(`SQLCommon.m4')
00014 
00015 /**
00016 
00017 @addtogroup grp_lda
00018 
00019 @about
00020 
00021 Latent Dirichlet Allocation (LDA) is a generative probabilistic model for
00022 collections of discrete data such as text corpora, and it has received wide
00023 attention in recent years. The model is versatile, with applications in
00024 automated topic discovery, collaborative filtering, and document classification.
00025 
00026 The LDA model posits that each document is associated with a mixture of various
00027 topics (e.g. a document is related to Topic 1 with probability 0.7, and Topic 2
00028 with probability 0.3), and that each word in the document is attributable to
00029 one of the document's topics. There is a (symmetric) Dirichlet prior with
00030 parameter \f$ \alpha \f$ on each document's topic mixture. In addition, there
00031 is another (symmetric) Dirichlet prior with parameter \f$ \beta \f$ on the
00032 distribution of words for each topic.
00033 
00034 The following generative process then defines a distribution over a corpus of
00035 documents. 
00036 
00037 - For each topic \f$ i \f$, sample a per-topic word
00038 distribution \f$ \phi_i \f$ from the Dirichlet(\f$\beta\f$) prior.
00039 
00040 - For each document:
00041     - Sample a document length N from a suitable distribution, say, Poisson.
00042     - Sample a topic mixture \f$ \theta \f$ for the document from the
00043 Dirichlet(\f$\alpha\f$) distribution.  
00044     - For each of the N words:
00045         - Sample a topic \f$ z_n \f$ from the multinomial topic distribution \f$
00046    \theta \f$.  
00047         - Sample a word \f$ w_n \f$ from the multinomial word distribution \f$
00048    \phi_{z_n} \f$ associated with topic \f$ z_n \f$.
00049 
00050 In practice, only the words in each document are observable. The topic mixture
00051 of each document and the topic of each word in each document are latent
00052 variables that must be inferred from the observed words; this is what is
00053 commonly called the inference problem for LDA. Exact inference is
00054 intractable, but several approximate inference
00055 algorithms for LDA have been developed. The simple and effective Gibbs sampling
00056 algorithm described in Griffiths and Steyvers [2] appears to be the current
00057 algorithm of choice.
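
For reference, the collapsed Gibbs sampler described in [2] repeatedly re-samples the
topic \f$ z_i \f$ of each word occurrence \f$ w_i \f$ in document \f$ d_i \f$ from
\f[
    P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
    \frac{n_{-i,k}^{(w_i)} + \beta}{n_{-i,k}^{(\cdot)} + V\beta}
    \left( n_{-i,k}^{(d_i)} + \alpha \right),
\f]
where \f$ n_{-i,k}^{(w_i)} \f$ is the number of times word \f$ w_i \f$ is assigned
to topic \f$ k \f$, \f$ n_{-i,k}^{(\cdot)} \f$ is the total number of words assigned
to topic \f$ k \f$, \f$ n_{-i,k}^{(d_i)} \f$ is the number of words in document
\f$ d_i \f$ assigned to topic \f$ k \f$, \f$ V \f$ is the vocabulary size, and all
counts exclude the current word \f$ i \f$.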
00058 
00059 This implementation provides a parallel and scalable in-database solution for
00060 LDA based on Gibbs sampling. Unlike implementations based on MPI or Hadoop
00061 MapReduce, this implementation builds on shared-nothing MPP databases and
00062 enables high-performance in-database analytics.
00063 
00064 @input
00065 The \b corpus/dataset to be analyzed is expected to be of the following form:
00066 <pre>{TABLE|VIEW} <em>data_table</em> (
00067     <em>docid</em> INTEGER,
00068     <em>wordid</em> INTEGER,
00069     <em>count</em> INTEGER
00070 )</pre>
00071 where \c docid refers to the document ID, \c wordid is the word ID (the index
00072 of a word in the vocabulary), and \c count is the number of occurrences of the
00073 word in the document. 
00074 
00075 The \b vocabulary/dictionary that indexes all the words found in the corpus is
00076 of the following form:
00077 <pre>{TABLE|VIEW} <em>vocab_table</em> (
00078     <em>wordid</em> INTEGER,
00079     <em>word</em> TEXT
00080 )</pre>
00081 where \c wordid refers to the word ID (the index of a word in the vocabulary) and
00082 \c word is the actual word.
00083 
00084 @usage
00085 - The training (i.e. topic inference) can be done with the following function:
00086     <pre>
00087         SELECT \ref lda_train(
00088             <em>'data_table'</em>,
00089             <em>'model_table'</em>,
00090             <em>'output_data_table'</em>, 
00091             <em>voc_size</em>, 
00092             <em>topic_num</em>,
00093             <em>iter_num</em>, 
00094             <em>alpha</em>, 
00095             <em>beta</em>)
00096     </pre>
00097     
00098     This function stores the resulting model in <tt><em>model_table</em></tt>.
00099     The table has only 1 row and is in the following form:
00100     <pre>{TABLE} <em>model_table</em> (
00101         <em>voc_size</em> INTEGER,
00102         <em>topic_num</em> INTEGER,
00103         <em>alpha</em> FLOAT,
00104         <em>beta</em> FLOAT,
00105         <em>model</em> INTEGER[][])
00106     </pre>
00107 
00108     This function also stores the topic counts and the topic assignments in
00109     each document in <tt><em>output_data_table</em></tt>. The table is in the
00110     following form:
00111     <pre>{TABLE} <em>output_data_table</em> (
00112         <em>docid</em> INTEGER,
00113         <em>wordcount</em> INTEGER,
00114         <em>words</em> INTEGER[],
00115         <em>counts</em> INTEGER[],
00116         <em>topic_count</em> INTEGER[],
00117         <em>topic_assignment</em> INTEGER[])
00118     </pre>
00119 
00120 - The prediction (i.e. labelling of test documents using a learned LDA model)
00121   can be done with the following function: 
00122     <pre>
00123         SELECT \ref lda_predict(
00124             <em>'data_table'</em>,
00125             <em>'model_table'</em>,
00126             <em>'output_table'</em>);
00127     </pre>
00128     
00129     This function stores the prediction results in
00130     <tt><em>output_table</em></tt>. Each row in the table stores the topic
00131     distribution and the topic assignments for a document in the dataset. The
00132     table is in the following form:
00133     <pre>{TABLE} <em>output_table</em> (
00134         <em>docid</em> INTEGER,
00135         <em>wordcount</em> INTEGER,
00136         <em>words</em> INTEGER[],
00137         <em>counts</em> INTEGER[],
00138         <em>topic_count</em> INTEGER[],
00139         <em>topic_assignment</em> INTEGER[])
00140     </pre>
00141 
00142 - This module also provides a function for computing the perplexity:
00143     <pre>
00144         SELECT \ref lda_get_perplexity(
00145             <em>'model_table'</em>,
00146             <em>'output_data_table'</em>);
00147     </pre>
00148 
00149 @implementation
00150 The input format for this module is common to many machine learning
00151 packages written in various languages, which allows users to generate
00152 datasets using existing document preprocessing tools or to import existing
00153 datasets conveniently. Internally, the input data will be validated and then
00154 converted to the following format for efficiency: 
00155     <pre>{TABLE} <em>__internal_data_table__</em> (
00156         <em>docid</em> INTEGER,
00157         <em>wordcount</em> INTEGER,
00158         <em>words</em> INTEGER[],
00159         <em>counts</em> INTEGER[])
00160     </pre>
00161 where \c docid is the document ID, \c wordcount is the count of words in the
00162 document, \c words is the list of unique words in the document, and \c counts
00163 is the list of the number of occurrences of each unique word in the document.
00164 The conversion can be done easily with aggregation functions, as sketched below.
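
A minimal sketch of this conversion, assuming the raw <docid, wordid, count>
layout described above (the table name below is illustrative, not the module's
internal name):
    \code
    -- Collapse the one-row-per-(docid, wordid) layout into one row per document.
    -- Both array_agg calls see the grouped rows in the same order, so the
    -- words and counts arrays stay aligned element by element.
    CREATE TABLE my_internal_data AS
    SELECT
        docid,
        SUM(count)::INT4  AS wordcount,  -- total number of words in the document
        array_agg(wordid) AS words,      -- unique word IDs in the document
        array_agg(count)  AS counts      -- occurrences of each unique word
    FROM my_training
    GROUP BY docid;
    \endcode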
00165 
00166 @examp
00167 
00168 We now give a usage example.
00169 
00170 - As a first step, we need to prepare a dataset and vocabulary in the appropriate structure.
00171     \code
00172     CREATE TABLE my_vocab(wordid INT4, word TEXT)
00173     m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (wordid)');
00174     
00175     INSERT INTO my_vocab VALUES
00176     (0, 'code'), (1, 'data'), (2, 'graph'), (3, 'image'), (4, 'input'), (5,
00177     'layer'), (6, 'learner'), (7, 'loss'), (8, 'model'), (9, 'network'), (10,
00178     'neuron'), (11, 'object'), (12, 'output'), (13, 'rate'), (14, 'set'), (15,
00179     'signal'), (16, 'sparse'), (17, 'spatial'), (18, 'system'), (19, 'training');
00180     
00181     CREATE TABLE my_training 
00182     (
00183         docid INT4, 
00184         wordid INT4, 
00185         count INT4
00186     )
00187     m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (docid)');
00188     
00189     INSERT INTO my_training VALUES
00190     (0, 0, 2),(0, 3, 2),(0, 5, 1),(0, 7, 1),(0, 8, 1),(0, 9, 1),(0, 11, 1),(0, 13,
00191     1), (1, 0, 1),(1, 3, 1),(1, 4, 1),(1, 5, 1),(1, 6, 1),(1, 7, 1),(1, 10, 1),(1,
00192     14, 1),(1, 17, 1),(1, 18, 1), (2, 4, 2),(2, 5, 1),(2, 6, 2),(2, 12, 1),(2, 13,
00193     1),(2, 15, 1),(2, 18, 2), (3, 0, 1),(3, 1, 2),(3, 12, 3),(3, 16, 1),(3, 17,
00194     2),(3, 19, 1), (4, 1, 1),(4, 2, 1),(4, 3, 1),(4, 5, 1),(4, 6, 1),(4, 10, 1),(4,
00195     11, 1),(4, 14, 1),(4, 18, 1),(4, 19, 1), (5, 0, 1),(5, 2, 1),(5, 5, 1),(5, 7,
00196     1),(5, 10, 1),(5, 12, 1),(5, 16, 1),(5, 18, 1),(5, 19, 2), (6, 1, 1),(6, 3,
00197     1),(6, 12, 2),(6, 13, 1),(6, 14, 2),(6, 15, 1),(6, 16, 1),(6, 17, 1), (7, 0,
00198     1),(7, 2, 1),(7, 4, 1),(7, 5, 1),(7, 7, 2),(7, 8, 1),(7, 11, 1),(7, 14, 1),(7,
00199     16, 1), (8, 2, 1),(8, 4, 4),(8, 6, 2),(8, 11, 1),(8, 15, 1),(8, 18, 1),
00200     (9, 0, 1),(9, 1, 1),(9, 4, 1),(9, 9, 2),(9, 12, 2),(9, 15, 1),(9, 18, 1),(9,
00201     19, 1);
00202     
00203     
00204     CREATE TABLE my_testing 
00205     (
00206         docid INT4, 
00207         wordid INT4, 
00208         count INT4
00209     )
00210     m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (docid)');
00211     
00212     INSERT INTO my_testing VALUES
00213     (0, 0, 2),(0, 8, 1),(0, 9, 1),(0, 10, 1),(0, 12, 1),(0, 15, 2),(0, 18, 1),(0,
00214     19, 1), (1, 0, 1),(1, 2, 1),(1, 5, 1),(1, 7, 1),(1, 12, 2),(1, 13, 1),(1, 16,
00215     1),(1, 17, 1),(1, 18, 1), (2, 0, 1),(2, 1, 1),(2, 2, 1),(2, 3, 1),(2, 4, 1),(2,
00216     5, 1),(2, 6, 1),(2, 12, 1),(2, 14, 1),(2, 18, 1), (3, 2, 2),(3, 6, 2),(3, 7,
00217     1),(3, 9, 1),(3, 11, 2),(3, 14, 1),(3, 15, 1), (4, 1, 1),(4, 2, 2),(4, 3,
00218     1),(4, 5, 2),(4, 6, 1),(4, 11, 1),(4, 18, 2);
00219     \endcode
00220 
00221 - To perform training, we call the lda_train() function with the
00222 appropriate parameters. Here is an example.
00223     \code
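    -- Arguments, in order: training data table, model table to create, output
    -- data table to create, voc_size = 20, topic_num = 5, iter_num = 10,
    -- alpha = 5, beta = 0.01.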
00224     SELECT MADLib.lda_train(
00225         'my_training',  'my_model', 'my_outdata', 20, 5, 10, 5, 0.01);
00226     \endcode
00227 
00228     After a successful run of the lda_train() function, two tables will be
00229     generated: one storing the learned model, and another storing the
00230     per-document output data.
00231 
00232     To get the detailed information about the learned model, we can run the
00233     following commands:
00234 
00235     - The topic description by top-k words
00236     \code
00237         SELECT * FROM MADLib.lda_get_topic_desc(
00238             'my_model', 'my_vocab', 'my_topic_desc', 15);
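        -- The description table can then be inspected directly; per the
        -- module's output description it can be ordered by topic and
        -- probability (column names as reported by the function):
        SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;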
00239     \endcode
00240 
00241     - The per-topic word counts
00242     \code
00243         SELECT MADLib.lda_get_topic_word_count(
00244             'my_model', 'my_topic_word_count');
00245     \endcode
00246 
00247     - The per-word topic counts
00248     \code
00249         SELECT MADLib.lda_get_word_topic_count(
00250             'my_model', 'my_word_topic_count');
00251     \endcode
00252 
00253     To get the topic counts and the topic assignments for each document, we
00254     can run the following commands:
00255 
00256     - The per-document topic counts:
00257     \code
00258         SELECT 
00259             docid, topic_count 
00260         FROM my_outdata;
00261     \endcode
00262 
00263     - The per-document topic assignments:
00264     \code
00265         SELECT 
00266             docid, words, counts, topic_assignment 
00267         FROM my_outdata;
00268     \endcode
00269     By scanning \c words, \c counts, and \c topic_assignment together, we can
00270     get the topic assignment for each word in a document.
00271 
00272 - To use a learned LDA model for prediction (i.e. to label new documents), we can use the following command: 
00273     \code
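    -- Uses the default of 20 Gibbs iterations; an overloaded version of
    -- lda_predict() accepts an explicit iter_num as a fourth argument.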
00274     SELECT MADLib.lda_predict(
00275         'my_testing', 'my_model', 'my_pred');
00276     \endcode
00277 
00278     After a successful run of the lda_predict() function, the prediction
00279     results will be generated and stored in <em>my_pred</em>. This table has
00280     the same schema as the <em>my_outdata</em> generated by the lda_train()
00281     function.
00282     
00283     To get the topic counts and the topic assignments for each document, we
00284     can run the following commands:
00285 
00286     - The per-document topic counts:
00287     \code
00288         SELECT 
00289             docid, topic_count 
00290         FROM my_pred;
00291     \endcode
00292 
00293     - The per-document topic assignments:
00294     \code
00295         SELECT 
00296             docid, words, counts, topic_assignment 
00297         FROM my_pred;
00298     \endcode
00299     By scanning \c words, \c counts, and \c topic_assignment together, we can
00300     get the topic assignment for each word in a document.
00301 
00302 - To compute the perplexity, we can use the following command:
00303     \code
00304         SELECT MADLib.lda_get_perplexity(
00305             'my_model', 'my_pred');
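        -- Returns a single FLOAT8 value; lower perplexity indicates a better
        -- fit of the model to the evaluated documents.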
00306     \endcode
00307 
00308 @literature
00309 
00310 [1] D.M. Blei, A.Y. Ng, M.I. Jordan, <em>Latent Dirichlet Allocation</em>,
00311     Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
00312 
00313 [2] T. Griffiths and M. Steyvers, <em>Finding scientific topics</em>, PNAS,
00314     vol. 101, pp. 5228-5235, 2004.
00315 
00316 [3] Y. Wang, H. Bai, M. Stanton, W-Y. Chen, and E.Y. Chang, <em>PLDA: Parallel
00317     Latent Dirichlet Allocation for Large-scale Applications</em>, AAIM, 2009.
00318 
00319 [4] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
00320 
00321 [5] J. Chang, Collapsed Gibbs sampling methods for topic models, R manual,
00322     2010.
00323 
00324 @sa File lda.sql_in documenting the SQL functions.
00325 */
00326 
00327 -- UDT for summarizing a UDF call
00328 DROP TYPE IF EXISTS MADLIB_SCHEMA.lda_result;
00329 CREATE TYPE MADLIB_SCHEMA.lda_result AS
00330 (
00331     output_table    TEXT,
00332     description     TEXT
00333 );
00334 
00335 /**
00336  * @brief This UDF provides an entry for the lda training process.
00337  * @param data_table        Table storing the training dataset, each row is in
00338  *                          the form of <docid, wordid, count> where docid,
00339  *                          wordid, and count are all non-negative integers.
00340  * @param model_table       Table storing the learned model (voc_size, topic_num,
00341  *                          alpha, beta, per-word topic counts, and
00342  *                          corpus-level topic counts)
00343  * @param output_data_table Table storing the output data in the form of
00344  *                          <docid, wordcount, words, counts, topic_count,
00345  *                          topic_assignment>
00346  * @param voc_size          Size of the vocabulary (note that the wordid should
00347  *                          be continuous integers starting from 0 to voc_size -
00348  *                          1; a data validation routine will be called to
00349  *                          validate the dataset)
00350  * @param topic_num         Number of topics (e.g. 100)
00351  * @param iter_num          Number of iterations (e.g. 60)
00352  * @param alpha             Dirichlet parameter for the per-doc topic multinomial
00353  *                          (e.g. 50/topic_num)
00354  * @param beta              Dirichlet parameter for the per-topic word multinomial
00355  *                          (e.g. 0.01)
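 *
 * Example (the call used in the walkthrough in the module description above):
 * \code
 * SELECT MADLib.lda_train(
 *     'my_training', 'my_model', 'my_outdata', 20, 5, 10, 5, 0.01);
 * \endcode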
00356  **/
00357 CREATE OR REPLACE FUNCTION
00358 MADLIB_SCHEMA.lda_train
00359 (
00360     data_table          TEXT, 
00361     model_table         TEXT,
00362     output_data_table   TEXT,
00363     voc_size            INT4, 
00364     topic_num           INT4, 
00365     iter_num            INT4, 
00366     alpha               FLOAT8, 
00367     beta                FLOAT8 
00368 )
00369 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00370     PythonFunctionBodyOnly(`lda', `lda')
00371     lda.lda_train(
00372         schema_madlib, data_table, model_table, output_data_table, voc_size,
00373         topic_num, iter_num, alpha, beta
00374     )
00375     return [[model_table, 'model table'], 
00376         [output_data_table, 'output data table']]
00377 $$ LANGUAGE PLPYTHONU STRICT;
00378 
00379 
00380 /**
00381  * @brief This UDF provides an entry for the lda prediction process.
00382  * @param data_table    Table storing the testing dataset, each row is in the 
00383  *                      form of <docid, wordid, count>
00384  *                      where docid, wordid, and count are all non-negative 
00385  *                      integers.
00386  * @param model_table   Table storing the learned models
00387  * @param output_table  Table storing per-document topic counts and topic 
00388  *                      assignments
00389  * @note default iter_num = 20
00390  **/
00391 CREATE OR REPLACE FUNCTION
00392 MADLIB_SCHEMA.lda_predict
00393 (
00394     data_table      TEXT,
00395     model_table     TEXT,
00396     output_table    TEXT
00397 )
00398 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00399     PythonFunctionBodyOnly(`lda', `lda')
00400     lda.lda_predict(schema_madlib, data_table, model_table, output_table)
00401     return [[
00402         output_table, 
00403         'per-doc topic distribution and per-word topic assignments']]
00404 $$ LANGUAGE PLPYTHONU STRICT;
00405 
00406 /**
00407  * @brief An overloaded version which allows users to specify iter_num.
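 *
 * Example (a sketch; the explicit iteration count of 30 here is illustrative):
 * \code
 * SELECT MADLib.lda_predict('my_testing', 'my_model', 'my_pred', 30);
 * \endcode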
00408  **/
00409 CREATE OR REPLACE FUNCTION
00410 MADLIB_SCHEMA.lda_predict
00411 (
00412     data_table      TEXT,
00413     model_table     TEXT,
00414     output_table    TEXT,
00415     iter_num        INT4
00416 )
00417 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00418     PythonFunctionBodyOnly(`lda', `lda')
00419     lda.lda_predict(
00420         schema_madlib, data_table, model_table, output_table, iter_num)
00421     return [[
00422         output_table, 
00423         'per-doc topic distribution and per-word topic assignments']]
00424 $$ LANGUAGE PLPYTHONU STRICT;
00425 /**
00426  * @brief This UDF computes the per-topic word counts.
00427  * @param model_table   The model table generated by the training process
00428  * @param output_table  The output table storing the per-topic word counts
00429  **/
00430 CREATE OR REPLACE FUNCTION
00431 MADLIB_SCHEMA.lda_get_topic_word_count
00432 (
00433     model_table     TEXT,
00434     output_table    TEXT
00435 )
00436 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00437     PythonFunctionBodyOnly(`lda', `lda')
00438     lda.get_topic_word_count(schema_madlib, model_table, output_table)
00439     return [[output_table, 'per-topic word counts']]
00440 $$ LANGUAGE plpythonu STRICT;
00441 
00442 /**
00443  * @brief This UDF computes the per-word topic counts.
00444  * @param model_table   The model table generated by the training process
00445  * @param output_table  The output table storing the per-word topic counts
00446  **/
00447 CREATE OR REPLACE FUNCTION
00448 MADLIB_SCHEMA.lda_get_word_topic_count
00449 (
00450     model_table     TEXT, 
00451     output_table    TEXT
00452 )
00453 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00454     PythonFunctionBodyOnly(`lda', `lda')
00455     lda.get_word_topic_count(schema_madlib, model_table, output_table)
00456     return [[output_table, 'per-word topic counts']]
00457 $$ LANGUAGE plpythonu STRICT;
00458 
00459 /**
00460  * @brief This UDF gets the description for each topic (top-k words) 
00461  * @param model_table   The model table generated by the training process
00462  * @param vocab_table   The vocabulary table (<wordid, word>)
00463  * @param desc_table    The output table for storing the per-topic descriptions
00464  * @param top_k         The number of top words for each topic description
00465  **/
00466 CREATE OR REPLACE FUNCTION
00467 MADLIB_SCHEMA.lda_get_topic_desc
00468 (
00469     model_table TEXT,
00470     vocab_table TEXT,
00471     desc_table  TEXT,
00472     top_k       INT4
00473 )
00474 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00475     PythonFunctionBodyOnly(`lda', `lda')
00476     lda.get_topic_desc(
00477         schema_madlib, model_table, vocab_table, desc_table, top_k)
00478     return [[
00479         desc_table, 
00480         """topic description, use "ORDER BY topicid, prob DESC" to check the
00481         results"""]]
00482 $$ LANGUAGE plpythonu STRICT;
00483 
00484 /**
00485  * @brief This UDF assigns topics to words in a document randomly.
00486  * @param word_count    The number of words in the document
00487  * @param topic_num     The number of topics (specified by the user)
00488  * @return              The topic counts and topic assignments 
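 *
 * Example (illustrative direct call; this function is normally invoked
 * internally by lda_train):
 * \code
 * SELECT MADLib.__lda_random_assign(10, 5);
 * \endcode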
00489  **/
00490 CREATE OR REPLACE FUNCTION
00491 MADLIB_SCHEMA.__lda_random_assign
00492 (
00493     word_count  INT4, 
00494     topic_num   INT4
00495 )
00496 RETURNS INT4[]
00497 AS 'MODULE_PATHNAME', 'lda_random_assign'
00498 LANGUAGE C STRICT;
00499 
00500 /**
00501  * @brief This UDF learns the topics of words in a document and is the main
00502  * step of a Gibbs sampling iteration. The model parameter (including the
00503  * per-word topic counts and corpus-level topic counts) is passed to this
00504  * function in the first call and then transferred to subsequent calls through
00505  * fcinfo->flinfo->fn_extra to allow immediate updates.
00506  * @param words             The set of unique words in the document
00507  * @param counts            The count of each unique word in the document
00508  *                          (sum(counts) = word_count)
00509  * @param doc_topic         The current per-doc topic counts and topic
00510  *                          assignments
00511  * @param model             The current model (including the per-word topic counts
00512  *                          and the corpus-level topic counts)
00513  * @param alpha             The Dirichlet parameter for per-document topic multinomial             
00514  * @param beta              The Dirichlet parameter for per-topic word multinomial
00515  * @param voc_size          The size of vocabulary
00516  * @param topic_num         The number of topics
00517  * @return                  The learned topic counts and topic assignments 
00518  **/
00519 CREATE OR REPLACE FUNCTION
00520 MADLIB_SCHEMA.__lda_gibbs_sample
00521 (
00522     words       INT4[], 
00523     counts      INT4[],
00524     doc_topic   INT4[],
00525     model       INT4[], 
00526     alpha       FLOAT8, 
00527     beta        FLOAT8, 
00528     voc_size    INT4, 
00529     topic_num   INT4, 
00530     iter_num    INT4
00531 )
00532 RETURNS INT4[]
00533 AS 'MODULE_PATHNAME', 'lda_gibbs_sample'
00534 LANGUAGE C;
00535 
00536 /**
00537  * @brief This UDF is the sfunc for the aggregator computing the topic counts
00538  * for each word and the topic count in the whole corpus. It scans the topic
00539  * assignments in a document and updates the topic counts.
00540  * @param state             The topic counts
00541  * @param words             The unique words in the document
00542  * @param counts            The counts of each unique words in the document
00543  *                          (sum(counts) = word_count)
00544  * @param topic_assignment  The topic assignments in the document
00545  * @param topic_num         The number of topics
00546  * @return                  The updated state
00547  **/
00548 CREATE OR REPLACE FUNCTION 
00549 MADLIB_SCHEMA.__lda_count_topic_sfunc
00550 (
00551     state               INT4[],
00552     words               INT4[], 
00553     counts              INT4[],
00554     topic_assignment    INT4[], 
00555     voc_size            INT4, 
00556     topic_num           INT4
00557 )
00558 RETURNS INT4[]
00559 AS 'MODULE_PATHNAME', 'lda_count_topic_sfunc'
00560 LANGUAGE C;
00561 
00562 /**
00563  * @brief This UDF is the prefunc for the aggregator computing the per-word
00564  * topic counts.
00565  * @param state1    The local word topic counts
00566  * @param state2    The local word topic counts
00567  * @return          The element-wise sum of two local states
00568  **/
00569 CREATE OR REPLACE FUNCTION 
00570 MADLIB_SCHEMA.__lda_count_topic_prefunc
00571 (
00572     state1  INT4[],
00573     state2  INT4[]
00574 )
00575 RETURNS INT4[]
00576 AS 'MODULE_PATHNAME', 'lda_count_topic_prefunc'
00577 LANGUAGE C IMMUTABLE STRICT;
00578 
00579 /**
00580  * @brief This UDA computes the word topic counts by scanning and summing
00581  * up topic assignments in each document.
00582  * @param words             The unique words in the document
00583  * @param counts            The count of each unique word in the document
00584  * @param topic_assignment  The topic assignments in the document
00585  * @param voc_size          The size of vocabulary
00586  * @param topic_num         The number of topics
00587  * @return                  The word topic counts (a 1-D array embedding a 2-D array)
00588  **/
00589 DROP AGGREGATE IF EXISTS 
00590 MADLIB_SCHEMA.__lda_count_topic_agg
00591 (
00592     INT4[], 
00593     INT4[], 
00594     INT4[], 
00595     INT4, 
00596     INT4
00597 );
00598 CREATE AGGREGATE 
00599 MADLIB_SCHEMA.__lda_count_topic_agg
00600 (
00601     INT4[], 
00602     INT4[], 
00603     INT4[], 
00604     INT4, 
00605     INT4
00606 )
00607 (
00608     stype = INT4[],
00609     sfunc = MADLIB_SCHEMA.__lda_count_topic_sfunc
00610     m4_ifdef(
00611         `GREENPLUM', 
00612         `, prefunc = MADLIB_SCHEMA.__lda_count_topic_prefunc'
00613     )
00614 );
00615 
00616 /**
00617  * @brief This UDF computes the perplexity given the output data table and the
00618  * model table.
00619  * @param model_table   The model table generated by lda_train
00620  * @param output_table  The output data table generated by lda_predict
00621  * @return              The perplexity
00622  **/
00623 CREATE OR REPLACE FUNCTION
00624 MADLIB_SCHEMA.lda_get_perplexity
00625 (
00626     model_table       TEXT,
00627     output_data_table TEXT
00628 )
00629 RETURNS FLOAT8 AS $$
00630     PythonFunctionBodyOnly(`lda', `lda')
00631     return lda.get_perplexity(
00632         schema_madlib, model_table, output_data_table)
00633 $$ LANGUAGE plpythonu STRICT;
00634 
00635 /**
00636  * @brief This UDF is the sfunc for the aggregator computing the perplexity.
00637  * @param state             The cached model plus perplexity
00638  * @param words             The unique words in the document
00639  * @param counts            The count of each unique word in the document
00640  * @param doc_topic         The topic counts in the document
00641  * @param model             The learned model
00642  * @param alpha             The Dirichlet parameter for per-document topic multinomial             
00643  * @param beta              The Dirichlet parameter for per-topic word multinomial
00644  * @param voc_size          The size of vocabulary
00645  * @param topic_num         The number of topics
00646  * @return                  The updated state 
00647  **/
00648 CREATE OR REPLACE FUNCTION 
00649 MADLIB_SCHEMA.__lda_perplexity_sfunc
00650 (
00651     state               INT4[],
00652     words               INT4[], 
00653     counts              INT4[],
00654     doc_topic           INT4[], 
00655     model               INT4[][], 
00656     alpha               FLOAT8,
00657     beta                FLOAT8,
00658     voc_size            INT4, 
00659     topic_num           INT4
00660 )
00661 RETURNS INT4[]
00662 AS 'MODULE_PATHNAME', 'lda_perplexity_sfunc'
00663 LANGUAGE C IMMUTABLE;
00664 
00665 /**
00666  * @brief This UDF is the prefunc for the aggregator computing the perplexity.
00667  * @param state1    The local state 
00668  * @param state2    The local state 
00669  * @return          The merged state 
00670  **/
00671 CREATE OR REPLACE FUNCTION 
00672 MADLIB_SCHEMA.__lda_perplexity_prefunc
00673 (
00674     state1  INT4[],
00675     state2  INT4[] 
00676 )
00677 RETURNS INT4[] 
00678 AS 'MODULE_PATHNAME', 'lda_perplexity_prefunc'
00679 LANGUAGE C IMMUTABLE STRICT;
00680 
00681 /**
00682  * @brief This UDF is the finalfunc for the aggregator computing the perplexity.
00683  * @param state    The merged state 
00684  * @return         The perplexity
00685  **/
00686 CREATE OR REPLACE FUNCTION 
00687 MADLIB_SCHEMA.__lda_perplexity_ffunc
00688 (
00689     state  INT4[]
00690 )
00691 RETURNS FLOAT8 
00692 AS 'MODULE_PATHNAME', 'lda_perplexity_ffunc'
00693 LANGUAGE C IMMUTABLE STRICT;
00694 
00695 /**
00696  * @brief This aggregator computes the perplexity.
00697  * @param words             The unique words in the document
00698  * @param counts            The count of each unique word in the document
00699  * @param doc_topic         The topic counts in the document
00700  * @param model             The learned model
00701  * @param alpha             The Dirichlet parameter for per-document topic multinomial             
00702  * @param beta              The Dirichlet parameter for per-topic word multinomial
00703  * @param voc_size          The size of vocabulary
00704  * @param topic_num         The number of topics
00705  * @return                  The updated perplexity
00706  **/
00707 DROP AGGREGATE IF EXISTS 
00708 MADLIB_SCHEMA.__lda_perplexity_agg
00709 (
00710     INT4[], 
00711     INT4[],
00712     INT4[], 
00713     INT4[], 
00714     FLOAT8,
00715     FLOAT8,
00716     INT4, 
00717     INT4
00718 );
00719 CREATE AGGREGATE 
00720 MADLIB_SCHEMA.__lda_perplexity_agg
00721 (
00722     INT4[], 
00723     INT4[],
00724     INT4[], 
00725     INT4[], 
00726     FLOAT8,
00727     FLOAT8,
00728     INT4, 
00729     INT4
00730 )
00731 (
00732     stype = INT4[],
00733     sfunc = MADLIB_SCHEMA.__lda_perplexity_sfunc,
00734     finalfunc = MADLIB_SCHEMA.__lda_perplexity_ffunc
00735     m4_ifdef(
00736         `GREENPLUM', 
00737         `, prefunc = MADLIB_SCHEMA.__lda_perplexity_prefunc'
00738     )
00739 );
00740 
00741 /**
00742  * @brief Unnest a 2-D array into a set of 1-D arrays 
00743  * @param arr   The 2-D array to be unnested
00744  * @return      The unnested 1-D arrays
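 *
 * Example (illustrative; <tt>model</tt> is the 2-D model column of the model
 * table produced by lda_train):
 * \code
 * SELECT MADLib.__lda_util_unnest(model) FROM my_model;
 * \endcode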
00745  **/
00746 CREATE OR REPLACE FUNCTION 
00747 MADLIB_SCHEMA.__lda_util_unnest
00748 (
00749     arr INT4[][]
00750 )
00751 RETURNS SETOF INT4[] 
00752 AS 'MODULE_PATHNAME', 'lda_unnest'
00753 LANGUAGE C IMMUTABLE STRICT;
00754 
00755 /**
00756  * @brief Transpose a 2-D array
00757  * @param matrix    The input 2-D array
00758  * @return          The transposed array
00759  **/
00760 CREATE OR REPLACE FUNCTION
00761 MADLIB_SCHEMA.__lda_util_transpose
00762 (
00763     matrix  INT4[][]
00764 )
00765 RETURNS INT4[][]
00766 AS 'MODULE_PATHNAME', 'lda_transpose'
00767 LANGUAGE C IMMUTABLE STRICT;
00768 
00769 /**
00770  * @brief L1 normalization with smoothing
00771  * @param arr       The array to be normalized
00772  * @param smooth    The smoothing parameter
00773  * @return          The normalized vector
00774  **/
00775 CREATE OR REPLACE FUNCTION 
00776 MADLIB_SCHEMA.__lda_util_norm_with_smoothing
00777 (
00778     arr     FLOAT8[],
00779     smooth  FLOAT8
00780 )
00781 RETURNS FLOAT8[] AS $$
00782     PythonFunctionBodyOnly(`lda', `lda')
00783     return lda.l1_norm_with_smoothing(arr, smooth)
00784 $$ LANGUAGE PLPYTHONU STRICT;
00785 
00786 /**
00787  * @brief This UDF returns the indices of the elements in sorted order
00788  * @param arr   The array to be sorted
00789  * @return      The index of elements
00790  **/
00791 CREATE OR REPLACE FUNCTION
00792 MADLIB_SCHEMA.__lda_util_index_sort
00793 (
00794     arr FLOAT8[]
00795 )
00796 RETURNS INT4[] AS $$
00797     PythonFunctionBodyOnly(`lda', `lda')
00798     return lda.index_sort(arr)
00799 $$ LANGUAGE plpythonu STRICT;
00800 
00801 /**
00802  * @brief This UDF checks the vocabulary and converts non-continuous wordids into
00803  * continuous integers ranging from 0 to voc_size - 1.
00804  * @param vocab_table   The vocabulary table in the form of 
00805                         <wordid::int4, word::text>
00806  * @param output_vocab_table     The regularized vocabulary table 
00807  **/
00808 CREATE OR REPLACE FUNCTION
00809 MADLIB_SCHEMA.__lda_util_norm_vocab
00810 (
00811     vocab_table         TEXT,
00812     output_vocab_table  TEXT
00813 )
00814 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00815     PythonFunctionBodyOnly(`lda', `lda')
00816     lda.norm_vocab(vocab_table, output_vocab_table)
00817     return [[output_vocab_table,'normalized vocabulary table']]
00818 $$ LANGUAGE plpythonu STRICT;
00819 
00820 /**
00821  * @brief This UDF converts the data table according to the normalized
00822  * vocabulary; rows with non-positive count values are removed.
00823  * @param data_table         The data table to be normalized
00824  * @param norm_vocab_table   The normalized vocabulary table
00825  * @param output_data_table  The normalized data table
00826  **/
00827 CREATE OR REPLACE FUNCTION
00828 MADLIB_SCHEMA.__lda_util_norm_dataset
00829 (
00830     data_table          TEXT,
00831     norm_vocab_table    TEXT,
00832     output_data_table   TEXT
00833 )
00834 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00835     PythonFunctionBodyOnly(`lda', `lda')
00836     lda.norm_dataset(data_table, norm_vocab_table, output_data_table)
00837     return [[output_data_table,'normalized data table']]
00838 $$ LANGUAGE plpythonu STRICT;
00839 
00840 /**
00841  * @brief This UDF extracts the list of wordids from the data table and joins
00842  * it with the vocabulary table to get the list of common wordids; it then
00843  * normalizes the vocabulary based on the common wordids and normalizes the
00844  * data table based on the normalized vocabulary.
00845  * @param data_table            The data table to be normalized
00846  * @param vocab_table           The vocabulary table to be normalized
00847  * @param output_data_table     The normalized data table
00848  * @param output_vocab_table    The normalized vocabulary table
00849  **/
00850 CREATE OR REPLACE FUNCTION
00851 MADLIB_SCHEMA.__lda_util_conorm_data
00852 (
00853     data_table          TEXT,
00854     vocab_table         TEXT,
00855     output_data_table   TEXT,
00856     output_vocab_table  TEXT
00857 )
00858 RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
00859     PythonFunctionBodyOnly(`lda', `lda')
00860     lda.conorm_data(
00861         data_table, vocab_table, output_data_table, output_vocab_table)
00862     return [[output_data_table,'normalized data table'],
00863         [output_vocab_table,'normalized vocab table']]
00864 $$ LANGUAGE plpythonu STRICT;
00865