MADlib 0.7 User Documentation

/* ----------------------------------------------------------------------- *//**
 *
 * @file lda.sql_in
 *
 * @brief SQL functions for Latent Dirichlet Allocation
 * @date Dec 2012
 *
 * @sa For an introduction to Latent Dirichlet Allocation models, see the
 *     module description \ref grp_lda.
 *
 *//* ------------------------------------------------------------------------*/

m4_include(`SQLCommon.m4')

/**

@addtogroup grp_lda

@about

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for
natural texts that has received considerable attention in recent years. The
model is quite versatile, having found uses in problems like automated topic
discovery, collaborative filtering, and document classification.

The LDA model posits that each document is associated with a mixture of various
topics (e.g. a document is related to Topic 1 with probability 0.7, and Topic 2
with probability 0.3), and that each word in the document is attributable to
one of the document's topics. There is a (symmetric) Dirichlet prior with
parameter \f$ \alpha \f$ on each document's topic mixture. In addition, there
is another (symmetric) Dirichlet prior with parameter \f$ \beta \f$ on the
distribution of words for each topic.

The following generative process then defines a distribution over a corpus of
documents.

- For each topic \f$ i \f$, sample a per-topic word distribution
  \f$ \phi_i \f$ from the Dirichlet(\f$\beta\f$) prior.

- For each document:
  - Sample a document length N from a suitable distribution, say, Poisson.
  - Sample a topic mixture \f$ \theta \f$ for the document from the
    Dirichlet(\f$\alpha\f$) distribution.
  - For each of the N words:
    - Sample a topic \f$ z_n \f$ from the multinomial topic distribution
      \f$ \theta \f$.
    - Sample a word \f$ w_n \f$ from the multinomial word distribution
      \f$ \phi_{z_n} \f$ associated with topic \f$ z_n \f$.

In practice, only the words in each document are observable. The topic mixture
of each document and the topic of each word in each document are latent,
unobservable variables that need to be inferred from the observables; this
inference task is what is usually meant by the inference problem for LDA.
Exact inference is intractable, but several approximate inference algorithms
for LDA have been developed. The simple and effective Gibbs sampling algorithm
described in Griffiths and Steyvers [2] appears to be the current algorithm of
choice.

This implementation provides a parallel and scalable in-database solution for
LDA based on Gibbs sampling. Unlike implementations based on MPI or Hadoop
Map/Reduce, this implementation builds upon shared-nothing MPP databases and
enables high-performance in-database analytics.

@input
The \b corpus/dataset to be analyzed is expected to be of the following form:
<pre>{TABLE|VIEW} <em>data_table</em> (
    <em>docid</em> INTEGER,
    <em>wordid</em> INTEGER,
    <em>count</em> INTEGER
)</pre>
where \c docid refers to the document ID, \c wordid is the word ID (the index
of a word in the vocabulary), and \c count is the number of occurrences of the
word in the document.

The \b vocabulary/dictionary that indexes all the words found in the corpus is
of the following form:
<pre>{TABLE|VIEW} <em>vocab_table</em> (
    <em>wordid</em> INTEGER,
    <em>word</em> TEXT
)</pre>
where \c wordid refers to the word ID (the index of a word in the vocabulary)
and \c word is the actual word.

@usage
- The training (i.e. topic inference) can be done with the following function:
    <pre>
    SELECT \ref lda_train(
        <em>'data_table'</em>,
        <em>'model_table'</em>,
        <em>'output_data_table'</em>,
        <em>voc_size</em>,
        <em>topic_num</em>,
        <em>iter_num</em>,
        <em>alpha</em>,
        <em>beta</em>)
    </pre>

  This function stores the resulting model in <tt><em>model_table</em></tt>.
  The table has only 1 row and is in the following form:
    <pre>{TABLE} <em>model_table</em> (
        <em>voc_size</em> INTEGER,
        <em>topic_num</em> INTEGER,
        <em>alpha</em> FLOAT,
        <em>beta</em> FLOAT,
        <em>model</em> INTEGER[][])
    </pre>

  This function also stores the topic counts and the topic assignments of
  each document in <tt><em>output_data_table</em></tt>. The table is in the
  following form:
    <pre>{TABLE} <em>output_data_table</em> (
        <em>docid</em> INTEGER,
        <em>wordcount</em> INTEGER,
        <em>words</em> INTEGER[],
        <em>counts</em> INTEGER[],
        <em>topic_count</em> INTEGER[],
        <em>topic_assignment</em> INTEGER[])
    </pre>

- The prediction (i.e. labelling of test documents using a learned LDA model)
  can be done with the following function:
    <pre>
    SELECT \ref lda_predict(
        <em>'data_table'</em>,
        <em>'model_table'</em>,
        <em>'output_table'</em>);
    </pre>

  This function stores the prediction results in
  <tt><em>output_table</em></tt>. Each row in the table stores the topic
  distribution and the topic assignments for a document in the dataset. The
  table is in the following form:
    <pre>{TABLE} <em>output_table</em> (
        <em>docid</em> INTEGER,
        <em>wordcount</em> INTEGER,
        <em>words</em> INTEGER[],
        <em>counts</em> INTEGER[],
        <em>topic_count</em> INTEGER[],
        <em>topic_assignment</em> INTEGER[])
    </pre>

- This module also provides a function for computing the perplexity:
    <pre>
    SELECT \ref lda_get_perplexity(
        <em>'model_table'</em>,
        <em>'output_data_table'</em>);
    </pre>

@implementation
The input format for this module is common to many machine learning packages
written in various languages, which allows users to generate datasets using
existing document preprocessing tools, or to import existing datasets
conveniently. Internally, the input data will be validated and then converted
to the following format for efficiency:
<pre>{TABLE} <em>__internal_data_table__</em> (
    <em>docid</em> INTEGER,
    <em>wordcount</em> INTEGER,
    <em>words</em> INTEGER[],
    <em>counts</em> INTEGER[])
</pre>
where \c docid is the document ID, \c wordcount is the count of words in the
document, \c words is the list of unique words in the document, and \c counts
is the list of occurrence counts of each unique word in the document. The
conversion can be done easily with the help of aggregation functions, as
sketched below.
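
For illustration, a minimal sketch of such a conversion, assuming a raw table
named <tt>my_training</tt> in the <docid, wordid, count> format and a
PostgreSQL version that supports ordered aggregates (the module performs the
actual validation and conversion internally):
\code
-- Collect each document's word IDs and their counts into parallel arrays,
-- both ordered by wordid so that words[i] corresponds to counts[i].
SELECT
    docid,
    SUM(count)::INT4                  AS wordcount,
    array_agg(wordid ORDER BY wordid) AS words,
    array_agg(count  ORDER BY wordid) AS counts
FROM my_training
GROUP BY docid;
\endcode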

@examp

We now give a usage example.

- As a first step, we need to prepare a dataset and vocabulary in the
  appropriate structure.
\code
CREATE TABLE my_vocab(wordid INT4, word TEXT)
m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (wordid)');

INSERT INTO my_vocab VALUES
(0, 'code'), (1, 'data'), (2, 'graph'), (3, 'image'), (4, 'input'),
(5, 'layer'), (6, 'learner'), (7, 'loss'), (8, 'model'), (9, 'network'),
(10, 'neuron'), (11, 'object'), (12, 'output'), (13, 'rate'), (14, 'set'),
(15, 'signal'), (16, 'sparse'), (17, 'spatial'), (18, 'system'),
(19, 'training');

CREATE TABLE my_training
(
    docid INT4,
    wordid INT4,
    count INT4
)
m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (docid)');

INSERT INTO my_training VALUES
(0, 0, 2),(0, 3, 2),(0, 5, 1),(0, 7, 1),(0, 8, 1),(0, 9, 1),(0, 11, 1),(0, 13, 1),
(1, 0, 1),(1, 3, 1),(1, 4, 1),(1, 5, 1),(1, 6, 1),(1, 7, 1),(1, 10, 1),(1, 14, 1),(1, 17, 1),(1, 18, 1),
(2, 4, 2),(2, 5, 1),(2, 6, 2),(2, 12, 1),(2, 13, 1),(2, 15, 1),(2, 18, 2),
(3, 0, 1),(3, 1, 2),(3, 12, 3),(3, 16, 1),(3, 17, 2),(3, 19, 1),
(4, 1, 1),(4, 2, 1),(4, 3, 1),(4, 5, 1),(4, 6, 1),(4, 10, 1),(4, 11, 1),(4, 14, 1),(4, 18, 1),(4, 19, 1),
(5, 0, 1),(5, 2, 1),(5, 5, 1),(5, 7, 1),(5, 10, 1),(5, 12, 1),(5, 16, 1),(5, 18, 1),(5, 19, 2),
(6, 1, 1),(6, 3, 1),(6, 12, 2),(6, 13, 1),(6, 14, 2),(6, 15, 1),(6, 16, 1),(6, 17, 1),
(7, 0, 1),(7, 2, 1),(7, 4, 1),(7, 5, 1),(7, 7, 2),(7, 8, 1),(7, 11, 1),(7, 14, 1),(7, 16, 1),
(8, 2, 1),(8, 4, 4),(8, 6, 2),(8, 11, 1),(8, 15, 1),(8, 18, 1),
(9, 0, 1),(9, 1, 1),(9, 4, 1),(9, 9, 2),(9, 12, 2),(9, 15, 1),(9, 18, 1),(9, 19, 1);

CREATE TABLE my_testing
(
    docid INT4,
    wordid INT4,
    count INT4
)
m4_ifdef(`__GREENPLUM__',`DISTRIBUTED BY (docid)');

INSERT INTO my_testing VALUES
(0, 0, 2),(0, 8, 1),(0, 9, 1),(0, 10, 1),(0, 12, 1),(0, 15, 2),(0, 18, 1),(0, 19, 1),
(1, 0, 1),(1, 2, 1),(1, 5, 1),(1, 7, 1),(1, 12, 2),(1, 13, 1),(1, 16, 1),(1, 17, 1),(1, 18, 1),
(2, 0, 1),(2, 1, 1),(2, 2, 1),(2, 3, 1),(2, 4, 1),(2, 5, 1),(2, 6, 1),(2, 12, 1),(2, 14, 1),(2, 18, 1),
(3, 2, 2),(3, 6, 2),(3, 7, 1),(3, 9, 1),(3, 11, 2),(3, 14, 1),(3, 15, 1),
(4, 1, 1),(4, 2, 2),(4, 3, 1),(4, 5, 2),(4, 6, 1),(4, 11, 1),(4, 18, 2);
\endcode

- To perform training, we call the lda_train() function with the
  appropriate parameters. Here is an example.
\code
SELECT MADLib.lda_train(
    'my_training', 'my_model', 'my_outdata', 20, 5, 10, 5, 0.01);
\endcode

  After a successful run of the lda_train() function, two tables will be
  generated, one storing the learned model and the other storing the output
  data.
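
  To take a quick look at the learned model, we can query the model table
  directly (a simple sanity check; the columns follow the
  <em>model_table</em> schema described above):
\code
SELECT voc_size, topic_num, alpha, beta FROM my_model;
\endcode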

  To get more detailed information about the learned model, we can run the
  following commands:

    - The topic description by top-k words
    \code
    SELECT * FROM MADLib.lda_get_topic_desc(
        'my_model', 'my_vocab', 'my_topic_desc', 15);
    \endcode

    - The per-topic word counts
    \code
    SELECT MADLib.lda_get_topic_word_count(
        'my_model', 'my_topic_word_count');
    \endcode

    - The per-word topic counts
    \code
    SELECT MADLib.lda_get_word_topic_count(
        'my_model', 'my_word_topic_count');
    \endcode

  To get the topic counts and the topic assignments for each document, we
  can run the following commands:

    - The per-document topic counts:
    \code
    SELECT
        docid, topic_count
    FROM my_outdata;
    \endcode

    - The per-document topic assignments:
    \code
    SELECT
        docid, words, counts, topic_assignment
    FROM my_outdata;
    \endcode
    By scanning \c words, \c counts, and \c topic_assignment together, we can
    get the topic assignment for each word in a document.

- To use a learned LDA model for prediction (i.e. to label new documents), we
  can use the following command:
\code
SELECT MADLib.lda_predict(
    'my_testing', 'my_model', 'my_pred');
\endcode

  After a successful run of the lda_predict() function, the prediction
  results will be generated and stored in <em>my_pred</em>. This table has
  the same schema as the <em>my_outdata</em> table generated by the
  lda_train() function.

  To get the topic counts and the topic assignments for each document, we
  can run the following commands:

    - The per-document topic counts:
    \code
    SELECT
        docid, topic_count
    FROM my_pred;
    \endcode

    - The per-document topic assignments:
    \code
    SELECT
        docid, words, counts, topic_assignment
    FROM my_pred;
    \endcode
    By scanning \c words, \c counts, and \c topic_assignment together, we can
    get the topic assignment for each word in a document.

- To compute the perplexity, we can use the following command:
\code
SELECT MADLib.lda_get_perplexity(
    'my_model', 'my_pred');
\endcode

@literature

[1] D.M. Blei, A.Y. Ng, and M.I. Jordan, <em>Latent Dirichlet Allocation</em>,
    Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[2] T. Griffiths and M. Steyvers, <em>Finding scientific topics</em>, PNAS,
    vol. 101, pp. 5228-5235, 2004.

[3] Y. Wang, H. Bai, M. Stanton, W-Y. Chen, and E.Y. Chang, <em>PLDA: Parallel
    Latent Dirichlet Allocation for Large-scale Applications</em>, AAIM, 2009.

[4] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

[5] J. Chang, <em>Collapsed Gibbs sampling methods for topic models</em>,
    R manual, 2010.

@sa File lda.sql_in documenting the SQL functions.
*/

-- UDT for summarizing a UDF call
DROP TYPE IF EXISTS MADLIB_SCHEMA.lda_result;
CREATE TYPE MADLIB_SCHEMA.lda_result AS
(
    output_table    TEXT,
    description     TEXT
);

/**
 * @brief This UDF provides an entry point for the LDA training process.
 * @param data_table        Table storing the training dataset; each row is
 *                          in the form of <docid, wordid, count>, where
 *                          docid, wordid, and count are all non-negative
 *                          integers.
 * @param model_table       Table storing the learned model (voc_size,
 *                          topic_num, alpha, beta, per-word topic counts,
 *                          and corpus-level topic counts)
 * @param output_data_table Table storing the output data in the form of
 *                          <docid, wordcount, words, counts, topic_count,
 *                          topic_assignment>
 * @param voc_size          Size of the vocabulary (note that the wordids
 *                          must be contiguous integers from 0 to
 *                          voc_size - 1; a data validation routine will be
 *                          called to validate the dataset)
 * @param topic_num         Number of topics (e.g. 100)
 * @param iter_num          Number of iterations (e.g. 60)
 * @param alpha             Dirichlet parameter for the per-document topic
 *                          multinomial (e.g. 50/topic_num)
 * @param beta              Dirichlet parameter for the per-topic word
 *                          multinomial (e.g. 0.01)
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_train
(
    data_table          TEXT,
    model_table         TEXT,
    output_data_table   TEXT,
    voc_size            INT4,
    topic_num           INT4,
    iter_num            INT4,
    alpha               FLOAT8,
    beta                FLOAT8
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.lda_train(
        schema_madlib, data_table, model_table, output_data_table, voc_size,
        topic_num, iter_num, alpha, beta
    )
    return [[model_table, 'model table'],
            [output_data_table, 'output data table']]
$$ LANGUAGE PLPYTHONU STRICT;


/**
 * @brief This UDF provides an entry point for the LDA prediction process.
 * @param data_table    Table storing the testing dataset; each row is in
 *                      the form of <docid, wordid, count>, where docid,
 *                      wordid, and count are all non-negative integers.
 * @param model_table   Table storing the learned model
 * @param output_table  Table storing per-document topic counts and topic
 *                      assignments
 * @note default iter_num = 20
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_predict
(
    data_table      TEXT,
    model_table     TEXT,
    output_table    TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.lda_predict(schema_madlib, data_table, model_table, output_table)
    return [[
        output_table,
        'per-doc topic distribution and per-word topic assignments']]
$$ LANGUAGE PLPYTHONU STRICT;

/**
 * @brief An overloaded version which allows users to specify iter_num.
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_predict
(
    data_table      TEXT,
    model_table     TEXT,
    output_table    TEXT,
    iter_num        INT4
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.lda_predict(
        schema_madlib, data_table, model_table, output_table, iter_num)
    return [[
        output_table,
        'per-doc topic distribution and per-word topic assignments']]
$$ LANGUAGE PLPYTHONU STRICT;
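
-- Usage sketch for the overloaded version (assumes the my_testing and
-- my_model tables from the example in the module documentation above; the
-- iteration count 30 is purely illustrative):
--
--   SELECT MADLIB_SCHEMA.lda_predict('my_testing', 'my_model', 'my_pred', 30);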

/**
 * @brief This UDF computes the per-topic word counts.
 * @param model_table   The model table generated by the training process
 * @param output_table  The output table storing the per-topic word counts
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_topic_word_count
(
    model_table     TEXT,
    output_table    TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.get_topic_word_count(schema_madlib, model_table, output_table)
    return [[output_table, 'per-topic word counts']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF computes the per-word topic counts.
 * @param model_table   The model table generated by the training process
 * @param output_table  The output table storing the per-word topic counts
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_word_topic_count
(
    model_table     TEXT,
    output_table    TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.get_word_topic_count(schema_madlib, model_table, output_table)
    return [[output_table, 'per-word topic counts']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF gets the description of each topic (top-k words).
 * @param model_table   The model table generated by the training process
 * @param vocab_table   The vocabulary table (<wordid, word>)
 * @param desc_table    The output table storing the per-topic description
 * @param top_k         The number of top words in each topic description
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_topic_desc
(
    model_table TEXT,
    vocab_table TEXT,
    desc_table  TEXT,
    top_k       INT4
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.get_topic_desc(
        schema_madlib, model_table, vocab_table, desc_table, top_k)
    return [[
        desc_table,
        """topic description, use "ORDER BY topicid, prob DESC" to check the
        results"""]]
$$ LANGUAGE plpythonu STRICT;
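
-- Usage sketch (assumes the my_topic_desc table created in the module
-- documentation example; the ORDER BY follows the hint returned by the
-- function):
--
--   SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;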

/**
 * @brief This UDF assigns topics to the words in a document randomly.
 * @param word_count    The number of words in the document
 * @param topic_num     The number of topics (specified by the user)
 * @return The topic counts and topic assignments
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_random_assign
(
    word_count  INT4,
    topic_num   INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_random_assign'
LANGUAGE C STRICT;

/**
 * @brief This UDF learns the topics of the words in a document and is the
 * main step of a Gibbs sampling iteration. The model parameter (including
 * the per-word topic counts and corpus-level topic counts) is passed to this
 * function in the first call and then transferred to subsequent calls
 * through fcinfo->flinfo->fn_extra to allow immediate updates.
 * @param words     The set of unique words in the document
 * @param counts    The counts of each unique word in the document
 *                  (sum(counts) = word_count)
 * @param doc_topic The current per-document topic counts and topic
 *                  assignments
 * @param model     The current model (including the per-word topic counts
 *                  and the corpus-level topic counts)
 * @param alpha     The Dirichlet parameter for the per-document topic
 *                  multinomial
 * @param beta      The Dirichlet parameter for the per-topic word
 *                  multinomial
 * @param voc_size  The size of the vocabulary
 * @param topic_num The number of topics
 * @param iter_num  The number of Gibbs sampling iterations to run
 * @return The learned topic counts and topic assignments
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_gibbs_sample
(
    words       INT4[],
    counts      INT4[],
    doc_topic   INT4[],
    model       INT4[],
    alpha       FLOAT8,
    beta        FLOAT8,
    voc_size    INT4,
    topic_num   INT4,
    iter_num    INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_gibbs_sample'
LANGUAGE C;

/**
 * @brief This UDF is the sfunc of the aggregator computing the topic counts
 * for each word and the topic count in the whole corpus. It scans the topic
 * assignments in a document and updates the topic counts.
 * @param state             The topic counts
 * @param words             The unique words in the document
 * @param counts            The counts of each unique word in the document
 *                          (sum(counts) = word_count)
 * @param topic_assignment  The topic assignments in the document
 * @param voc_size          The size of the vocabulary
 * @param topic_num         The number of topics
 * @return The updated state
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_count_topic_sfunc
(
    state               INT4[],
    words               INT4[],
    counts              INT4[],
    topic_assignment    INT4[],
    voc_size            INT4,
    topic_num           INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_count_topic_sfunc'
LANGUAGE C;

/**
 * @brief This UDF is the prefunc of the aggregator computing the per-word
 * topic counts.
 * @param state1 The local word topic counts
 * @param state2 The local word topic counts
 * @return The element-wise sum of the two local states
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_count_topic_prefunc
(
    state1  INT4[],
    state2  INT4[]
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_count_topic_prefunc'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief This UDA computes the word topic counts by scanning and summing
 * up the topic assignments in each document.
 * @param words             The unique words in the document
 * @param counts            The counts of each unique word in the document
 * @param topic_assignment  The topic assignments in the document
 * @param voc_size          The size of the vocabulary
 * @param topic_num         The number of topics
 * @return The word topic counts (a 1-D array encoding a 2-D array)
 **/
DROP AGGREGATE IF EXISTS
MADLIB_SCHEMA.__lda_count_topic_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4,
    INT4
);
CREATE AGGREGATE
MADLIB_SCHEMA.__lda_count_topic_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4,
    INT4
)
(
    stype = INT4[],
    sfunc = MADLIB_SCHEMA.__lda_count_topic_sfunc
    m4_ifdef(
        `GREENPLUM',
        `, prefunc = MADLIB_SCHEMA.__lda_count_topic_prefunc'
    )
);

/**
 * @brief This UDF computes the perplexity given the output data table and
 * the model table.
 * @param model_table       The model table generated by lda_train
 * @param output_data_table The output data table generated by lda_predict
 * @return The perplexity
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.lda_get_perplexity
(
    model_table         TEXT,
    output_data_table   TEXT
)
RETURNS FLOAT8 AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    return lda.get_perplexity(
        schema_madlib, model_table, output_data_table)
$$ LANGUAGE plpythonu STRICT;
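
-- For reference, perplexity is conventionally defined (see e.g. [1]) as
--
--   perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
--
-- where p(w_d) is the model's likelihood of document d and N_d is the number
-- of words in document d; lower values indicate a better fit.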

/**
 * @brief This UDF is the sfunc of the aggregator computing the perplexity.
 * @param state     The cached model plus perplexity
 * @param words     The unique words in the document
 * @param counts    The counts of each unique word in the document
 * @param doc_topic The topic counts in the document
 * @param model     The learned model
 * @param alpha     The Dirichlet parameter for the per-document topic
 *                  multinomial
 * @param beta      The Dirichlet parameter for the per-topic word
 *                  multinomial
 * @param voc_size  The size of the vocabulary
 * @param topic_num The number of topics
 * @return The updated state
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_perplexity_sfunc
(
    state       INT4[],
    words       INT4[],
    counts      INT4[],
    doc_topic   INT4[],
    model       INT4[][],
    alpha       FLOAT8,
    beta        FLOAT8,
    voc_size    INT4,
    topic_num   INT4
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_perplexity_sfunc'
LANGUAGE C IMMUTABLE;

/**
 * @brief This UDF is the prefunc of the aggregator computing the perplexity.
 * @param state1 The local state
 * @param state2 The local state
 * @return The merged state
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_perplexity_prefunc
(
    state1  INT4[],
    state2  INT4[]
)
RETURNS INT4[]
AS 'MODULE_PATHNAME', 'lda_perplexity_prefunc'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief This UDF is the finalfunc of the aggregator computing the
 * perplexity.
 * @param state The merged state
 * @return The perplexity
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_perplexity_ffunc
(
    state INT4[]
)
RETURNS FLOAT8
AS 'MODULE_PATHNAME', 'lda_perplexity_ffunc'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief This aggregator computes the perplexity.
 * @param words     The unique words in the document
 * @param counts    The counts of each unique word in the document
 * @param doc_topic The topic counts in the document
 * @param model     The learned model
 * @param alpha     The Dirichlet parameter for the per-document topic
 *                  multinomial
 * @param beta      The Dirichlet parameter for the per-topic word
 *                  multinomial
 * @param voc_size  The size of the vocabulary
 * @param topic_num The number of topics
 * @return The updated perplexity
 **/
DROP AGGREGATE IF EXISTS
MADLIB_SCHEMA.__lda_perplexity_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4[],
    FLOAT8,
    FLOAT8,
    INT4,
    INT4
);
CREATE AGGREGATE
MADLIB_SCHEMA.__lda_perplexity_agg
(
    INT4[],
    INT4[],
    INT4[],
    INT4[],
    FLOAT8,
    FLOAT8,
    INT4,
    INT4
)
(
    stype = INT4[],
    sfunc = MADLIB_SCHEMA.__lda_perplexity_sfunc,
    finalfunc = MADLIB_SCHEMA.__lda_perplexity_ffunc
    m4_ifdef(
        `GREENPLUM',
        `, prefunc = MADLIB_SCHEMA.__lda_perplexity_prefunc'
    )
);

/**
 * @brief Unnest a 2-D array into a set of 1-D arrays.
 * @param arr The 2-D array to be unnested
 * @return The unnested 1-D arrays
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_unnest
(
    arr INT4[][]
)
RETURNS SETOF INT4[]
AS 'MODULE_PATHNAME', 'lda_unnest'
LANGUAGE C IMMUTABLE STRICT;
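
-- Usage sketch (hypothetical input):
--
--   SELECT MADLIB_SCHEMA.__lda_util_unnest(ARRAY[[1, 2, 3], [4, 5, 6]]);
--
-- would return two rows, {1,2,3} and {4,5,6}, one per row of the input array.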

/**
 * @brief Transpose a 2-D array.
 * @param matrix The input 2-D array
 * @return The transposed array
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_transpose
(
    matrix INT4[][]
)
RETURNS INT4[][]
AS 'MODULE_PATHNAME', 'lda_transpose'
LANGUAGE C IMMUTABLE STRICT;

/**
 * @brief L1 normalization with smoothing.
 * @param arr       The array to be normalized
 * @param smooth    The smoothing parameter
 * @return The normalized vector
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_norm_with_smoothing
(
    arr     FLOAT8[],
    smooth  FLOAT8
)
RETURNS FLOAT8[] AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    return lda.l1_norm_with_smoothing(arr, smooth)
$$ LANGUAGE PLPYTHONU STRICT;

/**
 * @brief This UDF returns the indexes of the elements in sorted order.
 * @param arr The array to be sorted
 * @return The indexes of the elements
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_index_sort
(
    arr FLOAT8[]
)
RETURNS INT4[] AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    return lda.index_sort(arr)
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF checks the vocabulary and converts non-contiguous wordids
 * into contiguous integers ranging from 0 to voc_size - 1.
 * @param vocab_table           The vocabulary table in the form of
 *                              <wordid::int4, word::text>
 * @param output_vocab_table    The regularized vocabulary table
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_norm_vocab
(
    vocab_table         TEXT,
    output_vocab_table  TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.norm_vocab(vocab_table, output_vocab_table)
    return [[output_vocab_table, 'normalized vocabulary table']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF converts the data table according to the normalized
 * vocabulary; all rows with non-positive count values will be removed.
 * @param data_table        The data table to be normalized
 * @param norm_vocab_table  The normalized vocabulary table
 * @param output_data_table The normalized data table
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_norm_dataset
(
    data_table          TEXT,
    norm_vocab_table    TEXT,
    output_data_table   TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.norm_dataset(data_table, norm_vocab_table, output_data_table)
    return [[output_data_table, 'normalized data table']]
$$ LANGUAGE plpythonu STRICT;

/**
 * @brief This UDF extracts the list of wordids from the data table and joins
 * it with the vocabulary table to get the list of common wordids. It then
 * normalizes the vocabulary based on the common wordids, and finally
 * normalizes the data table based on the normalized vocabulary.
 * @param data_table            The data table to be normalized
 * @param vocab_table           The vocabulary table to be normalized
 * @param output_data_table     The normalized data table
 * @param output_vocab_table    The normalized vocabulary table
 **/
CREATE OR REPLACE FUNCTION
MADLIB_SCHEMA.__lda_util_conorm_data
(
    data_table          TEXT,
    vocab_table         TEXT,
    output_data_table   TEXT,
    output_vocab_table  TEXT
)
RETURNS SETOF MADLIB_SCHEMA.lda_result AS $$
    PythonFunctionBodyOnly(`lda', `lda')
    lda.conorm_data(
        data_table, vocab_table, output_data_table, output_vocab_table)
    return [[output_data_table, 'normalized data table'],
            [output_vocab_table, 'normalized vocab table']]
$$ LANGUAGE plpythonu STRICT;