Latent Dirichlet Allocation
About:

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as natural-language text, and it has received considerable attention in recent years. The model is quite versatile, having found uses in problems like automated topic discovery, collaborative filtering, and document classification.

The LDA model posits that each document is associated with a mixture of various topics (e.g. a document is related to Topic 1 with probability 0.7, and Topic 2 with probability 0.3), and that each word in the document is attributable to one of the document's topics. There is a (symmetric) Dirichlet prior with parameter \( \alpha \) on each document's topic mixture. In addition, there is another (symmetric) Dirichlet prior with parameter \( \beta \) on the distribution of words for each topic.

The following generative process then defines a distribution over a corpus of documents:

  • For each topic \( k \), draw a per-topic word distribution \( \phi_k \) from the Dirichlet prior with parameter \( \beta \).
  • For each document \( d \), draw a topic mixture \( \theta_d \) from the Dirichlet prior with parameter \( \alpha \).
  • For each word position in document \( d \), first draw a topic \( z \) from the multinomial distribution \( \theta_d \), and then draw the word itself from the multinomial word distribution \( \phi_z \) associated with topic \( z \).

In practice, only the words in each document are observable. The topic mixture of each document and the topic assignment of each word are latent variables that must be inferred from the observed words; this is what is known as the inference problem for LDA. Exact inference is intractable, but several approximate inference algorithms for LDA have been developed. The simple and effective collapsed Gibbs sampling algorithm described in Griffiths and Steyvers [2] is currently the algorithm of choice.
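
For reference, the collapsed Gibbs sampler of [2] repeatedly resamples the topic assignment of each word token from its conditional distribution given all other assignments. A sketch of the update for the symmetric priors above (this is the textbook form from [2], not necessarily the exact expression used internally), with \( V \) denoting the vocabulary size, is

    \( P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{(w_i)}_{k,-i} + \beta}{n^{(\cdot)}_{k,-i} + V\beta} \left( n^{(k)}_{d_i,-i} + \alpha \right) \)

where \( n^{(w_i)}_{k,-i} \) is the number of times word \( w_i \) is assigned to topic \( k \), \( n^{(\cdot)}_{k,-i} \) is the total number of tokens assigned to topic \( k \), and \( n^{(k)}_{d_i,-i} \) is the number of tokens in document \( d_i \) assigned to topic \( k \), in each case excluding the token currently being resampled.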

This implementation provides a parallel, scalable, in-database solution for LDA based on Gibbs sampling. Unlike implementations based on MPI or Hadoop MapReduce, it builds on shared-nothing MPP databases and enables high-performance in-database analytics.

Input:
The corpus/dataset to be analyzed is expected to be of the following form:
{TABLE|VIEW} data_table (
    docid INTEGER,
    wordid INTEGER,
    count INTEGER
)
where docid refers to the document ID, wordid is the word ID (the index of the word in the vocabulary), and count is the number of occurrences of the word in the document.

The vocabulary/dictionary that indexes all the words found in the corpus is of the following form:

{TABLE|VIEW} vocab_table (
    wordid INTEGER,
    word TEXT
)

where wordid refers to the word ID (the index of a word in the vocabulary) and word is the actual word.
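
For illustration only, a tiny corpus in this format could be created as follows (the table names my_vocab and my_corpus are hypothetical; any names may be used):

    -- Hypothetical vocabulary of four words.
    CREATE TABLE my_vocab (wordid INTEGER, word TEXT);
    INSERT INTO my_vocab VALUES
        (0, 'database'), (1, 'query'), (2, 'topic'), (3, 'model');

    -- Hypothetical corpus of two documents in (docid, wordid, count) form.
    CREATE TABLE my_corpus (docid INTEGER, wordid INTEGER, count INTEGER);
    INSERT INTO my_corpus VALUES
        (0, 0, 2), (0, 1, 1),            -- document 0: 'database' x2, 'query' x1
        (1, 2, 3), (1, 3, 1), (1, 0, 1); -- document 1: 'topic' x3, 'model' x1, 'database' x1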

Usage:
  • The training (i.e. topic inference) can be done with the following function:

            SELECT lda_train(
                'data_table',
                'model_table',
                'output_data_table', 
                voc_size, 
                topic_num,
                iter_num, 
                alpha, 
                beta)
        

    Here voc_size is the size of the vocabulary (the number of distinct word IDs), topic_num is the desired number of topics, iter_num is the number of Gibbs sampling iterations to run, and alpha and beta are the parameters of the (symmetric) Dirichlet priors described above.

    This function stores the resulting model in model_table. The table has only 1 row and is in the following form:

    {TABLE} model_table (
            voc_size INTEGER,
            topic_num INTEGER,
            alpha FLOAT,
            beta FLOAT,
            model INTEGER[][])
        

    This function also stores the topic counts and the topic assignments in each document in output_data_table. The table is in the following form:

    {TABLE} output_data_table (
            docid INTEGER,
            wordcount INTEGER,
            words INTEGER[],
            counts INTEGER[],
            topic_count INTEGER[],
            topic_assignment INTEGER[])
        
  • The prediction (i.e. labelling of test documents using a learned LDA model) can be done with the following function:

            SELECT lda_predict(
                'data_table',
                'model_table',
                'output_table');
        

    This function stores the prediction results in output_table. Each row of the table stores the topic distribution and the topic assignments for a document in the dataset. The table is in the following form:

    {TABLE} output_table (
            docid INTEGER,
            wordcount INTEGER,
            words INTEGER[],
            counts INTEGER[],
            topic_count INTEGER[],
            topic_assignment INTEGER[])
        
  • This module also provides a function for computing the perplexity, a standard measure of how well a learned model fits a dataset:
            SELECT lda_get_perplexity(
                'model_table',
                'output_data_table');
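
    For reference (this is the standard definition, see e.g. [1], and not a statement about the exact formula used internally), the perplexity of a dataset under a learned model is

        \( \mbox{perplexity} = \exp\left\{ -\frac{\sum_d \log p(\mathbf{w}_d)}{\sum_d N_d} \right\} \)

    where \( \mathbf{w}_d \) are the words of document \( d \) and \( N_d \) is the document length; lower values indicate a better fit.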
        
Implementation Notes:
The input format for this module is common to many machine learning packages written in various languages, which allows users to generate datasets with existing document preprocessing tools or to import existing datasets conveniently. Internally, the input data are validated and then converted to the following format for efficiency:
{TABLE} __internal_data_table__ (
        docid INTEGER,
        wordcount INTEGER,
        words INTEGER[],
        counts INTEGER[])
    
where docid is the document ID, wordcount is the total number of words (tokens) in the document, words is the list of distinct words in the document, and counts is the list of the number of occurrences of each distinct word in the document. The conversion can be done easily with aggregation functions, as sketched below.
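
For illustration (a minimal sketch rather than the module's actual internal code; my_corpus is the hypothetical input table from the Input section above), the conversion amounts to an aggregation such as:

    -- Collapse the (docid, wordid, count) rows of each document into arrays.
    SELECT docid,
           sum(count)        AS wordcount,  -- total number of word occurrences
           array_agg(wordid) AS words,      -- distinct words in the document
           array_agg(count)  AS counts      -- occurrence count of each word
    FROM my_corpus
    GROUP BY docid;
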
Examples:

We now give a usage example.
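
The table names and parameter values below are illustrative only (my_corpus is the hypothetical input table from the Input section, with a 4-word vocabulary); adjust them to your own data.

    -- Train an LDA model with 2 topics over a vocabulary of 4 words,
    -- running 50 Gibbs sampling iterations with alpha = 0.1 and beta = 0.01.
    SELECT lda_train(
        'my_corpus',   -- input corpus in (docid, wordid, count) form
        'my_model',    -- output: learned model
        'my_outdata',  -- output: per-document topic counts and assignments
        4,             -- voc_size
        2,             -- topic_num
        50,            -- iter_num
        0.1,           -- alpha
        0.01);         -- beta

    -- Label a dataset (here, the training corpus itself) with the learned model.
    SELECT lda_predict(
        'my_corpus',
        'my_model',
        'my_prediction');

    -- Compute the perplexity of the learned model on the converted dataset.
    SELECT lda_get_perplexity(
        'my_model',
        'my_outdata');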

Literature:

[1] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[2] T. Griffiths and M. Steyvers, Finding scientific topics, PNAS, vol. 101, pp. 5228-5235, 2004.

[3] Y. Wang, H. Bai, M. Stanton, W-Y. Chen, and E.Y. Chang, PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications, AAIM, 2009.

[4] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

[5] J. Chang, Collapsed Gibbs sampling methods for topic models, R manual, 2010.

See Also
File lda.sql_in documenting the SQL functions.