User Documentation
Conditional Random Field
About:
A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.

Specifically, a linear-chain CRF is a distribution defined by

\[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)} \,. \]

where

  • \( F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i, y_{i-1}, x_i) \) is a global feature function: the sum of a local feature function \( f_m \) over all positions \( i \) of a sequence \( \boldsymbol x \) of length \( n \),
  • \( \lambda_m \) is the weight of the \( m \)-th feature, and
  • \( Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')} \) is an instance-specific normalizer (partition function).

A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).

The log-likelihood is defined as

\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]

and a zero of its gradient

\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]

is sought, since the likelihood is maximized exactly when the empirical average of the global feature vector equals its expectation under the model. The MADlib implementation uses limited-memory BFGS (L-BFGS) [6, 7], a limited-memory variant of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method for unconstrained optimization.
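
As a sketch of a quasi-Newton ascent step (following Nocedal [6]; the notation here is illustrative, not MADlib's internal API), the weights are updated using an approximation \( H_t \) to the inverse Hessian:

\[ \lambda^{(t+1)} = \lambda^{(t)} + \eta_t H_t \nabla \ell_{\lambda^{(t)}}, \qquad s_t = \lambda^{(t+1)} - \lambda^{(t)}, \quad u_t = \nabla \ell_{\lambda^{(t+1)}} - \nabla \ell_{\lambda^{(t)}} \]

Rather than storing \( H_t \) (which is \( M \times M \)), L-BFGS reconstructs the product \( H_t \nabla \ell \) from only the most recent few pairs \( (s_j, u_j) \), keeping memory linear in the number of features \( M \).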

\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is computed using a variant of the forward-backward algorithm [1]:

\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]

\[ Z_\lambda(x) = \alpha_n \cdot \boldsymbol{1}^T \]

where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by

\[ \alpha_i = \begin{cases} \alpha_{i-1} M_i, & 0 < i \le n \\ \boldsymbol{1}, & i = 0 \end{cases} \]

\[ \beta_i^T = \begin{cases} M_{i+1} \beta_{i+1}^T, & 1 \le i < n \\ \boldsymbol{1}, & i = n \end{cases} \]
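
Here \( M_i \) is the \( |Y| \times |Y| \) matrix of per-position transition scores; following Sha and Pereira [1], its entry for moving from label \( y' \) to label \( y \) at position \( i \) is

\[ M_i[y', y] = \exp{\sum_{m=1}^M \lambda_m f_m(y, y', x_i)} \]

and \( f_i * M_i \) above denotes the component-wise product of \( M_i \) with the matrix of local feature values at position \( i \) for the feature whose expectation is being computed.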

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:

\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]

\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]
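
The penalty term corresponds to placing an independent Gaussian prior \( \lambda_m \sim \mathcal{N}(0, \sigma^2) \) on each weight, since, up to an additive constant,

\[ \log p(\lambda) = -\frac{\lVert \lambda \rVert^2}{2\sigma^2} + \mathrm{const} \,, \]

so maximizing \( \ell_\lambda^\prime \) amounts to maximum a posteriori (MAP) estimation, with \( \sigma \) a tuning parameter controlling the strength of the regularization.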

Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented (the name prefixes in parentheses are the ones that appear in the examples below):

  • Edge features (E.): encode the transition from the previous label to the current label.
  • Regex features (R_): fire when the current token matches a regular expression from the regex table.
  • Token features (W_): encode the association between a specific word and a label.
  • Unknown-token features (U): fire for tokens that appear rarely or not at all in the training set.
  • Start features: fire when the current token is the first token of a sequence.
  • End features (End.): fire when the current token is the last token of a sequence.

A Viterbi implementation is also provided to compute the best label sequence and its conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \) [4, 5].
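
As a sketch (this is the standard Viterbi dynamic program [5], stated in the matrix notation above rather than as MADlib's internal implementation), the best sequence maximizes the product of per-position transition scores:

\[ \delta_0(y) = 1, \qquad \delta_i(y) = \max_{y'} \delta_{i-1}(y') \, M_i[y', y], \qquad \Pr(\hat{\boldsymbol y} \mid \boldsymbol x) = \frac{\max_y \delta_n(y)}{Z_\lambda(x)} \]

with \( \hat{\boldsymbol y} \) recovered by backtracking the maximizing choices at each position.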

For a full example of how to use the MADlib CRF modules for a text analytics application, see the "Examples" section below.

Input:
  • User-provided input:
    The user is expected to provide at least the label table, the regular expression table, and the segment table:
    {TABLE|VIEW} labelTableName (
        ...
        id INTEGER,
        label TEXT,
        ...
    )
    where id is a unique ID for the label and label is the label name.
    {TABLE|VIEW} regexTableName (
        ...
        pattern TEXT,
        name TEXT,
        ...
    )
    where pattern is a regular expression pattern (e.g. '^.+ing$') and name is a name for the regular expression pattern (e.g. 'endsWithIng').
    {TABLE|VIEW} segmentTableName (
        ...
        start_pos INTEGER,
        doc_id INTEGER,
        seg_text TEXT,
        label INTEGER,
        max_pos INTEGER,
        ...
    )
    where start_pos is the position of the word in the sequence, doc_id is a unique ID for the sequence, seg_text is the word, label is the label for the word, and max_pos is the length of the sequence.
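
    For instance, the three user-provided tables could be created as follows (a minimal sketch; the table names match the examples below, and the sample rows are illustrative):

    -- label table: one row per label, keyed by a unique ID
    CREATE TABLE crf_label (id integer, label text);
    INSERT INTO crf_label VALUES (1, 'CD'), (13, 'NNP');

    -- regex table: named regular-expression patterns for surface features
    CREATE TABLE crf_regex (pattern text, name text);
    INSERT INTO crf_regex VALUES ('^.+ing$', 'endsWithIng');

    -- segment table: one row per (document, token position) pair
    CREATE TABLE train_segmenttbl (start_pos integer, doc_id integer,
                                   seg_text text, label integer, max_pos integer);
    INSERT INTO train_segmenttbl VALUES (8, 1, 'alliance', 11, 26);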

The set of features used for training is expected to have the following form (this table can also be generated by crf_train_fgen):

{TABLE|VIEW} featureSetName (
    ...
    f_index INTEGER,
    f_name TEXT,
    feature_labels INTEGER[],
    ...
)

where f_index is a unique ID for the feature, f_name is the feature name, and feature_labels is a two-element array {previous label ID, current label ID}; for features that do not depend on the previous label, the first element is -1 (as in the examples below).

The empty feature weight table (which will be populated after training) is expected to be of the following form:

{TABLE|VIEW} featureWeightsName (
    ...
    f_index INTEGER,
    f_name TEXT,
    previous_label INTEGER,
    label INTEGER,
    weight FLOAT8,
    ...
)
Usage:
  • Get number of iterations and weights for features:
    SELECT * FROM lincrf(
        'featureTableName', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', tag_size, 'featureSetName', 'featureWeightsName'
        [, maxNumberOfIterations ]
    );
    where tag_size is the total number of labels.

Output:

 lincrf
-----------------
 [number of iterations]

featureWeightsName:

 id |      name      | prev_label_id | label_id |      weight       
----+----------------+---------------+----------+-------------------
Examples:
  1. Load the label table, the regular expressions table, and the training segment table:
    sql> SELECT * FROM crf_label;
     id | label 
    ----+-------
      1 | CD
     13 | NNP
     15 | PDT
     17 | PRP
     29 | VBN
     31 | VBZ
     33 | WP
     35 | WRB
    ...
    
    sql> SELECT * from crf_regex;
        pattern    |         name         
    ---------------+----------------------
     ^.+ing$       | endsWithIng
     ^[A-Z][a-z]+$ | InitCapital
     ^[A-Z]+$      | isAllCapital
     ^.*[0-9]+.*$  | containsDigit
    ...
    
    sql> SELECT * from train_segmenttbl;
     start_pos | doc_id |  seg_text  | label | max_pos
    -----------+--------+------------+-------+---------
             8 |      1 | alliance   |    11 |      26
            10 |      1 | Ford       |    13 |      26
            12 |      1 | that       |     5 |      26
            24 |      1 | likely     |     6 |      26
            26 |      1 | .          |    43 |      26
             8 |      2 | interest   |    11 |      10
            10 |      2 | .          |    43 |      10
             9 |      1 | after      |     5 |      26
            11 |      1 | concluded  |    27 |      26
            23 |      1 | the        |     2 |      26
            25 |      1 | return     |    11 |      26
             9 |      2 | later      |    19 |      10
    ...
    
  2. Create the (empty) dictionary table, feature table, and feature set:
    sql> CREATE TABLE crf_dictionary(token text,total integer);
    sql> CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]);
    sql> CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
    
  3. Generate the training features:
    sql> SELECT crf_train_fgen('train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl','train_featureset');
    
    sql> SELECT * from crf_dictionary;
       token    | total 
    ------------+-------
     talks      |     1
     that       |     1
     would      |     1
     alliance   |     1
     Saab       |     2
     cost       |     1
     after      |     1
     operations |     1
    ...
    
    sql> SELECT * from train_featuretbl;
     doc_id | f_size |            sparse_r           |             dense_m             |       sparse_m
    --------+--------+-------------------------------+---------------------------------+-----------------------
          2 |     87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
          1 |     87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}
    
    sql> SELECT * from train_featureset;
     f_index |    f_name     | feature 
    ---------+---------------+---------
           1 | R_endsWithED  | {-1,29}
          13 | W_outweigh    | {-1,26}
          29 | U             | {-1,5}
          31 | U             | {-1,29}
          33 | U             | {-1,12}
          35 | W_a           | {-1,2}
          37 | W_possible    | {-1,6}
          15 | W_signaled    | {-1,29}
          17 | End.          | {-1,43}
          49 | W_'s          | {-1,16}
          63 | W_acquire     | {-1,26}
          51 | E.            | {26,2}
          69 | E.            | {29,17}
          71 | E.            | {2,11}
          83 | W_the         | {-1,2}
          85 | E.            | {16,11}
           4 | W_return      | {-1,11}
    ...
    
    
  4. Create the (empty) feature weight table:
    sql> CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float);
    
  5. Train using linear CRF:
    sql> SELECT lincrf('train_featuretbl','sparse_r','dense_m','sparse_m','f_size',45, 'train_featureset','train_crf_feature', 20);
     lincrf 
    --------
         20
    
    sql> SELECT * from train_crf_feature;
     id |     name      | prev_label_id | label_id |      weight       
    ----+---------------+---------------+----------+-------------------
      1 | R_endsWithED  |            -1 |       29 |  1.54128249293937
     13 | W_outweigh    |            -1 |       26 |  1.70691232223653
     29 | U             |            -1 |        5 |  1.40708515869008
     31 | U             |            -1 |       29 | 0.830356200936407
     33 | U             |            -1 |       12 | 0.769587378281239
     35 | W_a           |            -1 |        2 |  2.68470625883726
     37 | W_possible    |            -1 |        6 |  3.41773107604468
     15 | W_signaled    |            -1 |       29 |  1.68187039165771
     17 | End.          |            -1 |       43 |  3.07687845517082
     49 | W_'s          |            -1 |       16 |  2.61430312229883
     63 | W_acquire     |            -1 |       26 |  1.67247047385797
     51 | E.            |            26 |        2 |   3.0114240119435
     69 | E.            |            29 |       17 |  2.82385531733866
     71 | E.            |             2 |       11 |  3.00970493772732
     83 | W_the         |            -1 |        2 |  2.58742315259326
    ...
    
    
  6. To find the best labels for a test set using the trained linear CRF model, repeat steps 1 and 2 for the test data and generate the test features, reusing the dictionary generated from the training set instead of creating a new one.
    sql> SELECT * from test_segmenttbl;
     start_pos | doc_id |  seg_text   | max_pos 
    -----------+--------+-------------+---------
             1 |      1 | collapse    |      22
            13 |      1 | ,           |      22
            15 |      1 | is          |      22
            17 |      1 | a           |      22
             4 |      1 | speculation |      22
             6 |      1 | Ford        |      22
            18 |      1 | defensive   |      22
            20 |      1 | with        |      22
    ...
    
    sql> SELECT crf_test_fgen('test_segmenttbl','crf_dictionary','crf_label','crf_regex','train_crf_feature','viterbi_mtbl','viterbi_rtbl');
    
  7. Calculate the best label sequence:
    sql> SELECT vcrf_label('test_segmenttbl','viterbi_mtbl','viterbi_rtbl','crf_label','extracted_best_labels');
    
    sql> SELECT * FROM extracted_best_labels;
     doc_id | start_pos |  seg_text   | label | id | prob  
    --------+-----------+-------------+-------+----+-------
          1 |         2 | Friday      | NNP   | 14 | 9e-06
          1 |         6 | Ford        | NNP   | 14 | 9e-06
          1 |        12 | Jaguar      | NNP   | 14 | 9e-06
          1 |         3 | prompted    | VBD   | 28 | 9e-06
          1 |         8 | intensify   | NN    | 12 | 9e-06
          1 |        14 | which       | NN    | 12 | 9e-06
          1 |        18 | defensive   | NN    | 12 | 9e-06
          1 |        21 | GM          | NN    | 12 | 9e-06
          1 |        22 | .           | .     | 44 | 9e-06
          1 |         1 | collapse    | CC    |  1 | 9e-06
          1 |         7 | would       | POS   | 17 | 9e-06
    ...
    
    (Note that this example was done on a trivial training and test data set.)
Literature:
[1] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields, http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf

[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field

[3] A. Jaiswal, S. Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/

[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html

[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm

[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782

[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html

See also:
Files crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)