User Documentation
Conditional Random Field
Warning
This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. The interface and implementation are subject to change.
About:
A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.

Specifically, a linear-chain CRF is a distribution defined by

\[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)} \,. \]

where
• \( F_m(\boldsymbol x, \boldsymbol y) \) is the \( m \)-th global feature function, the sum of a local feature function over all positions of the sequence \( \boldsymbol x \),
• \( \lambda_m \) is the weight associated with the \( m \)-th feature, and
• \( Z_\lambda(\boldsymbol x) \) is a normalization factor ensuring that the probabilities of all label sequences \( \boldsymbol y \) sum to one.

A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).

The log-likelihood is defined as

\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]

and the zero of its gradient

\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]

is found since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
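
Component-wise, setting the gradient to zero gives, for each weight \( \lambda_m \),

\[ \frac{\partial \ell_{\lambda}}{\partial \lambda_m} = \sum_k[F_m(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F_m(x_k,Y)]] = 0 \,, \]

so each iteration of the optimizer requires the model expectation of every global feature function.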

\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm:

\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]

\[ Z_\lambda(x) = \alpha_n \cdot 1^T \]

where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by

\[ \alpha_i = \begin{cases} \alpha_{i-1}M_i, & 0 < i \le n\\ 1, & i=0 \end{cases} \]

\[ \beta_i^T = \begin{cases} M_{i+1}\beta_{i+1}^T, & 1 \le i < n\\ 1, & i=n \end{cases} \]
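
Here \( M_i \) is the per-position transition matrix of [1], whose entries are (in terms of the local feature functions \( f_m \) that sum to the global features \( F_m \))

\[ M_i[y^\prime,y] = \exp\Big(\sum_{m=1}^M \lambda_m f_m(y^\prime,y,x,i)\Big) \,. \]

Unrolling the forward recursion from \( \alpha_0 = 1 \) shows that the normalization factor is the matrix product \( Z_\lambda(x) = 1\,M_1 M_2 \cdots M_n\,1^T \).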

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:

\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]

\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]

Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented:
• Edge Feature: transition feature encoding the transition from the previous label to the current label.
• Regex Feature: fired when the current token matches one of the user-supplied regular expressions.
• Start Feature: fired when the current token is the first token of a sequence.
• End Feature: fired when the current token is the last token of a sequence.
• Word Feature: fired when the current token is observed in the trained dictionary.
• Unknown Feature: fired when the current token is not (or only rarely) observed in the trained dictionary.

A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
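
The decoder follows the standard Viterbi dynamic-programming recursion (cf. [4], [5]); a sketch in the matrix notation above is

\[ \delta_0(y) = 1, \qquad \delta_i(y) = \max_{y^\prime} \delta_{i-1}(y^\prime)\,M_i[y^\prime,y], \qquad \Pr(\text{best label sequence} \mid \text{sequence}) = \frac{\max_y \delta_n(y)}{Z_\lambda(x)} \,, \]

with the best label sequence itself recovered by backtracking through the maximizing choices.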

For a full example of how to use the MADlib CRF modules for a text analytics application, see the "Example" section below.

Input:
  • User-provided input:
    The user is expected to provide at least the label table, the regular expression table, and the segment table (a minimal CREATE TABLE sketch for these appears at the end of this section):
    {TABLE|VIEW} labelTableName (
        ...
        id INTEGER,
        label TEXT,
        ...
    )
    where id is a unique ID for the label and label is the label name.
    {TABLE|VIEW} regexTableName (
        ...
        pattern TEXT,
        name TEXT,
        ...
    )
    where pattern is a regular expression pattern (e.g. '^.+ing$') and name is a name for the regular expression pattern (e.g. 'endsWithIng').
    {TABLE|VIEW} segmentTableName (
        ...
        start_pos INTEGER,
        doc_id INTEGER,
        seg_text TEXT,
        label INTEGER,
        max_pos INTEGER,
        ...
    )
    where start_pos is the position of the word in the sequence, doc_id is a unique ID for the sequence, seg_text is the word, label is the label for the word, and max_pos is the length of the sequence.
  • Training (lincrf) input:
    The feature table used for training is expected to be of the following form (this table can also be generated by crf_train_fgen):
    {TABLE|VIEW} featureTableName (
        ...
        doc_id INTEGER,
        f_size INTEGER,
        sparse_r FLOAT8[],
        dense_m FLOAT8[],
        sparse_m FLOAT8[],
        ...
    )
    where
    • doc_id is a unique ID for the sequence
    • f_size is the number of features
    • sparse_r is the array union of (previous label, label, feature index, start position, training existence indicator) of individual single-state features (e.g. word features, regex features) ordered by their start position
    • dense_m is the array union of (previous label, label, feature index, start position, training existence indicator) of edge features ordered by start position
    • sparse_m is the array union of (feature index, previous label, label) of edge features ordered by feature index. Edge features are split into dense_m and sparse_m for performance reasons.
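    For example, in the training feature table shown in the Examples section below, the sparse_r array for doc_id 2 begins {-1,13,12,0,1,...}: the first quintuple encodes a single-state feature with no previous label (-1), current label 13, feature index 12, start position 0, and existence indicator 1. Similarly, the first triple of sparse_m, {51,26,2}, says that edge feature 51 fires on a transition from label 26 to label 2.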

The set of features used for training is expected to be of the following form (this table can also be generated by crf_train_fgen):

{TABLE|VIEW} featureSetName (
    ...
    f_index INTEGER,
    f_name TEXT,
    feature_labels INTEGER[],
    ...
)

where
• f_index is a unique ID for the feature,
• f_name is the feature name, and
• feature_labels is an array of the form (previous label, label) describing the labels on which the feature fires.

The empty feature weight table (which will be populated after training) is expected to be of the following form:

{TABLE|VIEW} featureWeightsName (
    ...
    f_index INTEGER,
    f_name TEXT,
    previous_label INTEGER,
    label INTEGER,
    weight FLOAT8,
    ...
)
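
As a concrete illustration, a minimal set of CREATE TABLE statements matching the user-provided schemas above (using the table names from the Examples section below; additional columns may be added freely) is:

    CREATE TABLE crf_label        (id INTEGER, label TEXT);
    CREATE TABLE crf_regex        (pattern TEXT, name TEXT);
    CREATE TABLE train_segmenttbl (start_pos INTEGER, doc_id INTEGER, seg_text TEXT,
                                   label INTEGER, max_pos INTEGER);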
Usage:
  • Get number of iterations and weights for features:

    SELECT * FROM lincrf(
        'featureTableName', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', tag_size, 'feature_set', 'featureWeightsName'
        [, maxNumberOfIterations ]
    );

    where tag_size is the total number of labels.

    Output:

     lincrf
    -----------------
     [number of iterations]

    featureWeightsName:

     id |      name      | prev_label_id | label_id |      weight       
    ----+----------------+---------------+----------+-------------------
    
  • Generate text features, calculate their weights, and output the best label sequence for test data:
    1. Create tables to store the input data, intermediate data, and output data. Also import the training data to the database.
      SELECT madlib.crf_train_data(
               '/path/to/data');
    2. Generate text analytics features for the training data.
      SELECT madlib.crf_train_fgen(
               'segmenttbl',
               'regextbl',
               'dictionary',
               'featuretbl',
               'featureset');
    3. Use linear-chain CRF for training.
      SELECT madlib.lincrf(
               'source',
               'sparse_r',
               'dense_m',
               'sparse_m',
               'f_size',
               tag_size,
               'feature_set',
               'featureWeights',
               maxNumIterations);
    4. Import the CRF model into the database. Also load the CRF testing data into the database.
      SELECT madlib.crf_test_data(
               '/path/to/data');
    5. Generate text analytics features for the testing data.
      SELECT madlib.crf_test_fgen(
               'segmenttbl',
               'dictionary',
               'labeltbl',
               'regextbl',
               'featuretbl',
               'viterbi_mtbl',
               'viterbi_rtbl');
      'viterbi_mtbl' and 'viterbi_rtbl' are table names passed as text; these tables are created by the feature generation module itself (i.e. the user does NOT create them as empty tables beforehand).
    6. Run the Viterbi function to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
      SELECT madlib.vcrf_label(
               'segmenttbl',
               'viterbi_mtbl',
               'viterbi_rtbl',
               'labeltbl',
               'resulttbl');
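
    After training, the learned feature weights can be inspected directly from the feature weight table, for example (using the placeholder table name from step 3):
      SELECT * FROM featureWeights
      ORDER BY weight DESC
      LIMIT 10;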
Examples:
  1. Load the label table, the regular expressions table, and the training segment table:
    sql> SELECT * FROM crf_label;
     id | label 
    ----+-------
      1 | CD
     13 | NNP
     15 | PDT
     17 | PRP
     29 | VBN
     31 | VBZ
     33 | WP
     35 | WRB
    ...
    
    sql> SELECT * from crf_regex;
        pattern    |         name         
    ---------------+----------------------
     ^.+ing$       | endsWithIng
     ^[A-Z][a-z]+$ | InitCapital
     ^[A-Z]+$      | isAllCapital
     ^.*[0-9]+.*$  | containsDigit
    ...
    
    sql> SELECT * from train_segmenttbl;
     start_pos | doc_id |  seg_text  | label | max_pos
    -----------+--------+------------+-------+---------
             8 |      1 | alliance   |    11 |      26
            10 |      1 | Ford       |    13 |      26
            12 |      1 | that       |     5 |      26
            24 |      1 | likely     |     6 |      26
            26 |      1 | .          |    43 |      26
             8 |      2 | interest   |    11 |      10
            10 |      2 | .          |    43 |      10
             9 |      1 | after      |     5 |      26
            11 |      1 | concluded  |    27 |      26
            23 |      1 | the        |     2 |      26
            25 |      1 | return     |    11 |      26
             9 |      2 | later      |    19 |      10
    ...
    
  2. Create the (empty) dictionary table, feature table, and feature set:
    sql> CREATE TABLE crf_dictionary(token text,total integer);
    sql> CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]);
    sql> CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
    
  3. Generate the training features:
    sql> SELECT crf_train_fgen('train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl','train_featureset');
    
    sql> SELECT * from crf_dictionary;
       token    | total 
    ------------+-------
     talks      |     1
     that       |     1
     would      |     1
     alliance   |     1
     Saab       |     2
     cost       |     1
     after      |     1
     operations |     1
    ...
    
    sql> SELECT * from train_featuretbl;
     doc_id | f_size |            sparse_r           |             dense_m             |       sparse_m
    --------+--------+-------------------------------+---------------------------------+-----------------------
          2 |     87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
          1 |     87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}
    
    sql> SELECT * from train_featureset;
     f_index |    f_name     | feature 
    ---------+---------------+---------
           1 | R_endsWithED  | {-1,29}
          13 | W_outweigh    | {-1,26}
          29 | U             | {-1,5}
          31 | U             | {-1,29}
          33 | U             | {-1,12}
          35 | W_a           | {-1,2}
          37 | W_possible    | {-1,6}
          15 | W_signaled    | {-1,29}
          17 | End.          | {-1,43}
          49 | W_'s          | {-1,16}
          63 | W_acquire     | {-1,26}
          51 | E.            | {26,2}
          69 | E.            | {29,17}
          71 | E.            | {2,11}
          83 | W_the         | {-1,2}
          85 | E.            | {16,11}
           4 | W_return      | {-1,11}
    ...
  4. Create the (empty) feature weight table:
    sql> CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float);
    
  5. Train using linear CRF:
    sql> SELECT lincrf('train_featuretbl','sparse_r','dense_m','sparse_m','f_size',45, 'train_featureset','train_crf_feature', 20);
     lincrf 
    --------
         20
    
    sql> SELECT * from train_crf_feature;
     id |     name      | prev_label_id | label_id |      weight       
    ----+---------------+---------------+----------+-------------------
      1 | R_endsWithED  |            -1 |       29 |  1.54128249293937
     13 | W_outweigh    |            -1 |       26 |  1.70691232223653
     29 | U             |            -1 |        5 |  1.40708515869008
     31 | U             |            -1 |       29 | 0.830356200936407
     33 | U             |            -1 |       12 | 0.769587378281239
     35 | W_a           |            -1 |        2 |  2.68470625883726
     37 | W_possible    |            -1 |        6 |  3.41773107604468
     15 | W_signaled    |            -1 |       29 |  1.68187039165771
     17 | End.          |            -1 |       43 |  3.07687845517082
     49 | W_'s          |            -1 |       16 |  2.61430312229883
     63 | W_acquire     |            -1 |       26 |  1.67247047385797
     51 | E.            |            26 |        2 |   3.0114240119435
     69 | E.            |            29 |       17 |  2.82385531733866
     71 | E.            |             2 |       11 |  3.00970493772732
     83 | W_the         |            -1 |        2 |  2.58742315259326
    ...
  6. To find the best labels for a test set using the trained linear-chain CRF model, repeat steps #1-2 for the test data and generate the test features, reusing the dictionary generated from the training set instead of creating a new one.
    sql> SELECT * from test_segmenttbl;
     start_pos | doc_id |  seg_text   | max_pos 
    -----------+--------+-------------+---------
             1 |      1 | collapse    |      22
            13 |      1 | ,           |      22
            15 |      1 | is          |      22
            17 |      1 | a           |      22
             4 |      1 | speculation |      22
             6 |      1 | Ford        |      22
            18 |      1 | defensive   |      22
            20 |      1 | with        |      22
    ...
    
    sql> SELECT crf_test_fgen('test_segmenttbl','crf_dictionary','crf_label','crf_regex','train_crf_feature','viterbi_mtbl','viterbi_rtbl');
    
  7. Calculate the best label sequence:
    sql> SELECT vcrf_label('test_segmenttbl','viterbi_mtbl','viterbi_rtbl','crf_label','extracted_best_labels');
    
    sql> SELECT * FROM extracted_best_labels;
     doc_id | start_pos |  seg_text   | label | id | prob  
    --------+-----------+-------------+-------+----+-------
          1 |         2 | Friday      | NNP   | 14 | 9e-06
          1 |         6 | Ford        | NNP   | 14 | 9e-06
          1 |        12 | Jaguar      | NNP   | 14 | 9e-06
          1 |         3 | prompted    | VBD   | 28 | 9e-06
          1 |         8 | intensify   | NN    | 12 | 9e-06
          1 |        14 | which       | NN    | 12 | 9e-06
          1 |        18 | defensive   | NN    | 12 | 9e-06
          1 |        21 | GM          | NN    | 12 | 9e-06
          1 |        22 | .           | .     | 44 | 9e-06
          1 |         1 | collapse    | CC    |  1 | 9e-06
          1 |         7 | would       | POS   | 17 | 9e-06
    ...
    
    (Note that this example was done on a trivial training and test data set.)
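
    To read the labeled test text back in order, sort the result table by document and position:
    sql> SELECT doc_id, start_pos, seg_text, label
         FROM extracted_best_labels
         ORDER BY doc_id, start_pos;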
Literature:
[1] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields, http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf

[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field

[3] A. Jaiswal, S. Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/

[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html

[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm

[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782

[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html

See Also
Files crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)