MADlib  1.4.1
User Documentation
Conditional Random Field
Warning
This MADlib method is still in early stage development. There may be some issues that will be addressed in a future version. Interface and implementation are subject to change.

A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.

Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented: edge, start, end, word, unknown, and regular-expression (regex) features.

A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).

Training Function
Get number of iterations and weights for features:
lincrf( source, 
        sparse_R, 
        dense_M, 
        sparse_M, 
        featureSize, 
        tagSize, 
        featureset, 
        crf_feature,
        maxNumIterations
      )

Arguments

source
Name of the source relation containing the training data
sparse_R
Name of the sparse single state feature column (of type DOUBLE PRECISION[])
dense_M
Name of the dense two state feature column (of type DOUBLE PRECISION[])
sparse_M
Name of the sparse two state feature column (of type DOUBLE PRECISION[])
featureSize
Name of feature size column (of type DOUBLE PRECISION)
tagSize
The number of tags in the tag set
featureset
The unique feature set
crf_feature
Name of the output feature table
maxNumIterations
The maximum number of iterations

The features and weights are stored in the table named by crf_feature. This function returns a composite value containing the following columns:

coef FLOAT8[]. Array of coefficients
log_likelihood FLOAT8. Log-likelihood
num_iterations INTEGER. The number of iterations before the algorithm terminated
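
After training, the weights stored in the table named by crf_feature can be inspected with ordinary SQL. The query below is only an illustration; the table and column names follow the train_crf_feature table created in the Examples section and are not a fixed part of the interface:

-- Illustration only: show the ten features with the largest learned weights
-- (table and column names as in the Examples section).
SELECT name, prev_label_id, label_id, weight
FROM train_crf_feature
ORDER BY weight DESC
LIMIT 10;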

Using CRF

Generate text features, calculate their weights, and output the best label sequence for test data:

  1. Create tables to store the input data, intermediate data, and output data. Also import the training data to the database.
        SELECT madlib.crf_train_data( '/path/to/data');
        
  2. Generate text analytics features for the training data.
    SELECT madlib.crf_train_fgen(
             'segmenttbl',
             'regextbl',
             'dictionary',
             'featuretbl',
             'featureset');
  3. Use linear-chain CRF for training.
    SELECT madlib.lincrf(
             'source',
             'sparse_r',
             'dense_m',
             'sparse_m',
             'f_size',
             tag_size,
             'feature_set',
             'featureWeights',
             maxNumIterations);
  4. Import CRF model to the database. Also load the CRF testing data to the database.
    SELECT madlib.crf_test_data(
             '/path/to/data');
  5. Generate text analytics features for the testing data.
    SELECT madlib.crf_test_fgen(
             'segmenttbl',
             'dictionary',
             'labeltbl',
             'regextbl',
             'featuretbl',
             'viterbi_mtbl',
             'viterbi_rtbl');
    'viterbi_mtbl' and 'viterbi_rtbl' are simply text arguments naming the tables created by the feature generation module (i.e. they are NOT empty tables).
  6. Run the Viterbi function to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
    SELECT madlib.vcrf_label(
             'segmenttbl',
             'viterbi_mtbl',
             'viterbi_rtbl',
             'labeltbl',
             'resulttbl');

Input
  • User-provided input:
    The user is expected to at least provide the label table, the regular expression table, and the segment table (a minimal DDL sketch for these appears after this list):
    {TABLE|VIEW} labelTableName (
        ...
        id INTEGER,
        label TEXT,
        ...
    )
    where id is a unique ID for the label and label is the label name.
    {TABLE|VIEW} regexTableName (
        ...
        pattern TEXT,
        name TEXT,
        ...
    )
    where pattern is a regular expression pattern (e.g. '^.+ing$') and name is a name for the regular expression pattern (e.g. 'endsWithIng').
    {TABLE|VIEW} segmentTableName (
        ...
        start_pos INTEGER,
        doc_id INTEGER,
        seg_text TEXT,
        label INTEGER,
        max_pos INTEGER,
        ...
    )
    where start_pos is the position of the word in the sequence, doc_id is a unique ID for the sequence, seg_text is the word, label is the label for the word, and max_pos is the length of the sequence.
  • Training (lincrf) input:
    The feature table used for training is expected to be of the following form (this table can also be generated by crf_train_fgen):
    {TABLE|VIEW} featureTableName (
        ...
        doc_id INTEGER,
        f_size INTEGER,
        sparse_r FLOAT8[],
        dense_m FLOAT8[],
        sparse_m FLOAT8[],
        ...
    )
    where
    • doc_id is a unique ID for the sequence
    • f_size is the number of features
    • sparse_r is the array union of (previous label, label, feature index, start position, training existence indicator) of individual single-state features (e.g. word features, regex features) ordered by their start position
    • dense_m is the array union of (previous label, label, feature index, start position, training existence indicator) of edge features ordered by start position
    • sparse_m is the array union of (feature index, previous label, label) of edge features ordered by feature index. Edge features are split into dense_m and sparse_m for performance reasons.
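
For concreteness, the three user-provided tables described above could be created with DDL along the following lines. This is only a sketch: the table names match those used in the Examples section (crf_label, crf_regex, train_segmenttbl), and any extra columns implied by the '...' placeholders are omitted:

-- Sketch only: minimal DDL for the user-provided input tables.
-- Table names follow the Examples section; additional columns may be added.
CREATE TABLE crf_label (
    id INTEGER,          -- unique ID for the label
    label TEXT           -- label name, e.g. 'NNP'
);

CREATE TABLE crf_regex (
    pattern TEXT,        -- regular expression pattern, e.g. '^.+ing$'
    name TEXT            -- name of the pattern, e.g. 'endsWithIng'
);

CREATE TABLE train_segmenttbl (
    start_pos INTEGER,   -- position of the word within the sequence
    doc_id INTEGER,      -- unique ID for the sequence
    seg_text TEXT,       -- the word itself
    label INTEGER,       -- label ID for the word
    max_pos INTEGER      -- length of the sequence
);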

The set of features used for training is expected to be of the following form (this table can also be generated by crf_train_fgen):

{TABLE|VIEW} featureSetName (
    ...
    f_index INTEGER,
    f_name TEXT,
    feature_labels INTEGER[],
    ...
)

where

  • f_index is a unique ID for the feature
  • f_name is the feature name
  • feature_labels is an array of {previous label, label} for the feature

The empty feature weight table (which will be populated after training) is expected to be of the following form:

{TABLE|VIEW} featureWeightsName (
    ...
    f_index INTEGER,
    f_name TEXT,
    previous_label INTEGER,
    label INTEGER,
    weight FLOAT8,
    ...
)

Examples
This example uses a trivial training and test data set.
  1. Load the label table, the regular expressions table, and the training segment table:
     
    SELECT * FROM crf_label;
    
    Result:
     id | label 
     ---+-------
      1 | CD
     13 | NNP
     15 | PDT
     17 | PRP
     29 | VBN
     31 | VBZ
     33 | WP
     35 | WRB
    ...
    
    The regular expressions table:
    SELECT * from crf_regex;
    
        pattern    |         name         
     --------------+----------------------
     ^.+ing$       | endsWithIng
     ^[A-Z][a-z]+$ | InitCapital
     ^[A-Z]+$      | isAllCapital
     ^.*[0-9]+.*$  | containsDigit
    ...
    
    The training segment table:
    SELECT * from train_segmenttbl;
    
     start_pos | doc_id |  seg_text  | label | max_pos
     ----------+--------+------------+-------+---------
             8 |      1 | alliance   |    11 |      26
            10 |      1 | Ford       |    13 |      26
            12 |      1 | that       |     5 |      26
            24 |      1 | likely     |     6 |      26
            26 |      1 | .          |    43 |      26
             8 |      2 | interest   |    11 |      10
            10 |      2 | .          |    43 |      10
             9 |      1 | after      |     5 |      26
            11 |      1 | concluded  |    27 |      26
            23 |      1 | the        |     2 |      26
            25 |      1 | return     |    11 |      26
             9 |      2 | later      |    19 |      10
    ...
    
  2. Create the (empty) dictionary table, feature table, and feature set:
    CREATE TABLE crf_dictionary(token text,total integer);
    CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]);
    CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
    
  3. Generate the training features:
    SELECT crf_train_fgen( 'train_segmenttbl', 
                           'crf_regex', 
                           'crf_dictionary', 
                           'train_featuretbl',
                           'train_featureset'
                         );
    SELECT * from crf_dictionary;
    
    Result:
       token    | total 
     -----------+-------
     talks      |     1
     that       |     1
     would      |     1
     alliance   |     1
     Saab       |     2
     cost       |     1
     after      |     1
     operations |     1
    ...
    
    SELECT * from train_featuretbl;
    
    Result:
     doc_id | f_size |            sparse_r           |             dense_m             |       sparse_m
     -------+--------+-------------------------------+---------------------------------+-----------------------
          2 |     87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
          1 |     87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}
    
    SELECT * from train_featureset;
    
     f_index |    f_name     | feature 
     --------+---------------+---------
           1 | R_endsWithED  | {-1,29}
          13 | W_outweigh    | {-1,26}
          29 | U             | {-1,5}
          31 | U             | {-1,29}
          33 | U             | {-1,12}
          35 | W_a           | {-1,2}
          37 | W_possible    | {-1,6}
          15 | W_signaled    | {-1,29}
          17 | End.          | {-1,43}
          49 | W_'s          | {-1,16}
          63 | W_acquire     | {-1,26}
          51 | E.            | {26,2}
          69 | E.            | {29,17}
          71 | E.            | {2,11}
          83 | W_the         | {-1,2}
          85 | E.            | {16,11}
           4 | W_return      | {-1,11}
    ...
    
  4. Create the (empty) feature weight table:
    CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float);
    
  5. Train using linear CRF:
    SELECT lincrf( 'train_featuretbl',
                   'sparse_r',
                   'dense_m',
                   'sparse_m',
                   'f_size',
                   45,
                   'train_featureset',
                   'train_crf_feature', 
                   20
                 );
    
     lincrf 
     -------
         20
    
    View the feature weight table.
    SELECT * from train_crf_feature;
    
    Result:
     id |     name      | prev_label_id | label_id |      weight       
     ---+---------------+---------------+----------+-------------------
      1 | R_endsWithED  |            -1 |       29 |  1.54128249293937
     13 | W_outweigh    |            -1 |       26 |  1.70691232223653
     29 | U             |            -1 |        5 |  1.40708515869008
     31 | U             |            -1 |       29 | 0.830356200936407
     33 | U             |            -1 |       12 | 0.769587378281239
     35 | W_a           |            -1 |        2 |  2.68470625883726
     37 | W_possible    |            -1 |        6 |  3.41773107604468
     15 | W_signaled    |            -1 |       29 |  1.68187039165771
     17 | End.          |            -1 |       43 |  3.07687845517082
     49 | W_'s          |            -1 |       16 |  2.61430312229883
     63 | W_acquire     |            -1 |       26 |  1.67247047385797
     51 | E.            |            26 |        2 |   3.0114240119435
     69 | E.            |            29 |       17 |  2.82385531733866
     71 | E.            |             2 |       11 |  3.00970493772732
     83 | W_the         |            -1 |        2 |  2.58742315259326
    ...
    
  6. To find the best labels for a test set using the trained linear-chain CRF model, repeat steps 1 and 2 for the test data and generate the test features; instead of creating a new dictionary, reuse the dictionary generated from the training set.
    SELECT * from test_segmenttbl;
    
    Result:
     start_pos | doc_id |  seg_text   | max_pos 
     ----------+--------+-------------+---------
             1 |      1 | collapse    |      22
            13 |      1 | ,           |      22
            15 |      1 | is          |      22
            17 |      1 | a           |      22
             4 |      1 | speculation |      22
             6 |      1 | Ford        |      22
            18 |      1 | defensive   |      22
            20 |      1 | with        |      22
    ...
    
    SELECT crf_test_fgen( 'test_segmenttbl',
                          'crf_dictionary',
                          'crf_label',
                          'crf_regex',
                          'train_crf_feature',
                          'viterbi_mtbl',
                          'viterbi_rtbl'
                        );
    
  7. Calculate the best label sequence and save in the table extracted_best_labels.
    SELECT vcrf_label( 'test_segmenttbl',
                       'viterbi_mtbl',
                       'viterbi_rtbl',
                       'crf_label',
                       'extracted_best_labels'
                     );
    
    View the best labels.
    SELECT * FROM extracted_best_labels;
    
    Result:
     doc_id | start_pos |  seg_text   | label | id | prob  
     -------+-----------+-------------+-------+----+-------
          1 |         2 | Friday      | NNP   | 14 | 9e-06
          1 |         6 | Ford        | NNP   | 14 | 9e-06
          1 |        12 | Jaguar      | NNP   | 14 | 9e-06
          1 |         3 | prompted    | VBD   | 28 | 9e-06
          1 |         8 | intensify   | NN    | 12 | 9e-06
          1 |        14 | which       | NN    | 12 | 9e-06
          1 |        18 | defensive   | NN    | 12 | 9e-06
          1 |        21 | GM          | NN    | 12 | 9e-06
          1 |        22 | .           | .     | 44 | 9e-06
          1 |         1 | collapse    | CC    |  1 | 9e-06
          1 |         7 | would       | POS   | 17 | 9e-06
    ...
    

Technical Background

A linear-chain CRF is a distribution defined by

\[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)} \,. \]

where

  • \( F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i, y_{i-1}, x_i) \) is a global feature function, the sum of a local feature function \( f_m \) over all positions of a sequence \( \boldsymbol x \) of length \( n \)
  • \( \lambda_m \) is the weight of the \( m \)-th feature
  • \( Z_\lambda(\boldsymbol x) \) is a normalization (partition) function that makes the probabilities sum to one

A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).

The log-likelihood is defined as

\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]

and the zero of its gradient

\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]

is found, since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variant of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
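
As a schematic sketch (not drawn from the MADlib source), a quasi-Newton method of this kind repeatedly updates the weights in a direction given by an approximate inverse Hessian:

\[ \lambda^{(t+1)} = \lambda^{(t)} + \alpha_t H_t \, \nabla \ell_{\lambda}\big(\lambda^{(t)}\big) \]

where \( H_t \) approximates the inverse Hessian of the log-likelihood and \( \alpha_t \) is a step size chosen by line search. L-BFGS never stores \( H_t \) explicitly; it reconstructs the product \( H_t \nabla \ell_{\lambda} \) from only the most recent gradient and parameter differences, keeping memory usage linear in the number of features \( M \).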

\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm:

\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]

\[ Z_\lambda(x) = \alpha_n \cdot 1^T \]

where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by

\[ \alpha_i = \begin{cases} \alpha_{i-1} M_i, & 0 < i \le n \\ 1, & i = 0 \end{cases} \]

\[ \beta_i^T = \begin{cases} M_{i+1} \beta_{i+1}^T, & 1 \le i < n \\ 1, & i = n \end{cases} \]
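
As a small worked illustration (not part of the original text), for a sequence of length \( n = 2 \) these recursions unroll to

\[ \alpha_1 = \alpha_0 M_1 = 1\,M_1, \qquad \alpha_2 = \alpha_1 M_2, \qquad \beta_2^T = 1, \qquad \beta_1^T = M_2\,\beta_2^T \]

so that

\[ Z_\lambda(x) = \alpha_2 \cdot 1^T = 1\,M_1 M_2\,1^T, \qquad E_{p_\lambda(Y|x)}[F(x,Y)] = \frac{\alpha_0(f_1*M_1)\beta_1^T + \alpha_1(f_2*M_2)\beta_2^T}{Z_\lambda(x)} \,. \]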

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:

\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]

\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]


Related Topics

Files crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)