A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.
Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity recognition (NER). Currently, six feature types are implemented:
- Edge Feature: transition feature that encodes the label sequence
- Start Feature: fired when the current token is the first token in a sequence
- End Feature: fired when the current token is the last token in a sequence
- Word Feature: fired when the current token is observed in the trained dictionary
- Unknown Feature: fired when the current token is not observed in the trained dictionary at least a certain number of times
- Regex Feature: fired when the current token can be matched by a regular expression
A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
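To illustrate what Viterbi decoding computes, here is a minimal pure-Python sketch on a toy two-label model (the transition scores are made-up numbers, not MADlib output). It finds the best label sequence by dynamic programming and obtains its conditional probability by brute-force normalization:

```python
import itertools
import math

# Toy linear-chain model in log space: score(y) = sum_i M[(y[i-1], y[i])].
# Label -1 is a virtual start state. All numbers are arbitrary, for
# illustration only.
M = {(-1, 0): 1.0, (-1, 1): 0.2,
     (0, 0): 0.3, (0, 1): 1.1,
     (1, 0): 0.8, (1, 1): 0.1}
labels = [0, 1]
n = 3  # sequence length

def viterbi(n):
    # delta[y] = best log-score of any prefix ending in label y
    delta = {y: M[(-1, y)] for y in labels}
    back = []
    for _ in range(1, n):
        prev = delta
        delta, ptr = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: prev[yp] + M[(yp, y)])
            delta[y] = prev[best_prev] + M[(best_prev, y)]
            ptr[y] = best_prev
        back.append(ptr)
    # Trace back the best path from the best final label
    y_last = max(labels, key=lambda y: delta[y])
    path = [y_last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[y_last]

path, best_score = viterbi(n)

# Pr(best label sequence | sequence) = exp(score) / Z, where Z sums
# exp(score) over all label sequences (brute force on this tiny example).
def score(seq):
    return sum(M[(a, b)] for a, b in zip((-1,) + tuple(seq), seq))

Z = sum(math.exp(score(seq)) for seq in itertools.product(labels, repeat=n))
prob = math.exp(best_score) / Z
print(path, round(prob, 4))
```

In MADlib the equivalent computation runs inside the database over the `viterbi_mtbl` and `viterbi_rtbl` tables; this sketch only mirrors the recurrence.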
lincrf( source, sparse_R, dense_M, sparse_M, featureSize, tagSize, featureset, crf_feature, maxNumIterations )
Arguments
The features and weights are stored in the table named by crf_feature. This function returns a composite value containing the following columns:
coef | FLOAT8[]. Array of coefficients.
log_likelihood | FLOAT8. Log-likelihood.
num_iterations | INTEGER. The number of iterations before the algorithm terminated.
Generate text features, calculate their weights, and output the best label sequence for test data:
SELECT madlib.crf_train_data( '/path/to/data');
SELECT madlib.crf_train_fgen( 'segmenttbl', 'regextbl', 'dictionary', 'featuretbl', 'featureset');
SELECT madlib.lincrf( 'source', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', tag_size, 'feature_set', 'featureWeights', maxNumIterations);
SELECT madlib.crf_test_data( '/path/to/data');
SELECT madlib.crf_test_fgen( 'segmenttbl', 'dictionary', 'labeltbl', 'regextbl', 'featuretbl', 'viterbi_mtbl', 'viterbi_rtbl');
Note that 'viterbi_mtbl' and 'viterbi_rtbl' are simply text arguments naming tables created by the feature generation module (i.e. they are NOT empty tables).
SELECT madlib.vcrf_label( 'segmenttbl', 'viterbi_mtbl', 'viterbi_rtbl', 'labeltbl', 'resulttbl');
{TABLE|VIEW} labelTableName ( ... id INTEGER, label TEXT, ... )
where id is a unique ID for the label and label is the label name.
{TABLE|VIEW} regexTableName ( ... pattern TEXT, name TEXT, ... )
where pattern is a regular expression pattern (e.g. '^.+ing$') and name is a name for the regular expression pattern (e.g. 'endsWithIng').
{TABLE|VIEW} segmentTableName ( ... start_pos INTEGER, doc_id INTEGER, seg_text TEXT, label INTEGER, max_pos INTEGER, ... )
where start_pos is the position of the word in the sequence, doc_id is a unique ID for the sequence, seg_text is the word, label is the label for the word, and max_pos is the length of the sequence.
{TABLE|VIEW} featureTableName ( ... doc_id INTEGER, f_size INTEGER, sparse_r FLOAT8[], dense_m FLOAT8[], sparse_m FLOAT8[], ... )
where
The set of features used for training is expected to be of the following form (it can also be generated by crf_train_fgen):
{TABLE|VIEW} featureSetName ( ... f_index INTEGER, f_name TEXT, feature_labels INTEGER[], ... )
where
The empty feature weight table (which will be populated after training) is expected to be of the following form:
{TABLE|VIEW} featureWeightsName ( ... f_index INTEGER, f_name TEXT, previous_label INTEGER, label INTEGER, weight FLOAT8, ... )
SELECT * FROM crf_label;
Result:
 id | label
----+-------
  1 | CD
 13 | NNP
 15 | PDT
 17 | PRP
 29 | VBN
 31 | VBZ
 33 | WP
 35 | WRB
...
The regular expressions table:
SELECT * FROM crf_regex;
    pattern    |     name
---------------+---------------
 ^.+ing$       | endsWithIng
 ^[A-Z][a-z]+$ | InitCapital
 ^[A-Z]+$      | isAllCapital
 ^.*[0-9]+.*$  | containsDigit
...
The training segment table:
SELECT * FROM train_segmenttbl;
 start_pos | doc_id | seg_text  | label | max_pos
-----------+--------+-----------+-------+---------
         8 |      1 | alliance  |    11 |      26
        10 |      1 | Ford      |    13 |      26
        12 |      1 | that      |     5 |      26
        24 |      1 | likely    |     6 |      26
        26 |      1 | .         |    43 |      26
         8 |      2 | interest  |    11 |      10
        10 |      2 | .         |    43 |      10
         9 |      1 | after     |     5 |      26
        11 |      1 | concluded |    27 |      26
        23 |      1 | the       |     2 |      26
        25 |      1 | return    |    11 |      26
         9 |      2 | later     |    19 |      10
...
CREATE TABLE crf_dictionary(token text, total integer);
CREATE TABLE train_featuretbl(doc_id integer, f_size integer, sparse_r FLOAT8[], dense_m FLOAT8[], sparse_m FLOAT8[]);
CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
SELECT crf_train_fgen( 'train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl', 'train_featureset' );
SELECT * FROM crf_dictionary;
Result:
   token    | total
------------+-------
 talks      |     1
 that       |     1
 would      |     1
 alliance   |     1
 Saab       |     2
 cost       |     1
 after      |     1
 operations |     1
...
SELECT * FROM train_featuretbl;
Result:
 doc_id | f_size |           sparse_r            |             dense_m             |        sparse_m
--------+--------+-------------------------------+---------------------------------+------------------------
      2 |     87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
      1 |     87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}
SELECT * FROM train_featureset;
 f_index |    f_name    | feature
---------+--------------+---------
       1 | R_endsWithED | {-1,29}
      13 | W_outweigh   | {-1,26}
      29 | U            | {-1,5}
      31 | U            | {-1,29}
      33 | U            | {-1,12}
      35 | W_a          | {-1,2}
      37 | W_possible   | {-1,6}
      15 | W_signaled   | {-1,29}
      17 | End.         | {-1,43}
      49 | W_'s         | {-1,16}
      63 | W_acquire    | {-1,26}
      51 | E.           | {26,2}
      69 | E.           | {29,17}
      71 | E.           | {2,11}
      83 | W_the        | {-1,2}
      85 | E.           | {16,11}
       4 | W_return     | {-1,11}
...
CREATE TABLE train_crf_feature(id integer, name text, prev_label_id integer, label_id integer, weight float);
SELECT lincrf( 'train_featuretbl', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', 45, 'train_featureset', 'train_crf_feature', 20 );
 lincrf
--------
     20
View the feature weight table.
SELECT * FROM train_crf_feature;
Result:
 id |     name     | prev_label_id | label_id |      weight
----+--------------+---------------+----------+-------------------
  1 | R_endsWithED |            -1 |       29 |  1.54128249293937
 13 | W_outweigh   |            -1 |       26 |  1.70691232223653
 29 | U            |            -1 |        5 |  1.40708515869008
 31 | U            |            -1 |       29 | 0.830356200936407
 33 | U            |            -1 |       12 | 0.769587378281239
 35 | W_a          |            -1 |        2 |  2.68470625883726
 37 | W_possible   |            -1 |        6 |  3.41773107604468
 15 | W_signaled   |            -1 |       29 |  1.68187039165771
 17 | End.         |            -1 |       43 |  3.07687845517082
 49 | W_'s         |            -1 |       16 |  2.61430312229883
 63 | W_acquire    |            -1 |       26 |  1.67247047385797
 51 | E.           |            26 |        2 |   3.0114240119435
 69 | E.           |            29 |       17 |  2.82385531733866
 71 | E.           |             2 |       11 |  3.00970493772732
 83 | W_the        |            -1 |        2 |  2.58742315259326
...
SELECT * FROM test_segmenttbl;
Result:
 start_pos | doc_id |  seg_text   | max_pos
-----------+--------+-------------+---------
         1 |      1 | collapse    |      22
        13 |      1 | ,           |      22
        15 |      1 | is          |      22
        17 |      1 | a           |      22
         4 |      1 | speculation |      22
         6 |      1 | Ford        |      22
        18 |      1 | defensive   |      22
        20 |      1 | with        |      22
...
SELECT crf_test_fgen( 'test_segmenttbl', 'crf_dictionary', 'crf_label', 'crf_regex', 'train_crf_feature', 'viterbi_mtbl', 'viterbi_rtbl' );
Calculate the best label sequence; the results are stored in the table extracted_best_labels.
SELECT vcrf_label( 'test_segmenttbl', 'viterbi_mtbl', 'viterbi_rtbl', 'crf_label', 'extracted_best_labels' );
View the best labels.
SELECT * FROM extracted_best_labels;
Result:
 doc_id | start_pos |  seg_text   | label | id | prob
--------+-----------+-------------+-------+----+-------
      1 |         2 | Friday      | NNP   | 14 | 9e-06
      1 |         6 | Ford        | NNP   | 14 | 9e-06
      1 |        12 | Jaguar      | NNP   | 14 | 9e-06
      1 |         3 | prompted    | VBD   | 28 | 9e-06
      1 |         8 | intensify   | NN    | 12 | 9e-06
      1 |        14 | which       | NN    | 12 | 9e-06
      1 |        18 | defensive   | NN    | 12 | 9e-06
      1 |        21 | GM          | NN    | 12 | 9e-06
      1 |        22 | .           | .     | 44 | 9e-06
      1 |         1 | collapse    | CC    |  1 | 9e-06
      1 |         7 | would       | POS   | 17 | 9e-06
...
Specifically, a linear-chain CRF is a distribution defined by
\[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)}, \]
where
\[ Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')} \]
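To make the definition concrete, the following sketch evaluates \( p_\lambda(y|x) \) by brute force on a toy model (the feature functions and weights below are invented for illustration). The partition function \( Z_\lambda(x) \) sums over all label sequences, so the probabilities sum to one:

```python
import itertools
import math

labels = [0, 1]
x = ["w1", "w2", "w3"]  # toy observation sequence

# Global feature functions F_m(x, y): counts of a transition pattern and
# of an observation/label co-occurrence (made-up features for this sketch).
def F(x, y):
    f1 = sum(1 for a, b in zip(y, y[1:]) if (a, b) == (0, 1))  # 0->1 transitions
    f2 = sum(1 for xi, yi in zip(x, y) if yi == 1)             # label-1 count
    return (f1, f2)

lam = (0.5, 1.0)  # weights lambda_m

def unnorm(y):
    # exp(sum_m lambda_m F_m(x, y)), the unnormalized score
    return math.exp(sum(l * f for l, f in zip(lam, F(x, y))))

# Partition function Z_lambda(x): sum over all |labels|^n label sequences
Z = sum(unnorm(y) for y in itertools.product(labels, repeat=len(x)))

def p(y):
    return unnorm(y) / Z

# Sanity check: probabilities over all sequences sum to 1
total = sum(p(y) for y in itertools.product(labels, repeat=len(x)))
print(round(total, 10))
```

Brute-force enumeration is exponential in the sequence length; the forward-backward recursions described below compute the same \( Z_\lambda(x) \) in linear time.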
A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).
The log-likelihood is defined as
\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]
and its maximum is found where the gradient
\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]
vanishes, since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variant of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm:
\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]
\[ Z_\lambda(x) = \alpha_n \mathbf{1}^T \]
where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by
\[ \alpha_i = \begin{cases} \alpha_{i-1}M_i, & 0 < i \leq n\\ 1, & i = 0 \end{cases} \]
\[ \beta_i^T = \begin{cases} M_{i+1}\beta_{i+1}^T, & 1 \leq i < n\\ 1, & i = n \end{cases} \]
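These recursions can be checked numerically: for arbitrary positive transition matrices \( M_i \), the forward value \( \alpha_n \mathbf{1}^T \) and the corresponding backward value both equal the brute-force sum over all label paths. A toy sketch (the matrices are made-up, and this is unrelated to MADlib's in-database implementation):

```python
import itertools
import math

L = 2  # number of labels
# M_i[y'][y] = exp(score of label y at position i given previous label y');
# arbitrary positive entries for illustration.
Ms = [[[1.0, 2.0], [0.5, 1.5]],
      [[2.0, 0.3], [1.0, 1.0]],
      [[0.7, 1.2], [2.0, 0.4]]]
n = len(Ms)

# Forward recursion: alpha_0 = 1, alpha_i = alpha_{i-1} M_i
alpha = [1.0] * L
for M in Ms:
    alpha = [sum(alpha[yp] * M[yp][y] for yp in range(L)) for y in range(L)]
Z_forward = sum(alpha)  # Z = alpha_n . 1^T

# Backward recursion: beta_n = 1, beta_i^T = M_{i+1} beta_{i+1}^T
beta = [1.0] * L
for M in reversed(Ms):
    beta = [sum(M[y][yn] * beta[yn] for yn in range(L)) for y in range(L)]
Z_backward = sum(beta)  # 1 . beta_0^T, the same Z

# Brute force: product of matrix entries along each label path, summed
# over all (n+1)-long paths (the start label is marginalized like alpha_0)
Z_brute = sum(
    math.prod(Ms[i][seq[i]][seq[i + 1]] for i in range(n))
    for seq in itertools.product(range(L), repeat=n + 1)
)
print(Z_forward, Z_backward, Z_brute)
```

The recursions cost \( O(n L^2) \) versus \( O(L^n) \) for enumeration, which is why the expectation \( E_{p_\lambda(Y|x)}[F(x,Y)] \) is computed this way.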
To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:
\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]
\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]
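The prior's contribution to the gradient, \( -\lambda/\sigma^2 \), can be verified with a finite-difference check (a toy sketch with arbitrary values for \( \lambda \) and \( \sigma \)):

```python
# Finite-difference check that the Gaussian prior term
# -||lambda||^2 / (2 sigma^2) has gradient -lambda / sigma^2.
sigma = 2.0
lam = [1.5, -0.3, 0.7]  # arbitrary weight vector

def penalty(l):
    return -sum(x * x for x in l) / (2 * sigma ** 2)

eps = 1e-6
num_grad = []
for i in range(len(lam)):
    hi, lo = lam[:], lam[:]
    hi[i] += eps
    lo[i] -= eps
    # central difference approximation of the partial derivative
    num_grad.append((penalty(hi) - penalty(lo)) / (2 * eps))

analytic = [-x / sigma ** 2 for x in lam]
print([round(g, 6) for g in num_grad], [round(g, 6) for g in analytic])
```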
[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field
[3] A. Jaiswal, S.Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/
[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html
[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm
[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782
[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html
Files crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)