Conditional Random Field

A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.
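Equivalently, under this Markov assumption the label at each position is conditionally independent of all earlier labels given its immediate predecessor; in the notation of the Technical Background section below,

\[ \Pr(y_i \mid y_1, \ldots, y_{i-1}, \boldsymbol x) = \Pr(y_i \mid y_{i-1}, \boldsymbol x) \]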

Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented:

- Edge Feature: transition feature that encodes the dependency between the previous label and the current label
- Regex Feature: fires when the current token matches a regular expression from the regex table
- Start Feature: fires when the current token is the first token in a sequence
- End Feature: fires when the current token is the last token in a sequence
- Word Feature: fires when the current token is observed in the trained dictionary
- Unknown Feature: fires when the current token is not observed in the trained dictionary

A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
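For reference, the best label sequence is found with the standard Viterbi dynamic program over the trained model scores; with the feature weights \( \lambda_m \) and local feature functions \( f_m \) as defined in the Technical Background section, the recursion is

\[ V(i, y) = \max_{y'} \left[ V(i-1, y') + \sum_{m=1}^M \lambda_m f_m(y, y', x_i) \right] \]

where \( V(i, y) \) is the score of the best label sequence ending in label \( y \) at position \( i \), and the best sequence is recovered by backtracking from \( \max_y V(n, y) \).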

The following steps are required for CRF learning and inference:

  1. Training Feature Generation
  2. CRF Training
  3. Testing Feature Generation
  4. Inference using Viterbi

Training Feature Generation
The function takes train_segment_tbl, regex_tbl, and label_tbl as input and performs feature generation, producing three tables (dictionary_tbl, train_feature_tbl, and train_featureset_tbl) that are required as input for CRF training.
crf_train_fgen(train_segment_tbl,
               regex_tbl,
               label_tbl,
               dictionary_tbl,
               train_feature_tbl,
               train_featureset_tbl)
Arguments
train_segment_tbl
TEXT. Name of the training segment table. The table is expected to have the following columns:
doc_id INTEGER. Document id column
start_pos INTEGER. Index of a particular term in the respective document
seg_text TEXT. Term at the respective start_pos in the document
label INTEGER. Label id for the term corresponding to the actual label from label_tbl
regex_tbl
TEXT. Name of the regular expression table. The table is expected to have the following columns:
pattern TEXT. Regular Expression
name TEXT. Regular Expression name
label_tbl
TEXT. Name of the table containing unique labels and their id's. The table is expected to have the following columns:
id INTEGER. Unique label id. NOTE: Ids must range from 0 to (total number of labels in the table - 1).
label TEXT. Label name
dictionary_tbl
TEXT. Name of the dictionary table to be created containing unique terms along with their counts. The table will have the following columns:
token TEXT. Contains all the unique terms found in train_segment_tbl
total INTEGER. Respective counts for the terms
train_feature_tbl

TEXT. Name of the training feature table to be created. The table will have the following columns:

doc_id INTEGER. Document id
f_size INTEGER. Feature set size. This value will be the same for all the tuples in the table (see the sanity-check sketch after this argument list)
sparse_r DOUBLE PRECISION[]. Array union of individual single-state features (previous label, label, feature index, start position, training existence indicator), ordered by start position
dense_m DOUBLE PRECISION[]. Array union of (previous label, label, feature index, start position, training existence indicator) of edge features, ordered by start position
sparse_m DOUBLE PRECISION[]. Array union of (feature index, previous label, label) of edge features, ordered by feature index

train_featureset_tbl
TEXT. Name of the table to be created containing distinct featuresets generated from training feature extraction. The table will have the following columns:
f_index INTEGER. Column containing distinct featureset ids
f_name TEXT. Feature name
feature ARRAY. Feature value. The value is of the form [L1, L2]
- If L1 = -1: represents a single-state feature, with L2 being the current label id
- If L1 != -1: represents a transition feature, with L1 being the previous label id and L2 the current label id
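A quick way to inspect the generated tables is sketched below. The names train_feature_tbl and train_featureset_tbl are the placeholder argument names from above; substitute whatever table names you passed to crf_train_fgen().

    -- Sanity check: f_size must be identical for every document,
    -- so this query should return exactly one row.
    SELECT DISTINCT f_size
    FROM   train_feature_tbl;

    -- Split the featureset into single-state and transition features,
    -- using the [L1, L2] encoding described above (SQL arrays are 1-indexed).
    SELECT f_index,
           f_name,
           CASE WHEN feature[1] = -1
                THEN 'single-state, label ' || feature[2]
                ELSE 'transition, ' || feature[1] || ' -> ' || feature[2]
           END AS feature_kind
    FROM   train_featureset_tbl
    ORDER BY f_index;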

Linear-Chain CRF Training Function
The function takes the train_feature_tbl and train_featureset_tbl tables generated in the training feature generation step as input, along with other required parameters, and produces two output tables, crf_stats_tbl and crf_weights_tbl.
lincrf_train(train_feature_tbl,
             train_featureset_tbl,
             label_tbl,
             crf_stats_tbl,
             crf_weights_tbl,
             max_iterations
            )

Arguments

train_feature_tbl

TEXT. Name of the feature table generated during training feature generation

train_featureset_tbl

TEXT. Name of the featureset table generated during training feature generation

label_tbl

TEXT. Name of the label table used

crf_stats_tbl
TEXT. Name of the table to be created containing statistics for CRF training. The table has the following columns:
coef DOUBLE PRECISION[]. Array of coefficients
log_likelihood DOUBLE PRECISION. Log-likelihood
num_iterations INTEGER. The number of iterations at which the algorithm terminated
crf_weights_tbl

TEXT. Name of the table to be created containing the learned feature weights. The table has the following columns:

id INTEGER. Feature set id
name TEXT. Feature name
prev_label_id INTEGER. Label for the previous token encountered
label_id INTEGER. Label of the token with the respective feature
weight DOUBLE PRECISION. Weight for the respective feature set

max_iterations
INTEGER. The maximum number of iterations
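Once training completes, the statistics table can be used to check whether the optimizer converged before hitting the iteration cap; a minimal sketch, assuming the stats table was named crf_stats_tbl as above:

    -- If num_iterations equals max_iterations, the run may have stopped
    -- before convergence; consider raising max_iterations and retraining.
    SELECT num_iterations, log_likelihood
    FROM   crf_stats_tbl;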

Testing Feature Generation
The function takes the test_segment_tbl along with the dictionary_tbl, label_tbl, regex_tbl, and crf_weights_tbl produced in the previous steps, and generates two tables, viterbi_mtbl and viterbi_rtbl, that are required for Viterbi inference.
crf_test_fgen(test_segment_tbl,
              dictionary_tbl,
              label_tbl,
              regex_tbl,
              crf_weights_tbl,
              viterbi_mtbl,
              viterbi_rtbl
             )

Arguments

test_segment_tbl

TEXT. Name of the testing segment table. The table is expected to have the following columns:

doc_id INTEGER. Document id column
start_pos INTEGER. Index of a particular term in the respective document
seg_text TEXT. Term at the respective start_pos in the document

dictionary_tbl

TEXT. Name of the dictionary table created during training feature generation (crf_train_fgen)

label_tbl

TEXT. Name of the label table

regex_tbl

TEXT. Name of the regular expression table

crf_weights_tbl

TEXT. Name of the weights table generated during CRF training (lincrf_train)

viterbi_mtbl

TEXT. Name of the Viterbi M table to be created

viterbi_rtbl
TEXT. Name of the Viterbi R table to be created

Inference using Viterbi
vcrf_label(test_segment_tbl,
           viterbi_mtbl,
           viterbi_rtbl,
           label_tbl,
           result_tbl)
Arguments
test_segment_tbl
TEXT. Name of the testing segment table. For the required table schema, please refer to the arguments in the previous section.
viterbi_mtbl
TEXT. Name of the viterbi_mtbl table generated by the testing feature generation function crf_test_fgen.
viterbi_rtbl
TEXT. Name of the viterbi_rtbl table generated by the testing feature generation function crf_test_fgen.
label_tbl
TEXT. Name of the label table.
result_tbl
TEXT. Name of the result table to be created containing extracted best label sequences.
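After inference, the result table can be queried directly; for example, to view the extracted labels for a single document in token order (a sketch, assuming the result table was named result_tbl):

    -- Best labels and their sequence probability for document 1.
    SELECT start_pos, seg_text, label, prob
    FROM   result_tbl
    WHERE  doc_id = 1
    ORDER BY start_pos;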

Using CRF

Generate text features, calculate their weights, and output the best label sequence for test data:

  1. Perform feature generation on the training data in train_segment_tbl, generating train_feature_tbl and train_featureset_tbl.
    SELECT madlib.crf_train_fgen(
             'train_segment_tbl',
             'regex_tbl',
             'label_tbl',
             'dictionary_tbl',
             'train_feature_tbl',
             'train_featureset_tbl');
  2. Use the linear-chain CRF training function, providing the train_feature_tbl and train_featureset_tbl generated in the previous step as input.
    SELECT madlib.lincrf_train(
             'train_feature_tbl',
             'train_featureset_tbl',
             'label_tbl',
             'crf_stats_tbl',
             'crf_weights_tbl',
             max_iterations);
  3. Perform feature generation on the testing data test_segment_tbl, generating viterbi_mtbl and viterbi_rtbl required for inference.
    SELECT madlib.crf_test_fgen(
             'test_segment_tbl',
             'dictionary_tbl',
             'label_tbl',
             'regex_tbl',
             'crf_weights_tbl',
             'viterbi_mtbl',
             'viterbi_rtbl');
  4. Run the Viterbi function to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
    SELECT madlib.vcrf_label(
             'test_segment_tbl',
             'viterbi_mtbl',
             'viterbi_rtbl',
             'label_tbl',
             'result_tbl');

Examples
This example uses a trivial training and test data set.
  1. Load the label table, the regular expressions table, and the training segment table:
    CREATE TABLE crf_label (id integer,label character varying);
    INSERT INTO crf_label VALUES
    (0,'CC'),   (1,'CD'),  (2,'DT'),    (3,'EX'),   (4,'FW'),  (5,'IN'),   (6,'JJ'),  (7,'JJR'), (8,'JJS'),
    (9,'LS'),   (10,'MD'), (11,'NN'),   (12,'NNS'), (13,'NNP'),(14,'NNPS'),(15,'PDT'),(16,'POS'),(17,'PRP'),
    (18,'PRP$'),(19,'RB'), (20,'RBR'),  (21,'RBS'), (22,'RP'), (23,'SYM'), (24,'TO'), (25,'UH'), (26,'VB'),
    (27,'VBD'), (28,'VBG'),(29,'VBN'),  (30,'VBP'), (31,'VBZ'),(32,'WDT'), (33,'WP'), (34,'WP$'),(35,'WRB'),
    (36,'$'),   (37,'#'),  (38,''''''), (39,'``'),  (40,'('),  (41,')'),   (42,','),  (43,'.'),  (44,':');
    CREATE TABLE crf_regex (pattern text,name text);
    INSERT INTO crf_regex VALUES
    ('^[A-Z][a-z]+$','InitCapital'), ('^[A-Z]+$','isAllCapital'), ('^.*[0-9]+.*$','containsDigit'),
    ('^.+[.]$','endsWithDot'),       ('^.+[,]$','endsWithComma'), ('^.+er$','endsWithER'),
    ('^.+est$','endsWithEst'),       ('^.+ed$','endsWithED'),     ('^.+s$','endsWithS'),
    ('^.+ing$','endsWithIng'),       ('^.+ly$','endsWithly'),     ('^.+-.+$','isDashSeparatedWords'),
    ('^.*@.*$','isEmailId');
    CREATE TABLE train_segmenttbl(start_pos integer,doc_id integer,seg_text text,label integer,max_pos integer);
    INSERT INTO train_segmenttbl VALUES
    (0,1,'confidence',11,36),  (1,1,'in',5,36),         (2,1,'the',2,36),         (3,1,'pound',11,36),
    (4,1,'is',31,36),          (5,1,'widely',19,36),    (6,1,'expected',29,36),   (7,1,'to',24,36),
    (8,1,'take',26,36),        (9,1,'another',2,36),    (10,1,'sharp',6,36),      (11,1,'dive',11,36),
    (12,1,'if',5,36),          (13,1,'trade',11,36),    (14,1,'figures',12,36),   (15,1,'for',5,36),
    (16,1,'september',13,36),  (17,1,',',42,36),        (18,1,'due',6,36),        (19,1,'for',5,36),
    (20,1,'release',11,36),    (21,1,'tomorrow',11,36), (22,1,',',42,36),         (23,1,'fail',26,36),
    (24,1,'to',24,36),         (25,1,'show',26,36),     (26,1,'a',2,36),          (27,1,'substantial',6,36),
    (28,1,'improvement',11,36),(29,1,'from',5,36),      (30,1,'july',13,36),      (31,1,'and',0,36),
    (32,1,'august',13,36),     (33,1,'''s',16,36),      (34,1,'near-record',6,36),(35,1,'deficits',12,36),
    (36,1,'.',43,36),          (0,2,'chancellor',13,26),(1,2,'of',5,26),          (2,2,'the',2,26),
    (3,2,'exchequer',13,26),   (4,2,'nigel',13,26),     (5,2,'lawson',13,26),     (6,2,'''s',16,26),
    (7,2,'restated',29,26),    (8,2,'commitment',11,26),(9,2,'to',24,26),         (10,2,'a',2,26),
    (11,2,'firm',11,26),       (12,2,'monetary',6,26),  (13,2,'policy',11,26),    (14,2,'has',31,26),
    (15,2,'helped',29,26),     (16,2,'to',24,26),       (17,2,'prevent',26,26),   (18,2,'a',2,26),
    (19,2,'freefall',11,26),   (20,2,'in',5,26),        (21,2,'sterling',11,26),  (22,2,'over',5,26),
    (23,2,'the',2,26),         (24,2,'past',6,26),      (25,2,'week',11,26),      (26,2,'.',43,26);
    
  2. Generate the training features:
    SELECT crf_train_fgen( 'train_segmenttbl',
                           'crf_regex',
                           'crf_label',
                           'crf_dictionary',
                           'train_featuretbl',
                           'train_featureset'
                         );
    SELECT * from crf_dictionary;
    
    Result:
        token    | total
     ------------+-------
     a           |     3
     and         |     1
     august      |     1
     chancellor  |     1
     dive        |     1
     exchequer   |     1
    ...
    
    SELECT * from train_featuretbl;
    
    Result:
     doc_id | f_size |            sparse_r          |              dense_m             |       sparse_m
     -------+--------+------------------------------+----------------------------------+-----------------------
          1 |    115 | {-1,11,82,0,1,-1,2,32,0,...} | {11,5,11,1,1,5,2,8,2,1,2,11,...} | {5,2,13,11,11,5,13,...}
          2 |    115 | {-1,19,35,0,0,-1,26,38,0,..} | {13,5,66,1,1,5,2,8,2,1,2,13,...} | {5,2,13,11,11,5,13,...}
    
    SELECT * from train_featureset;
    
     f_index |    f_name     | feature
     --------+---------------+---------
           6 | W_the         | {-1,2}
           9 | R_endsWithly  | {-1,19}
          14 | W_figures     | {-1,12}
          17 | W_helped      | {-1,29}
          25 | W_show        | {-1,26}
          28 | W_'s          | {-1,16}
          33 | W_chancellor  | {-1,13}
          43 | W_over        | {-1,5}
          52 | W_trade       | {-1,11}
          10 | W_july        | {-1,13}
          21 | W_substantial | {-1,6}
           5 | E.            | {2,13}
    ...
    
  3. Train using linear CRF:
    SELECT lincrf_train( 'train_featuretbl',
                         'train_featureset',
                         'crf_label',
                         'crf_stats_tbl',
                         'crf_weights_tbl',
                         20
                 );
    
                                    lincrf_train
     -----------------------------------------------------------------------------------
     CRF Train successful. Results stored in the specified CRF stats and weights table
    
    View the feature weight table.
    SELECT * from crf_weights_tbl;
    
    Result:
      id |     name      | prev_label_id | label_id |      weight
     ----+---------------+---------------+----------+-------------------
       4 | W_lawson      |            -1 |       13 |  1.73698153439171
       3 | End.          |            -1 |       43 |   3.3198742329636
       7 | W_has         |            -1 |       31 |  2.19831004450897
      24 | W_tomorrow    |            -1 |       11 |  3.34106414300743
      29 | W_.           |            -1 |       43 |   3.3198742329636
      34 | W_from        |            -1 |        5 |  2.80284597986744
      37 | W_august      |            -1 |       13 |  1.34455487966976
      39 | W_due         |            -1 |        6 |  3.39258895715363
      41 | W_exchequer   |            -1 |       13 |  1.82177698489335
    ...
    
  4. To find the best labels for a test set using the trained linear-chain CRF model, repeat steps 1 and 2 to generate the test features, except that instead of creating a new dictionary, you use the dictionary generated from the training set.
    CREATE TABLE test_segmenttbl (start_pos integer,doc_id integer,seg_text text,max_pos integer);
    INSERT INTO test_segmenttbl VALUES
    (0,1,'chancellor',26),(1,1,'of',26),       (2,1,'the',26),      (3,1,'exchequer',26), (4,1,'nigel',26),
    (5,1,'lawson',26),    (6,1,'''s',26),      (7,1,'restated',26), (8,1,'commitment',26),(9,1,'to',26),
    (10,1,'a',26),        (11,1,'firm',26),    (12,1,'monetary',26),(13,1,'policy',26),   (14,1,'has',26),
    (15,1,'helped',26),   (16,1,'to',26),      (17,1,'prevent',26), (18,1,'a',26),        (19,1,'freefall',26),
    (20,1,'in',26),       (21,1,'sterling',26),(22,1,'over',26),    (23,1,'the',26),      (24,1,'past',26),
    (25,1,'week',26),     (26,1,'.',26),       (0,2,'but',28),      (1,2,'analysts',28),  (2,2,'reckon',28),
    (3,2,'underlying',28),(4,2,'support',28),  (5,2,'for',28),      (6,2,'sterling',28),  (7,2,'has',28),
    (8,2,'been',28),      (9,2,'eroded',28),   (10,2,'by',28),      (11,2,'the',28),      (12,2,'chancellor',28),
    (13,2,'''s',28),      (14,2,'failure',28), (15,2,'to',28),      (16,2,'announce',28), (17,2,'any',28),
    (18,2,'new',28),      (19,2,'policy',28),  (20,2,'measures',28),(21,2,'in',28),       (22,2,'his',28),
    (23,2,'mansion',28),  (24,2,'house',28),   (25,2,'speech',28),  (26,2,'last',28),     (27,2,'thursday',28),
    (28,2,'.',28),        (0,3,'his',4),       (1,3,'actions',4),   (2,3,'prevent',4),    (3,3,'disaster',4),
    (4,3,'.',4);
    
    SELECT crf_test_fgen( 'test_segmenttbl',
                          'crf_dictionary',
                          'crf_label',
                          'crf_regex',
                          'crf_weights_tbl',
                          'viterbi_mtbl',
                          'viterbi_rtbl'
                        );
    
  5. Calculate the best label sequence and save it in the table extracted_best_labels.
    SELECT vcrf_label( 'test_segmenttbl',
                       'viterbi_mtbl',
                       'viterbi_rtbl',
                       'crf_label',
                       'extracted_best_labels'
                     );
    
    View the best labels.
    SELECT * FROM extracted_best_labels;
    
    Result:
     doc_id | start_pos |   seg_text    | label | id | max_pos |   prob
     -------+-----------+---------------+-------+----+---------+----------
          1 |         4 | nigel         | NNP   | 13 |      26 | 0.387118
          1 |         5 | lawson        | NNP   | 13 |      26 | 0.387118
          1 |         7 | restated      | VBN   | 29 |      26 | 0.387118
          1 |         8 | commitment    | NN    | 11 |      26 | 0.387118
    ...
          3 |         0 | his           | NNP   | 13 |       4 | 0.047757
          3 |         2 | prevent       | JJ    |  6 |       4 | 0.047757
          3 |         4 | .             | .     | 43 |       4 | 0.047757
    ...
    

Technical Background

Specifically, a linear-chain CRF is a distribution defined by

\[ p_\lambda(\boldsymbol y \mid \boldsymbol x) = \frac{\exp\left(\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)\right)}{Z_\lambda(\boldsymbol x)} \,. \]

where

- \( F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i, y_{i-1}, x_i) \) is a global feature function, summed along the sequence \( \boldsymbol x \) of length \( n \)
- \( f_m(y_i, y_{i-1}, x_i) \) is a local feature function depending on the current label, the previous label, and the current input position
- \( \lambda_m \) is the weight of the m-th feature
- \( Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp\left(\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')\right) \) is an instance-specific normalizer

A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).

The log-likelihood is defined as

\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]

and the zero of its gradient

\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]

is found since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
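As a schematic of the quasi-Newton idea (standard BFGS/L-BFGS background, not MADlib-specific detail), each iteration moves the weights along a direction obtained by applying an approximate inverse Hessian \( H_k \) to the gradient,

\[ \lambda^{(k+1)} = \lambda^{(k)} + \eta_k H_k \nabla \ell_{\lambda^{(k)}}, \]

where L-BFGS builds \( H_k \) implicitly from only the most recent gradient and iterate differences rather than storing the full matrix, keeping memory usage linear in the number of features \( M \).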

\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm:

\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]

\[ Z_\lambda(x) = \alpha_n \cdot \boldsymbol 1^T \]

where \( M_i \) is the per-position matrix of exponentiated feature scores, \( M_i(y', y) = \exp\left(\sum_{m=1}^M \lambda_m f_m(y, y', x_i)\right) \), and \( \alpha_i \) and \( \beta_i \) are the forward and backward state cost vectors defined by

\[ \alpha_i = \begin{cases} \alpha_{i-1} M_i, & 0 < i \le n \\ 1, & i = 0 \end{cases} \]

\[ \beta_i^T = \begin{cases} M_{i+1} \beta_{i+1}^T, & 1 \le i < n \\ 1, & i = n \end{cases} \]

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:

\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]

\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]

Literature
[1] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields, http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf

[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field

[3] A. Jaiswal, S. Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/

[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html

[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm

[6] J. Nocedal (1980), Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35, pp. 773-782

[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html

Related Topics

Files crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)