A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special type of CRF that assumes the current state depends only on the previous state.
Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity recognition (NER). Currently, six feature types are implemented:
- Edge Feature: transition feature that encodes the label sequence
- Start Feature: fired when the current token is the first token in a sequence
- End Feature: fired when the current token is the last token in a sequence
- Word Feature: fired when the current token is observed in the trained dictionary
- Unknown Feature: fired when the current token is not observed in the trained dictionary at least a certain number of times
- Regex Feature: fired when the current token can be matched by a regular expression
A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
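To illustrate what Viterbi decoding computes, here is a minimal pure-Python sketch on a toy two-label model (the transition scores are made-up numbers, not MADlib output). It finds the best label sequence by dynamic programming and obtains its conditional probability by brute-force normalization:

```python
import itertools
import math

# Toy linear-chain model in log space: score(y) = sum_i M[(y[i-1], y[i])].
# Label -1 is a virtual start state. All numbers are arbitrary, for
# illustration only.
M = {(-1, 0): 1.0, (-1, 1): 0.2,
     (0, 0): 0.3, (0, 1): 1.1,
     (1, 0): 0.8, (1, 1): 0.1}
labels = [0, 1]
n = 3  # sequence length

def viterbi(n):
    # delta[y] = best log-score of any prefix ending in label y
    delta = {y: M[(-1, y)] for y in labels}
    back = []
    for _ in range(1, n):
        prev = delta
        delta, ptr = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: prev[yp] + M[(yp, y)])
            delta[y] = prev[best_prev] + M[(best_prev, y)]
            ptr[y] = best_prev
        back.append(ptr)
    # Trace back the best path from the best final label
    y_last = max(labels, key=lambda y: delta[y])
    path = [y_last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[y_last]

path, best_score = viterbi(n)

# Pr(best label sequence | sequence) = exp(score) / Z, where Z sums
# exp(score) over all label sequences (brute force on this tiny example).
def score(seq):
    return sum(M[(a, b)] for a, b in zip((-1,) + tuple(seq), seq))

Z = sum(math.exp(score(seq)) for seq in itertools.product(labels, repeat=n))
prob = math.exp(best_score) / Z
print(path, round(prob, 4))
```

In MADlib the equivalent computation runs inside the database over the `viterbi_mtbl` and `viterbi_rtbl` tables; this sketch only mirrors the recurrence.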
lincrf( source, sparse_R, dense_M, sparse_M, featureSize, tagSize, featureset, crf_feature, maxNumIterations )
Arguments
The features and weights are stored in the table named by crf_feature. This function returns a composite value containing the following columns:
coef | FLOAT8[]. Array of coefficients.
log_likelihood | FLOAT8. Log-likelihood.
num_iterations | INTEGER. The number of iterations before the algorithm terminated.
Generate text features, calculate their weights, and output the best label sequence for test data:
SELECT madlib.crf_train_data( '/path/to/data');
SELECT madlib.crf_train_fgen( 'segmenttbl', 'regextbl', 'dictionary', 'featuretbl', 'featureset');
SELECT madlib.lincrf( 'source', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', tag_size, 'feature_set', 'featureWeights', maxNumIterations);
SELECT madlib.crf_test_data( '/path/to/data');
SELECT madlib.crf_test_fgen( 'segmenttbl', 'dictionary', 'labeltbl', 'regextbl', 'featuretbl', 'viterbi_mtbl', 'viterbi_rtbl');
Note that 'viterbi_mtbl' and 'viterbi_rtbl' are simply text arguments naming tables created by the feature generation module (i.e. they are NOT empty tables).
SELECT madlib.vcrf_label( 'segmenttbl', 'viterbi_mtbl', 'viterbi_rtbl', 'labeltbl', 'resulttbl');
{TABLE|VIEW} labelTableName ( ... id INTEGER, label TEXT, ... )
where id is a unique ID for the label and label is the label name.
{TABLE|VIEW} regexTableName ( ... pattern TEXT, name TEXT, ... )
where pattern is a regular expression pattern (e.g. '^.+ing$') and name is a name for the regular expression pattern (e.g. 'endsWithIng').
{TABLE|VIEW} segmentTableName ( ... start_pos INTEGER, doc_id INTEGER, seg_text TEXT, label INTEGER, max_pos INTEGER, ... )
where start_pos is the position of the word in the sequence, doc_id is a unique ID for the sequence, seg_text is the word, label is the label for the word, and max_pos is the length of the sequence.
{TABLE|VIEW} featureTableName ( ... doc_id INTEGER, f_size INTEGER, sparse_r FLOAT8[], dense_m FLOAT8[], sparse_m FLOAT8[], ... )
where
The set of features used for training is expected to be of the following form (it can also be generated by crf_train_fgen):
{TABLE|VIEW} featureSetName ( ... f_index INTEGER, f_name TEXT, feature_labels INTEGER[], ... )
where
The empty feature weight table (which will be populated after training) is expected to be of the following form:
{TABLE|VIEW} featureWeightsName ( ... f_index INTEGER, f_name TEXT, previous_label INTEGER, label INTEGER, weight FLOAT8, ... )
SELECT * FROM crf_label;
Result:
 id | label
----+-------
  1 | CD
 13 | NNP
 15 | PDT
 17 | PRP
 29 | VBN
 31 | VBZ
 33 | WP
 35 | WRB
...
The regular expressions table:
SELECT * FROM crf_regex;
    pattern    |     name
---------------+---------------
 ^.+ing$       | endsWithIng
 ^[A-Z][a-z]+$ | InitCapital
 ^[A-Z]+$      | isAllCapital
 ^.*[0-9]+.*$  | containsDigit
...
The training segment table:
SELECT * FROM train_segmenttbl;
 start_pos | doc_id | seg_text  | label | max_pos
-----------+--------+-----------+-------+---------
         8 |      1 | alliance  |    11 |      26
        10 |      1 | Ford      |    13 |      26
        12 |      1 | that      |     5 |      26
        24 |      1 | likely    |     6 |      26
        26 |      1 | .         |    43 |      26
         8 |      2 | interest  |    11 |      10
        10 |      2 | .         |    43 |      10
         9 |      1 | after     |     5 |      26
        11 |      1 | concluded |    27 |      26
        23 |      1 | the       |     2 |      26
        25 |      1 | return    |    11 |      26
         9 |      2 | later     |    19 |      10
...
CREATE TABLE crf_dictionary(token text, total integer);
CREATE TABLE train_featuretbl(doc_id integer, f_size integer, sparse_r FLOAT8[], dense_m FLOAT8[], sparse_m FLOAT8[]);
CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
SELECT crf_train_fgen( 'train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl', 'train_featureset' );
SELECT * FROM crf_dictionary;
Result:
   token    | total
------------+-------
 talks      |     1
 that       |     1
 would      |     1
 alliance   |     1
 Saab       |     2
 cost       |     1
 after      |     1
 operations |     1
...
SELECT * FROM train_featuretbl;
Result:
 doc_id | f_size |           sparse_r            |             dense_m             |        sparse_m
--------+--------+-------------------------------+---------------------------------+------------------------
      2 |     87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
      1 |     87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}
SELECT * FROM train_featureset;
 f_index |    f_name    | feature
---------+--------------+---------
       1 | R_endsWithED | {-1,29}
      13 | W_outweigh   | {-1,26}
      29 | U            | {-1,5}
      31 | U            | {-1,29}
      33 | U            | {-1,12}
      35 | W_a          | {-1,2}
      37 | W_possible   | {-1,6}
      15 | W_signaled   | {-1,29}
      17 | End.         | {-1,43}
      49 | W_'s         | {-1,16}
      63 | W_acquire    | {-1,26}
      51 | E.           | {26,2}
      69 | E.           | {29,17}
      71 | E.           | {2,11}
      83 | W_the        | {-1,2}
      85 | E.           | {16,11}
       4 | W_return     | {-1,11}
...
CREATE TABLE train_crf_feature(id integer, name text, prev_label_id integer, label_id integer, weight float);
SELECT lincrf( 'train_featuretbl', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', 45, 'train_featureset', 'train_crf_feature', 20 );
 lincrf
--------
     20
View the feature weight table.
SELECT * FROM train_crf_feature;
Result:
 id |     name     | prev_label_id | label_id |      weight
----+--------------+---------------+----------+-------------------
  1 | R_endsWithED |            -1 |       29 |  1.54128249293937
 13 | W_outweigh   |            -1 |       26 |  1.70691232223653
 29 | U            |            -1 |        5 |  1.40708515869008
 31 | U            |            -1 |       29 | 0.830356200936407
 33 | U            |            -1 |       12 | 0.769587378281239
 35 | W_a          |            -1 |        2 |  2.68470625883726
 37 | W_possible   |            -1 |        6 |  3.41773107604468
 15 | W_signaled   |            -1 |       29 |  1.68187039165771
 17 | End.         |            -1 |       43 |  3.07687845517082
 49 | W_'s         |            -1 |       16 |  2.61430312229883
 63 | W_acquire    |            -1 |       26 |  1.67247047385797
 51 | E.           |            26 |        2 |   3.0114240119435
 69 | E.           |            29 |       17 |  2.82385531733866
 71 | E.           |             2 |       11 |  3.00970493772732
 83 | W_the        |            -1 |        2 |  2.58742315259326
...
SELECT * FROM test_segmenttbl;
Result:
 start_pos | doc_id |  seg_text   | max_pos
-----------+--------+-------------+---------
         1 |      1 | collapse    |      22
        13 |      1 | ,           |      22
        15 |      1 | is          |      22
        17 |      1 | a           |      22
         4 |      1 | speculation |      22
         6 |      1 | Ford        |      22
        18 |      1 | defensive   |      22
        20 |      1 | with        |      22
...
SELECT crf_test_fgen( 'test_segmenttbl', 'crf_dictionary', 'crf_label', 'crf_regex', 'train_crf_feature', 'viterbi_mtbl', 'viterbi_rtbl' );
Calculate the best label sequence; the results are stored in the table extracted_best_labels.
SELECT vcrf_label( 'test_segmenttbl', 'viterbi_mtbl', 'viterbi_rtbl', 'crf_label', 'extracted_best_labels' );
View the best labels.
SELECT * FROM extracted_best_labels;
Result:
 doc_id | start_pos |  seg_text   | label | id | prob
--------+-----------+-------------+-------+----+-------
      1 |         2 | Friday      | NNP   | 14 | 9e-06
      1 |         6 | Ford        | NNP   | 14 | 9e-06
      1 |        12 | Jaguar      | NNP   | 14 | 9e-06
      1 |         3 | prompted    | VBD   | 28 | 9e-06
      1 |         8 | intensify   | NN    | 12 | 9e-06
      1 |        14 | which       | NN    | 12 | 9e-06
      1 |        18 | defensive   | NN    | 12 | 9e-06
      1 |        21 | GM          | NN    | 12 | 9e-06
      1 |        22 | .           | .     | 44 | 9e-06
      1 |         1 | collapse    | CC    |  1 | 9e-06
      1 |         7 | would       | POS   | 17 | 9e-06
...
Specifically, a linear-chain CRF is a distribution defined by
\[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)}, \]
where
\[ Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')} \]
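To make the definition concrete, the following sketch evaluates \( p_\lambda(y|x) \) by brute force on a toy model (the feature functions and weights below are invented for illustration). The partition function \( Z_\lambda(x) \) sums over all label sequences, so the probabilities sum to one:

```python
import itertools
import math

labels = [0, 1]
x = ["w1", "w2", "w3"]  # toy observation sequence

# Global feature functions F_m(x, y): counts of a transition pattern and
# of an observation/label co-occurrence (made-up features for this sketch).
def F(x, y):
    f1 = sum(1 for a, b in zip(y, y[1:]) if (a, b) == (0, 1))  # 0->1 transitions
    f2 = sum(1 for xi, yi in zip(x, y) if yi == 1)             # label-1 count
    return (f1, f2)

lam = (0.5, 1.0)  # weights lambda_m

def unnorm(y):
    # exp(sum_m lambda_m F_m(x, y)), the unnormalized score
    return math.exp(sum(l * f for l, f in zip(lam, F(x, y))))

# Partition function Z_lambda(x): sum over all |labels|^n label sequences
Z = sum(unnorm(y) for y in itertools.product(labels, repeat=len(x)))

def p(y):
    return unnorm(y) / Z

# Sanity check: probabilities over all sequences sum to 1
total = sum(p(y) for y in itertools.product(labels, repeat=len(x)))
print(round(total, 10))
```

Brute-force enumeration is exponential in the sequence length; the forward-backward recursions described below compute the same \( Z_\lambda(x) \) in linear time.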
A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).
The log-likelihood is defined as
\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]
and its maximum is found where the gradient
\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]
vanishes, since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variant of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm:
\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]
\[ Z_\lambda(x) = \alpha_n \mathbf{1}^T \]
where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by
\[ \alpha_i = \begin{cases} \alpha_{i-1}M_i, & 0 < i \leq n\\ 1, & i = 0 \end{cases} \]
\[ \beta_i^T = \begin{cases} M_{i+1}\beta_{i+1}^T, & 1 \leq i < n\\ 1, & i = n \end{cases} \]
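These recursions can be checked numerically: for arbitrary positive transition matrices \( M_i \), the forward value \( \alpha_n \mathbf{1}^T \) and the corresponding backward value both equal the brute-force sum over all label paths. A toy sketch (the matrices are made-up, and this is unrelated to MADlib's in-database implementation):

```python
import itertools
import math

L = 2  # number of labels
# M_i[y'][y] = exp(score of label y at position i given previous label y');
# arbitrary positive entries for illustration.
Ms = [[[1.0, 2.0], [0.5, 1.5]],
      [[2.0, 0.3], [1.0, 1.0]],
      [[0.7, 1.2], [2.0, 0.4]]]
n = len(Ms)

# Forward recursion: alpha_0 = 1, alpha_i = alpha_{i-1} M_i
alpha = [1.0] * L
for M in Ms:
    alpha = [sum(alpha[yp] * M[yp][y] for yp in range(L)) for y in range(L)]
Z_forward = sum(alpha)  # Z = alpha_n . 1^T

# Backward recursion: beta_n = 1, beta_i^T = M_{i+1} beta_{i+1}^T
beta = [1.0] * L
for M in reversed(Ms):
    beta = [sum(M[y][yn] * beta[yn] for yn in range(L)) for y in range(L)]
Z_backward = sum(beta)  # 1 . beta_0^T, the same Z

# Brute force: product of matrix entries along each label path, summed
# over all (n+1)-long paths (the start label is marginalized like alpha_0)
Z_brute = sum(
    math.prod(Ms[i][seq[i]][seq[i + 1]] for i in range(n))
    for seq in itertools.product(range(L), repeat=n + 1)
)
print(Z_forward, Z_backward, Z_brute)
```

The recursions cost \( O(n L^2) \) versus \( O(L^n) \) for enumeration, which is why the expectation \( E_{p_\lambda(Y|x)}[F(x,Y)] \) is computed this way.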
To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:
\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]
\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]
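The prior's contribution to the gradient, \( -\lambda/\sigma^2 \), can be verified with a finite-difference check (a toy sketch with arbitrary values for \( \lambda \) and \( \sigma \)):

```python
# Finite-difference check that the Gaussian prior term
# -||lambda||^2 / (2 sigma^2) has gradient -lambda / sigma^2.
sigma = 2.0
lam = [1.5, -0.3, 0.7]  # arbitrary weight vector

def penalty(l):
    return -sum(x * x for x in l) / (2 * sigma ** 2)

eps = 1e-6
num_grad = []
for i in range(len(lam)):
    hi, lo = lam[:], lam[:]
    hi[i] += eps
    lo[i] -= eps
    # central difference approximation of the partial derivative
    num_grad.append((penalty(hi) - penalty(lo)) / (2 * eps))

analytic = [-x / sigma ** 2 for x in lam]
print([round(g, 6) for g in num_grad], [round(g, 6) for g in analytic])
```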
[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field
[3] A. Jaiswal, S.Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/
[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html
[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm
[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782
[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html
Files crf.sql_in, crf_feature_gen.sql_in, and viterbi.sql_in (documenting the SQL functions)