MADlib
1.0 A newer version is available
User Documentation
|
Specifically, a linear-chain CRF is a distribution defined by
\[ p_\lambda(\boldsymbol y | \boldsymbol x) = \frac{\exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)}}{Z_\lambda(\boldsymbol x)} \,. \]
where
\[ Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp{\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')} \]
A linear-chain CRF estimates the weights \( \lambda_m \) by maximizing the log-likelihood of a given training set \( T=\{(x_k,y_k)\}_{k=1}^N \).
The log-likelihood is defined as
\[ \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] \]
and the zero of its gradient
\[ \nabla \ell_{\lambda}=\sum_k[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] \]
is found since the maximum likelihood is reached when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
\(E_{p_\lambda(Y|x)}[F(x,Y)]\) is found by using a variant of the forward-backward algorithm:
\[ E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y) = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)} \]
\[ Z_\lambda(x) = \alpha_n.1^T \]
where \(\alpha_i\) and \( \beta_i\) are the forward and backward state cost vectors defined by
\[ \alpha_i = \begin{cases} \alpha_{i-1}M_i, & 0<i<=n\\ 1, & i=0 \end{cases}\\ \]
\[ \beta_i^T = \begin{cases} M_{i+1}\beta_{i+1}^T, & 1<=i<n\\ 1, & i=n \end{cases} \]
To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:
\[ \ell_{\lambda}^\prime=\sum_k[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2} \]
\[ \nabla \ell_{\lambda}^\prime=\sum_k[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]] - \frac{\lambda}{\sigma ^2} \]
Feature extraction modules are provided for text-analysis tasks such as part-of-speech (POS) tagging and named-entity resolution (NER). Currently, six feature types are implemented:
A Viterbi implementation is also provided to get the best label sequence and the conditional probability \( \Pr( \text{best label sequence} \mid \text{sequence}) \).
For a full example of how to use the MADlib CRF modules for a text analytics application, see the "Example" section below.
{TABLE|VIEW} labelTableName ( ... id INTEGER, label TEXT, ... )where id is a unique ID for the label and label is the label name.
{TABLE|VIEW} regexTableName ( ... pattern TEXT, name TEXT, ... )where pattern is a regular expression pattern (e.g. '^.+ing$') and name is a name for the regular expression pattern (e.g. 'endsWithIng').
{TABLE|VIEW} segmentTableName ( ... start_pos INTEGER, doc_id INTEGER, seg_text TEXT, label INTEGER, max_pos INTEGER, ... )where start_pos is the position of the word in the sequence, doc_id is a unique ID for the sequence, seg_text is the word, label is the label for the word, and max_pos is the length of the sequence.
{TABLE|VIEW} featureTableName ( ... doc_id INTEGER, f_size INTEGER, sparse_r FLOAT8[], dense_m FLOAT8[], sparse_m FLOAT8[], ... )where
The set of features used for training is expected to be of the following form (also can be generated by crf_train_fgen):
{TABLE|VIEW} featureSetName ( ... f_index INTEGER, f_name TEXT, feature_labels INTEGER[], ... )
where
The empty feature weight table (which will be populated after training) is expected to be of the following form:
{TABLE|VIEW} featureWeightsName ( ... f_index INTEGER, f_name TEXT, previous_label INTEGER, label INTEGER, weight FLOAT8, ... )
Get number of iterations and weights for features:
SELECT * FROM lincrf( 'featureTableName', 'sparse_r', 'dense_m','sparse_m', 'f_size', tag_size, 'feature_set', 'featureWeightsName' [, maxNumberOfIterations ] ] );
where tag_size is the total number of labels.
Output:
lincrf ----------------- [number of iterations]
featureWeightsName:
id | name | prev_label_id | label_id | weight ----+----------------+---------------+----------+-------------------
SELECT madlib.crf_train_data( '/path/to/data');
SELECT madlib.crf_train_fgen( 'segmenttbl', 'regextbl', 'dictionary', 'featuretbl', 'featureset');
SELECT madlib.lincrf( 'source', 'sparse_r', 'dense_m', 'sparse_m', 'f_size', tag_size, 'feature_set', 'featureWeights', 'maxNumIterations');
SELECT madlib.crf_test_data( '/path/to/data');
SELECT madlib.crf_test_fgen( 'segmenttbl', 'dictionary', 'labeltbl', 'regextbl', 'featuretbl', 'viterbi_mtbl', 'viterbi_rtbl');'viterbi_mtbl' and 'viterbi_rtbl' are simply text representing names for tables created in the feature generation module (i.e. they are NOT empty tables).
SELECT madlib.vcrf_label( 'segmenttbl', 'viterbi_mtbl', 'viterbi_rtbl', 'labeltbl', 'resulttbl');
sql> SELECT * FROM crf_label; id | label ----+------- 1 | CD 13 | NNP 15 | PDT 17 | PRP 29 | VBN 31 | VBZ 33 | WP 35 | WRB ... sql> SELECT * from crf_regex; pattern | name ---------------+---------------------- ^.+ing$ | endsWithIng ^[A-Z][a-z]+$ | InitCapital ^[A-Z]+$ | isAllCapital ^.*[0-9]+.*$ | containsDigit ... sql> SELECT * from train_segmenttbl; start_pos | doc_id | seg_text | label | max_pos -----------+--------+------------+-------+--------- 8 | 1 | alliance | 11 | 26 10 | 1 | Ford | 13 | 26 12 | 1 | that | 5 | 26 24 | 1 | likely | 6 | 26 26 | 1 | . | 43 | 26 8 | 2 | interest | 11 | 10 10 | 2 | . | 43 | 10 9 | 1 | after | 5 | 26 11 | 1 | concluded | 27 | 26 23 | 1 | the | 2 | 26 25 | 1 | return | 11 | 26 9 | 2 | later | 19 | 10 ...
sql> CREATE TABLE crf_dictionary(token text,total integer); sql> CREATE TABLE train_featuretbl(doc_id integer,f_size FLOAT8,sparse_r FLOAT8[],dense_m FLOAT8[],sparse_m FLOAT8[]); sql> CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
sql> SELECT crf_train_fgen('train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl','train_featureset'); sql> SELECT * from crf_dictionary; token | total ------------+------- talks | 1 that | 1 would | 1 alliance | 1 Saab | 2 cost | 1 after | 1 operations | 1 ... sql> SELECT * from train_featuretbl; doc_id | f_size | sparse_r | dense_m | sparse_m --------+--------+-------------------------------+---------------------------------+----------------------- 2 | 87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...} 1 | 87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...} sql> SELECT * from train_featureset; f_index | f_name | feature ---------+---------------+--------- 1 | R_endsWithED | {-1,29} 13 | W_outweigh | {-1,26} 29 | U | {-1,5} 31 | U | {-1,29} 33 | U | {-1,12} 35 | W_a | {-1,2} 37 | W_possible | {-1,6} 15 | W_signaled | {-1,29} 17 | End. | {-1,43} 49 | W_'s | {-1,16} 63 | W_acquire | {-1,26} 51 | E. | {26,2} 69 | E. | {29,17} 71 | E. | {2,11} 83 | W_the | {-1,2} 85 | E. | {16,11} 4 | W_return | {-1,11} ...
sql> CREATE TABLE train_crf_feature (id integer,name text,prev_label_id integer,label_id integer,weight float);
sql> SELECT lincrf('train_featuretbl','sparse_r','dense_m','sparse_m','f_size',45, 'train_featureset','train_crf_feature', 20); lincrf -------- 20 sql> SELECT * from train_crf_feature; id | name | prev_label_id | label_id | weight ----+---------------+---------------+----------+------------------- 1 | R_endsWithED | -1 | 29 | 1.54128249293937 13 | W_outweigh | -1 | 26 | 1.70691232223653 29 | U | -1 | 5 | 1.40708515869008 31 | U | -1 | 29 | 0.830356200936407 33 | U | -1 | 12 | 0.769587378281239 35 | W_a | -1 | 2 | 2.68470625883726 37 | W_possible | -1 | 6 | 3.41773107604468 15 | W_signaled | -1 | 29 | 1.68187039165771 17 | End. | -1 | 43 | 3.07687845517082 49 | W_'s | -1 | 16 | 2.61430312229883 63 | W_acquire | -1 | 26 | 1.67247047385797 51 | E. | 26 | 2 | 3.0114240119435 69 | E. | 29 | 17 | 2.82385531733866 71 | E. | 2 | 11 | 3.00970493772732 83 | W_the | -1 | 2 | 2.58742315259326 ...
sql> SELECT * from test_segmenttbl; start_pos | doc_id | seg_text | max_pos -----------+--------+-------------+--------- 1 | 1 | collapse | 22 13 | 1 | , | 22 15 | 1 | is | 22 17 | 1 | a | 22 4 | 1 | speculation | 22 6 | 1 | Ford | 22 18 | 1 | defensive | 22 20 | 1 | with | 22 ... sql> SELECT crf_test_fgen('test_segmenttbl','crf_dictionary','crf_label','crf_regex','train_crf_feature','viterbi_mtbl','viterbi_rtbl');
sql> SELECT vcrf_label('test_segmenttbl','viterbi_mtbl','viterbi_rtbl','crf_label','extracted_best_labels'); sql> SELECT * FROM extracted_best_labels; doc_id | start_pos | seg_text | label | id | prob --------+-----------+-------------+-------+----+------- 1 | 2 | Friday | NNP | 14 | 9e-06 1 | 6 | Ford | NNP | 14 | 9e-06 1 | 12 | Jaguar | NNP | 14 | 9e-06 1 | 3 | prompted | VBD | 28 | 9e-06 1 | 8 | intensify | NN | 12 | 9e-06 1 | 14 | which | NN | 12 | 9e-06 1 | 18 | defensive | NN | 12 | 9e-06 1 | 21 | GM | NN | 12 | 9e-06 1 | 22 | . | . | 44 | 9e-06 1 | 1 | collapse | CC | 1 | 9e-06 1 | 7 | would | POS | 17 | 9e-06 ...(Note that this example was done on a trivial training and test data set.)
[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field
[3] A. Jaiswal, S.Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/
[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html
[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm
[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782
[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html