/* ----------------------------------------------------------------------- *//**
 *
 * @file crf.sql_in
 *
 * @brief SQL functions for conditional random field
 * @date July 2012
 *
 * @sa For a brief introduction to conditional random fields, see the
 * module description \ref grp_crf.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4')

/**
@addtogroup grp_crf

\warning <em> This MADlib method is still in early stage development. There may be some
issues that will be addressed in a future version. Interface and implementation
are subject to change. </em>

@about
A conditional random field (CRF) is a type of discriminative, undirected probabilistic graphical model. A linear-chain CRF is a special
type of CRF that assumes the current state depends only on the previous state.

Specifically, a linear-chain CRF is a distribution defined by
\f[
    p_\lambda(\boldsymbol y | \boldsymbol x) =
        \frac{\exp\left(\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y)\right)}{Z_\lambda(\boldsymbol x)}
    \,,
\f]

where
- \f$ F_m(\boldsymbol x, \boldsymbol y) = \sum_{i=1}^n f_m(y_i,y_{i-1},x_i) \f$ is a global feature function that is a sum along a sequence
  \f$ \boldsymbol x \f$ of length \f$ n \f$
- \f$ f_m(y_i,y_{i-1},x_i) \f$ is a local feature function dependent on the current token label \f$ y_i \f$, the previous token label \f$ y_{i-1} \f$,
  and the observation \f$ x_i \f$
- \f$ \lambda_m \f$ is the corresponding feature weight
- \f$ Z_\lambda(\boldsymbol x) \f$ is an instance-specific normalizer
\f[
Z_\lambda(\boldsymbol x) = \sum_{\boldsymbol y'} \exp\left(\sum_{m=1}^M \lambda_m F_m(\boldsymbol x, \boldsymbol y')\right)
\f]
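
For instance, for a sequence of two tokens with label set \f$ \{1,2\} \f$ (a worked illustration only; \f$ y'_0 \f$ denotes the start state), the normalizer expands to a sum over the four possible label sequences:
\f[
Z_\lambda(\boldsymbol x) = \sum_{y'_1 \in \{1,2\}} \sum_{y'_2 \in \{1,2\}} \exp\left(\sum_{m=1}^M \lambda_m \left[f_m(y'_1, y'_0, x_1) + f_m(y'_2, y'_1, x_2)\right]\right)
\f]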

A linear-chain CRF estimates the weights \f$ \lambda_m \f$ by maximizing the log-likelihood
of a given training set \f$ T=\{(x_k,y_k)\}_{k=1}^N \f$.

The log-likelihood is defined as
\f[
    \ell_{\lambda}=\sum_k \log p_\lambda(y_k|x_k) =\sum_k\left[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)\right]
\f]

and a zero of its gradient
\f[
    \nabla \ell_{\lambda}=\sum_k\left[F(x_k,y_k)-E_{p_\lambda(Y|x_k)}[F(x_k,Y)]\right]
\f]

is sought, since the likelihood is maximized when the empirical average of the global feature vector equals its model expectation. The MADlib implementation uses limited-memory BFGS (L-BFGS), a limited-memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, a quasi-Newton method for unconstrained optimization.
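
Schematically, each quasi-Newton iteration moves the weights along an approximate Newton direction (this is the generic update, shown for orientation; the step size \f$ \alpha_t \f$ and the low-rank inverse-Hessian approximation \f$ H_t \f$ are maintained internally by the L-BFGS solver, see [6]):
\f[
    \lambda^{(t+1)} = \lambda^{(t)} + \alpha_t H_t \, \nabla \ell_{\lambda^{(t)}}
\f]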

\f$E_{p_\lambda(Y|x)}[F(x,Y)]\f$ is found by using a variant of the forward-backward algorithm:
\f[
    E_{p_\lambda(Y|x)}[F(x,Y)] = \sum_y p_\lambda(y|x)F(x,y)
    = \sum_i\frac{\alpha_{i-1}(f_i*M_i)\beta_i^T}{Z_\lambda(x)}
\f]
\f[
    Z_\lambda(x) = \alpha_n \cdot \boldsymbol 1^T
\f]
where \f$\alpha_i\f$ and \f$ \beta_i\f$ are the forward and backward state cost vectors defined by
\f[
    \alpha_i =
    \begin{cases}
    \alpha_{i-1}M_i, & 0 < i \le n\\
    1, & i=0
    \end{cases}
\f]
\f[
    \beta_i^T =
    \begin{cases}
    M_{i+1}\beta_{i+1}^T, & 1 \le i < n\\
    1, & i=n
    \end{cases}
\f]
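
Here \f$ M_i \f$ is the per-position transition cost matrix induced by the local feature functions; following [1], its entries are
\f[
    M_i(y', y) = \exp\left(\sum_{m=1}^M \lambda_m f_m(y, y', x_i)\right)
\f]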

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior:
\f[
    \ell_{\lambda}^\prime=\sum_k\left[\sum_{m=1}^M \lambda_m F_m(x_k,y_k) - \log Z_\lambda(x_k)\right] - \frac{\lVert \lambda \rVert^2}{2\sigma ^2}
\f]

\f[
    \nabla \ell_{\lambda}^\prime=\sum_k\left[F(x_k,y_k) - E_{p_\lambda(Y|x_k)}[F(x_k,Y)]\right] - \frac{\lambda}{\sigma ^2}
\f]

Feature extraction modules are provided for text-analysis
tasks such as part-of-speech (POS) tagging and named-entity recognition (NER). Currently, six feature types are implemented:
- Edge Feature: a transition feature that encodes the transition weight from the current label to the next label.
- Start Feature: fired when the current token is the first token in a sequence.
- End Feature: fired when the current token is the last token in a sequence.
- Word Feature: fired when the current token is observed in the training
dictionary.
- Unknown Feature: fired when the current token has not been observed in the training
dictionary at least a certain number of times (default 1).
- Regex Feature: fired when the current token can be matched by a regular
expression (see the sketch after this list).
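
For instance, assuming a regular expression table like the one in the example section below, adding a pattern row such as
<pre>INSERT INTO crf_regex VALUES ('^.+ing$', 'endsWithIng');</pre>
causes a regex feature (following the <tt>R_</tt> naming convention visible in the example's feature set, it would appear as <tt>R_endsWithIng</tt>) to fire on every token ending in "ing".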

A Viterbi implementation is also provided
to get the best label sequence and the conditional probability
\f$ \Pr( \text{best label sequence} \mid \text{sequence}) \f$.

For a full example of how to use the MADlib CRF modules for a text analytics application, see the "Example" section below.

@input
- User-provided input:\n
The user is expected to provide at least the label table, the regular expression table, and the segment table (a minimal DDL sketch follows at the end of this section):
<pre>{TABLE|VIEW} <em>labelTableName</em> (
    ...
    <em>id</em> INTEGER,
    <em>label</em> TEXT,
    ...
)</pre>
where <em>id</em> is a unique ID for the label and <em>label</em> is the label name.
<pre>{TABLE|VIEW} <em>regexTableName</em> (
    ...
    <em>pattern</em> TEXT,
    <em>name</em> TEXT,
    ...
)</pre>
where <em>pattern</em> is a regular expression pattern (e.g. '^.+ing$') and <em>name</em> is a name for the regular expression pattern (e.g. 'endsWithIng').
<pre>{TABLE|VIEW} <em>segmentTableName</em> (
    ...
    <em>start_pos</em> INTEGER,
    <em>doc_id</em> INTEGER,
    <em>seg_text</em> TEXT,
    <em>label</em> INTEGER,
    <em>max_pos</em> INTEGER,
    ...
)</pre>
where <em>start_pos</em> is the position of the word in the sequence, <em>doc_id</em> is a unique ID for the sequence, <em>seg_text</em> is the word, <em>label</em> is the label for the word, and <em>max_pos</em> is the length of the sequence.

- Training (\ref lincrf) input:\n
The feature table used for training is expected to be of the following form (this table can also be generated by \ref crf_train_fgen):\n
<pre>{TABLE|VIEW} <em>featureTableName</em> (
    ...
    <em>doc_id</em> INTEGER,
    <em>f_size</em> INTEGER,
    <em>sparse_r</em> FLOAT8[],
    <em>dense_m</em> FLOAT8[],
    <em>sparse_m</em> FLOAT8[],
    ...
)</pre>
where
  - <em>doc_id</em> is a unique ID for the sequence
  - <em>f_size</em> is the number of features
  - <em>sparse_r</em> is the array union of (previous label, label, feature index, start position, training existence indicator) of individual single-state features (e.g. word features, regex features) ordered by their start position
  - <em>dense_m</em> is the array union of (previous label, label, feature index, start position, training existence indicator) of edge features ordered by start position
  - <em>sparse_m</em> is the array union of (feature index, previous label, label) of edge features ordered by feature index.
Edge features are split into dense_m and sparse_m for performance reasons.

The set of features used for training is expected to be of the following form (this table can also be generated by \ref crf_train_fgen):\n
<pre>{TABLE|VIEW} <em>featureSetName</em> (
    ...
    <em>f_index</em> INTEGER,
    <em>f_name</em> TEXT,
    <em>feature_labels</em> INTEGER[],
    ...
)</pre>
where
  - <em>f_index</em> is a unique ID for the feature
  - <em>f_name</em> is the feature name
  - <em>feature_labels</em> is an array representing {previous label, label}.

The empty feature weight table (which will be populated after training) is expected to be of the following form:
<pre>{TABLE|VIEW} <em>featureWeightsName</em> (
    ...
    <em>f_index</em> INTEGER,
    <em>f_name</em> TEXT,
    <em>previous_label</em> INTEGER,
    <em>label</em> INTEGER,
    <em>weight</em> FLOAT8,
    ...
)</pre>
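
As a concrete sketch, minimal versions of the three user-provided tables (using the table names that appear in the example section below; additional columns are allowed) could be created as:
<pre>CREATE TABLE crf_label (id INTEGER, label TEXT);
CREATE TABLE crf_regex (pattern TEXT, name TEXT);
CREATE TABLE train_segmenttbl (start_pos INTEGER, doc_id INTEGER, seg_text TEXT, label INTEGER, max_pos INTEGER);</pre>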

@usage
- Get number of iterations and weights for features:\n
  <pre>SELECT * FROM \ref lincrf(
    '<em>featureTableName</em>', '<em>sparse_r</em>', '<em>dense_m</em>','<em>sparse_m</em>', '<em>f_size</em>', <em>tag_size</em>, '<em>feature_set</em>', '<em>featureWeightsName</em>'
    [, <em>maxNumberOfIterations</em> ]
);</pre>
  where tag_size is the total number of labels.

  Output:
<pre> lincrf
-----------------
 [number of iterations]</pre>

  <em>featureWeightsName</em>:
<pre> id |      name      | prev_label_id | label_id |      weight
----+----------------+---------------+----------+-------------------
</pre>

- Generate text features, calculate their weights, and output the best label sequence for test data:\n
 -# Create tables to store the input data, intermediate data, and output data.
    Also import the training data to the database.
    <pre>SELECT madlib.crf_train_data(
         '<em>/path/to/data</em>');</pre>
 -# Generate text analytics features for the training data.
    <pre>SELECT madlib.crf_train_fgen(
         '<em>segmenttbl</em>',
         '<em>regextbl</em>',
         '<em>dictionary</em>',
         '<em>featuretbl</em>',
         '<em>featureset</em>');</pre>
 -# Use linear-chain CRF for training.
    <pre>SELECT madlib.lincrf(
         '<em>source</em>',
         '<em>sparse_r</em>',
         '<em>dense_m</em>',
         '<em>sparse_m</em>',
         '<em>f_size</em>',
         <em>tag_size</em>,
         '<em>feature_set</em>',
         '<em>featureWeights</em>',
         <em>maxNumIterations</em>);</pre>
 -# Import the CRF model to the database.
    Also load the CRF testing data to the database.
    <pre>SELECT madlib.crf_test_data(
         '<em>/path/to/data</em>');</pre>
 -# Generate text analytics features for the testing data.
    <pre>SELECT madlib.crf_test_fgen(
         '<em>segmenttbl</em>',
         '<em>dictionary</em>',
         '<em>labeltbl</em>',
         '<em>regextbl</em>',
         '<em>featuretbl</em>',
         '<em>viterbi_mtbl</em>',
         '<em>viterbi_rtbl</em>');</pre>
    'viterbi_mtbl' and 'viterbi_rtbl' are text names for tables that are created by the feature generation module (i.e. they are NOT pre-existing empty tables).
 -# Run the Viterbi function to get the best label sequence and the conditional
    probability \f$ \Pr( \text{best label sequence} \mid \text{sequence}) \f$.
    <pre>SELECT madlib.vcrf_label(
         '<em>segmenttbl</em>',
         '<em>viterbi_mtbl</em>',
         '<em>viterbi_rtbl</em>',
         '<em>labeltbl</em>',
         '<em>resulttbl</em>');</pre>

@examp
-# Load the label table, the regular expressions table, and the training segment table:
@verbatim
sql> SELECT * FROM crf_label;
 id | label
----+-------
  1 | CD
 13 | NNP
 15 | PDT
 17 | PRP
 29 | VBN
 31 | VBZ
 33 | WP
 35 | WRB
...

sql> SELECT * from crf_regex;
    pattern    |     name
---------------+---------------
 ^.+ing$       | endsWithIng
 ^[A-Z][a-z]+$ | InitCapital
 ^[A-Z]+$      | isAllCapital
 ^.*[0-9]+.*$  | containsDigit
...

sql> SELECT * from train_segmenttbl;
 start_pos | doc_id |  seg_text  | label | max_pos
-----------+--------+------------+-------+---------
         8 |      1 | alliance   |    11 |      26
        10 |      1 | Ford       |    13 |      26
        12 |      1 | that       |     5 |      26
        24 |      1 | likely     |     6 |      26
        26 |      1 | .          |    43 |      26
         8 |      2 | interest   |    11 |      10
        10 |      2 | .          |    43 |      10
         9 |      1 | after      |     5 |      26
        11 |      1 | concluded  |    27 |      26
        23 |      1 | the        |     2 |      26
        25 |      1 | return     |    11 |      26
         9 |      2 | later      |    19 |      10
...
@endverbatim
-# Create the (empty) dictionary table, feature table, and feature set:
@verbatim
sql> CREATE TABLE crf_dictionary(token text, total integer);
sql> CREATE TABLE train_featuretbl(doc_id integer, f_size FLOAT8, sparse_r FLOAT8[], dense_m FLOAT8[], sparse_m FLOAT8[]);
sql> CREATE TABLE train_featureset(f_index integer, f_name text, feature integer[]);
@endverbatim
-# Generate the training features:
@verbatim
sql> SELECT crf_train_fgen('train_segmenttbl', 'crf_regex', 'crf_dictionary', 'train_featuretbl', 'train_featureset');

sql> SELECT * from crf_dictionary;
   token    | total
------------+-------
 talks      |     1
 that       |     1
 would      |     1
 alliance   |     1
 Saab       |     2
 cost       |     1
 after      |     1
 operations |     1
...

sql> SELECT * from train_featuretbl;
 doc_id | f_size |           sparse_r            |             dense_m             |       sparse_m
--------+--------+-------------------------------+---------------------------------+-----------------------
      2 |     87 | {-1,13,12,0,1,-1,13,9,0,1,..} | {13,31,79,1,1,31,29,70,2,1,...} | {51,26,2,69,29,17,...}
      1 |     87 | {-1,13,0,0,1,-1,13,9,0,1,...} | {13,0,62,1,1,0,13,54,2,1,13,..} | {51,26,2,69,29,17,...}

sql> SELECT * from train_featureset;
 f_index |    f_name     | feature
---------+---------------+---------
       1 | R_endsWithED  | {-1,29}
      13 | W_outweigh    | {-1,26}
      29 | U             | {-1,5}
      31 | U             | {-1,29}
      33 | U             | {-1,12}
      35 | W_a           | {-1,2}
      37 | W_possible    | {-1,6}
      15 | W_signaled    | {-1,29}
      17 | End.          | {-1,43}
      49 | W_'s          | {-1,16}
      63 | W_acquire     | {-1,26}
      51 | E.            | {26,2}
      69 | E.            | {29,17}
      71 | E.            | {2,11}
      83 | W_the         | {-1,2}
      85 | E.            | {16,11}
       4 | W_return      | {-1,11}
...
@endverbatim
-# Create the (empty) feature weight table:
@verbatim
sql> CREATE TABLE train_crf_feature (id integer, name text, prev_label_id integer, label_id integer, weight float);
@endverbatim
-# Train using linear-chain CRF:
@verbatim
sql> SELECT lincrf('train_featuretbl','sparse_r','dense_m','sparse_m','f_size',45, 'train_featureset','train_crf_feature', 20);
 lincrf
--------
     20

sql> SELECT * from train_crf_feature;
 id |     name      | prev_label_id | label_id |      weight
----+---------------+---------------+----------+-------------------
  1 | R_endsWithED  |            -1 |       29 |  1.54128249293937
 13 | W_outweigh    |            -1 |       26 |  1.70691232223653
 29 | U             |            -1 |        5 |  1.40708515869008
 31 | U             |            -1 |       29 | 0.830356200936407
 33 | U             |            -1 |       12 | 0.769587378281239
 35 | W_a           |            -1 |        2 |  2.68470625883726
 37 | W_possible    |            -1 |        6 |  3.41773107604468
 15 | W_signaled    |            -1 |       29 |  1.68187039165771
 17 | End.          |            -1 |       43 |  3.07687845517082
 49 | W_'s          |            -1 |       16 |  2.61430312229883
 63 | W_acquire     |            -1 |       26 |  1.67247047385797
 51 | E.            |            26 |        2 |   3.0114240119435
 69 | E.            |            29 |       17 |  2.82385531733866
 71 | E.            |             2 |       11 |  3.00970493772732
 83 | W_the         |            -1 |        2 |  2.58742315259326
...
@endverbatim
-# To find the best labels for a test set using the trained linear-chain CRF model, repeat steps 1 and 2 and generate the test features, except instead of creating a new dictionary, use the dictionary generated from the training set.
@verbatim
sql> SELECT * from test_segmenttbl;
 start_pos | doc_id |  seg_text   | max_pos
-----------+--------+-------------+---------
         1 |      1 | collapse    |      22
        13 |      1 | ,           |      22
        15 |      1 | is          |      22
        17 |      1 | a           |      22
         4 |      1 | speculation |      22
         6 |      1 | Ford        |      22
        18 |      1 | defensive   |      22
        20 |      1 | with        |      22
...

sql> SELECT crf_test_fgen('test_segmenttbl','crf_dictionary','crf_label','crf_regex','train_crf_feature','viterbi_mtbl','viterbi_rtbl');
@endverbatim
-# Calculate the best label sequence:
@verbatim
sql> SELECT vcrf_label('test_segmenttbl','viterbi_mtbl','viterbi_rtbl','crf_label','extracted_best_labels');

sql> SELECT * FROM extracted_best_labels;
 doc_id | start_pos |  seg_text   | label | id | prob
--------+-----------+-------------+-------+----+-------
      1 |         2 | Friday      | NNP   | 14 | 9e-06
      1 |         6 | Ford        | NNP   | 14 | 9e-06
      1 |        12 | Jaguar      | NNP   | 14 | 9e-06
      1 |         3 | prompted    | VBD   | 28 | 9e-06
      1 |         8 | intensify   | NN    | 12 | 9e-06
      1 |        14 | which       | NN    | 12 | 9e-06
      1 |        18 | defensive   | NN    | 12 | 9e-06
      1 |        21 | GM          | NN    | 12 | 9e-06
      1 |        22 | .           | .     | 44 | 9e-06
      1 |         1 | collapse    | CC    |  1 | 9e-06
      1 |         7 | would       | POS   | 17 | 9e-06
...
@endverbatim
(Note that this example was run on a trivial training and test data set.)

@literature
[1] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields, http://www-bcf.usc.edu/~feisha/pubs/shallow03.pdf

[2] Wikipedia, Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field

[3] A. Jaiswal, S. Tawari, I. Mansuri, K. Mittal, C. Tiwari (2012), CRF, http://crf.sourceforge.net/

[4] D. Wang, ViterbiCRF, http://www.cs.berkeley.edu/~daisyw/ViterbiCRF.html

[5] Wikipedia, Viterbi Algorithm, http://en.wikipedia.org/wiki/Viterbi_algorithm

[6] J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782

[7] J. Nocedal, Software for Large-scale Unconstrained Optimization, http://users.eecs.northwestern.edu/~nocedal/lbfgs.html

@sa File crf.sql_in, crf_feature_gen.sql_in, viterbi.sql_in (documenting the SQL functions)

*/

DROP TYPE IF EXISTS MADLIB_SCHEMA.lincrf_result;
CREATE TYPE MADLIB_SCHEMA.lincrf_result AS (
    coef DOUBLE PRECISION[],
    log_likelihood DOUBLE PRECISION,
    num_iterations INTEGER
);

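-- Transition function for the lincrf_lbfgs_step aggregate: folds one training
-- sequence (its sparse_r, dense_m, and sparse_m arrays, plus feature and tag
-- sizes) into the running L-BFGS iteration state.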
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.lincrf_lbfgs_step_transition(
    DOUBLE PRECISION[],
    DOUBLE PRECISION[],
    DOUBLE PRECISION[],
    DOUBLE PRECISION[],
    DOUBLE PRECISION,
    DOUBLE PRECISION,
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE;

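-- Merge function: combines two partial aggregation states, so the aggregate
-- can run in parallel across segments (used as prefunc on Greenplum).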
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.lincrf_lbfgs_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

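-- Final function: produces the updated L-BFGS state for this iteration from
-- the accumulated aggregate state.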
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.lincrf_lbfgs_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;

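-- Returns a scalar convergence measure computed from the iteration state;
-- the iterative driver uses it (together with the iteration limit) to decide
-- when to stop.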
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_lincrf_lbfgs_converge(
    /*+ state */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME'
LANGUAGE c IMMUTABLE STRICT;

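-- Extracts a lincrf_result (coefficients, log-likelihood, number of
-- iterations) from an iteration state.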
CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_lincrf_lbfgs_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.lincrf_result AS
'MODULE_PATHNAME'
LANGUAGE c IMMUTABLE STRICT;

/**
 * @internal
 * @brief Perform one iteration of the L-BFGS method for computing
 * conditional random field
 */
CREATE AGGREGATE MADLIB_SCHEMA.lincrf_lbfgs_step(
    /* sparse_r columns */ DOUBLE PRECISION[],
    /* dense_m columns */ DOUBLE PRECISION[],
    /* sparse_m columns */ DOUBLE PRECISION[],
    /* feature size */ DOUBLE PRECISION,
    /* tag size */ DOUBLE PRECISION,
    /* previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.lincrf_lbfgs_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.lincrf_lbfgs_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.lincrf_lbfgs_step_final,
    INITCOND='{0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

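-- Aggregate that concatenates input arrays across rows (SFUNC = array_cat);
-- used to assemble per-document feature arrays in order. Only created on
-- platforms with ordered-aggregate support (ORDERED on Greenplum).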
m4_changequote(<!,!>)
m4_ifdef(<!__HAS_ORDERED_AGGREGATES__!>,<!
CREATE
m4_ifdef(<!__GREENPLUM__!>,<!ORDERED!>)
AGGREGATE MADLIB_SCHEMA.array_union(anyarray) (
    SFUNC = array_cat,
    STYPE = anyarray
);
!>)
m4_changequote(`,')

-- We only need to document the last one (unfortunately, in Greenplum we have to
-- use function overloading instead of default arguments).
CREATE FUNCTION MADLIB_SCHEMA.compute_lincrf(
    "source" VARCHAR,
    "sparse_R" VARCHAR,
    "dense_M" VARCHAR,
    "sparse_M" VARCHAR,
    "featureSize" VARCHAR,
    "tagSize" INTEGER,
    "maxNumIterations" INTEGER)
RETURNS INTEGER
AS $$PythonFunction(crf, crf, compute_lincrf)$$
LANGUAGE plpythonu VOLATILE;

/**
 * @brief Compute linear-chain CRF coefficients and diagnostic statistics
 *
 * @param source Name of the source relation containing the training data
 * @param sparse_R Name of the sparse single-state feature column (of type DOUBLE PRECISION[])
 * @param dense_M Name of the dense two-state feature column (of type DOUBLE PRECISION[])
 * @param sparse_M Name of the sparse two-state feature column (of type DOUBLE PRECISION[])
 * @param featureSize Name of the feature size column (of type DOUBLE PRECISION)
 * @param tagSize The number of tags in the tag set
 * @param featureset The unique feature set
 * @param crf_feature The name of the output feature weight table
 * @param maxNumIterations The maximum number of iterations
 *
 * @return A composite value:
 *  - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
 *  - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
 *  - <tt>num_iterations INTEGER</tt> - The number of iterations before the
 *    algorithm terminated \n\n
 * The 'crf_feature' table is used to store all the features and their corresponding weights.
 *
 * @note This function starts an iterative algorithm. It is not an aggregate
 * function. Source and column names have to be passed as strings (due to
 * limitations of the SQL syntax).
 *
 * @internal
 * @sa This function is a wrapper for crf::compute_lincrf(), which
 * sets the default values.
 */

CREATE FUNCTION MADLIB_SCHEMA.lincrf(
    "source" VARCHAR,
    "sparse_R" VARCHAR,
    "dense_M" VARCHAR,
    "sparse_M" VARCHAR,
    "featureSize" VARCHAR,
    "tagSize" INTEGER,
    "featureset" VARCHAR,
    "crf_feature" VARCHAR,
    "maxNumIterations" INTEGER /*+ DEFAULT 20 */)
RETURNS INTEGER AS $$
DECLARE
    theIteration INTEGER;
BEGIN
    theIteration := (
        SELECT MADLIB_SCHEMA.compute_lincrf($1, $2, $3, $4, $5, $6, $9)
    );
    -- Because of Greenplum bug MPP-10050, we have to use dynamic SQL (using
    -- EXECUTE) in the following
    -- Because of Greenplum bug MPP-6731, we have to hide the tuple-returning
    -- function in a subquery
    EXECUTE
        $sql$
        INSERT INTO $sql$ || $8 || $sql$
        SELECT f_index, f_name, feature[1], feature[2], (result).coef[f_index+1]
        FROM (
            SELECT MADLIB_SCHEMA.internal_lincrf_lbfgs_result(_madlib_state) AS result
            FROM _madlib_iterative_alg
            WHERE _madlib_iteration = $sql$ || theIteration || $sql$
        ) subq, $sql$ || $7 || $sql$
        $sql$;
    RETURN theIteration;
END;
$$ LANGUAGE plpgsql VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.lincrf(
    "source" VARCHAR,
    "sparse_R" VARCHAR,
    "dense_M" VARCHAR,
    "sparse_M" VARCHAR,
    "featureSize" VARCHAR,
    "tagSize" INTEGER,
    "featureset" VARCHAR,
    "crf_feature" VARCHAR)
RETURNS INTEGER AS
$$SELECT MADLIB_SCHEMA.lincrf($1, $2, $3, $4, $5, $6, $7, $8, 20);$$
LANGUAGE sql VOLATILE;