User Documentation
crf_feature_gen.sql_in File Reference

SQL function for POS/NER feature extraction.


Functions

void crf_train_fgen (text segmenttbl, text regextbl, text dictionary, text featuretbl, text featureset)
 This function extracts POS/NER features from the training data.
 
void crf_test_fgen (text segmenttbl, text dictionary, text labeltbl, text regextbl, text featuretbl, text viterbi_mtbl, text viterbi_rtbl)
 This function extracts POS/NER features from the testing data.
 

Detailed Description

Date
February 2012
See Also
For an introduction to POS/NER feature extraction, see the Conditional Random Field module description.

Definition in file crf_feature_gen.sql_in.

Function Documentation

void crf_test_fgen ( text  segmenttbl,
text  dictionary,
text  labeltbl,
text  regextbl,
text  featuretbl,
text  viterbi_mtbl,
text  viterbi_rtbl 
)

This feature extraction function will produce two factor tables, "m table" (viterbi_mtbl) and "r table" (viterbi_rtbl). The viterbi_mtbl table and viterbi_rtbl table are used to calculate the best label sequence for each sentence.

  • The viterbi_mtbl table encodes the edge features, which depend solely on the current label and the previous label (the previous y value). The m table has three columns: prev_label, label, and value. If the number of labels is \( n \), then the m factor table has \( n^2 \) rows. Each row encodes the transition feature weight from the previous label to the current label.

startFeature is treated as a special edge feature that runs from the beginning of the sentence to the first token. Likewise, endFeature is a special edge feature that runs from the last token to the very end. The m table therefore encodes the edgeFeature, startFeature, and endFeature. If the label space contains 45 labels, numbered 0 to 44, then the m factor array is as follows:

                 0  1  2  3  4  5...44
startFeature -1  a  a  a  a  a  a...a
edgeFeature   0  a  a  a  a  a  a...a
edgeFeature   1  a  a  a  a  a  a...a
...
edgeFeature  44  a  a  a  a  a  a...a
endFeature   45  a  a  a  a  a  a...a
  • The viterbi_rtbl table is tied to specific tokens. It encodes the single-state features, e.g., wordFeature and RegexFeature, for all tokens. The r table is laid out as follows:
           0  1  2  3  4...44
    token1 a  a  a  a  a...a
    token2 a  a  a  a  a...a
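
To make the role of the two factor tables concrete, here is a minimal Python sketch of Viterbi decoding over arrays laid out like the m and r arrays above. It is an illustration only, not the implementation in crf_feature_gen.sql_in: the function name viterbi_decode and the in-memory list-of-lists layout are hypothetical, and it assumes the stored values are additive scores (for example, log-space weights).

```python
def viterbi_decode(m, r):
    """Return (best_labels, best_score) for one sentence.

    Hypothetical helper for illustration; assumes additive scores.
    m: n+2 rows of n values, laid out like the m factor array above:
       m[0] is the startFeature row, m[1 + p] is the edgeFeature row for
       previous label p, and m[n + 1] is the endFeature row.
    r: one row per token, each holding n single-state scores.
    """
    n = len(r[0])
    start, edges, end = m[0], m[1:n + 1], m[n + 1]

    # Scores for the first token: start weight plus state score.
    score = [start[y] + r[0][y] for y in range(n)]
    back = []  # back[t - 1][y] = best previous label for label y at token t

    for t in range(1, len(r)):
        new_score, pointers = [], []
        for y in range(n):
            # Best previous label p for reaching label y at token t.
            best_prev = max(range(n), key=lambda p: score[p] + edges[p][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + edges[best_prev][y] + r[t][y])
        score = new_score
        back.append(pointers)

    # Add the end weight, then pick the best final label.
    final = [score[y] + end[y] for y in range(n)]
    last = max(range(n), key=lambda y: final[y])
    best_score = final[last]

    # Walk the back-pointers to recover the full label sequence.
    labels = [last]
    for pointers in reversed(back):
        labels.append(pointers[labels[-1]])
    labels.reverse()
    return labels, best_score
```

For a toy label space of two labels, m would have four rows (start, two edge rows, end) and r would have one row per token; the function returns the highest-scoring label sequence together with its total score.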
Parameters
segmenttbl    Name of table containing all the tokenized testing sentences.
dictionary    Name of table containing the dictionary.
labeltbl      Name of table containing the label space used in POS or other NLP tasks.
regextbl      Name of table containing all the regular expressions to capture regex features.
viterbi_mtbl  Name of table to store the m factors.
viterbi_rtbl  Name of table to store the r factors.

Definition at line 231 of file crf_feature_gen.sql_in.

void crf_train_fgen ( text  segmenttbl,
text  regextbl,
text  dictionary,
text  featuretbl,
text  featureset 
)
Parameters
segmenttbl  Name of table containing all the tokenized training sentences.
regextbl    Name of table containing all the regular expressions to capture regex features.
dictionary  Name of table containing the dictionary.
featuretbl  Features generated from the training dataset.
featureset  Unique feature set generated from the training dataset.

Definition at line 46 of file crf_feature_gen.sql_in.