SQL function for POS/NER feature extraction.
void | crf_train_fgen (text segmenttbl, text regextbl, text dictionary, text featuretbl, text featureset) | This function extracts POS/NER features from the training data. |
void | crf_test_fgen (text segmenttbl, text dictionary, text labeltbl, text regextbl, text featuretbl, text viterbi_mtbl, text viterbi_rtbl) | This function extracts POS/NER features from the testing data. |
- Date
- February 2012
- See Also
- For an introduction to POS/NER feature extraction, see the Conditional Random Field module description.
void crf_test_fgen (text segmenttbl, text dictionary, text labeltbl, text regextbl, text featuretbl, text viterbi_mtbl, text viterbi_rtbl)
This feature extraction function produces two factor tables, the "m table" (viterbi_mtbl) and the "r table" (viterbi_rtbl). Both tables are used to calculate the best label sequence for each sentence.
- The viterbi_mtbl table encodes the edge features, which depend solely on the current label and the previous label. The m table has three columns: prev_label, label, and value. If the number of labels is \( n \), then the m factor table has \( n^2 \) rows. Each row encodes the transition feature weight from the previous label to the current label.
startFeature is treated as a special edge feature from the beginning of the sentence to the first token. Likewise, endFeature is treated as a special edge feature from the last token to the end of the sentence. The m table therefore encodes edgeFeature, startFeature, and endFeature. If the label space contains 45 labels, numbered 0 to 44, then the m factor array is laid out as follows:
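feature | index | 0 | 1 | 2 | 3 | 4 | 5 | ... | 44 |
startFeature | -1 | a | a | a | a | a | a | ... | a |
edgeFeature | 0 | a | a | a | a | a | a | ... | a |
edgeFeature | 1 | a | a | a | a | a | a | ... | a |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
edgeFeature | 44 | a | a | a | a | a | a | ... | a |
endFeature | 45 | a | a | a | a | a | a | ... | a |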
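To make the three-column layout described above concrete, the following is a minimal sketch in plain PostgreSQL that builds a table with the same shape as the m table for a toy label space of three labels. The table name demo_mtbl and the zero placeholder weights are assumptions made purely for illustration; the real table is populated by crf_test_fgen, and this sketch does not model the startFeature and endFeature rows.

```sql
-- Illustration only: a toy table shaped like the m table
-- (prev_label, label, value). demo_mtbl and the 0.0 weights are
-- placeholders, not output of crf_test_fgen.
CREATE TEMP TABLE demo_mtbl AS
SELECT p.prev_label,
       l.label,
       0.0::double precision AS value   -- placeholder transition weight
FROM generate_series(0, 2) AS p(prev_label)
CROSS JOIN generate_series(0, 2) AS l(label);

-- One row per (prev_label, label) pair, so n = 3 labels give n^2 = 9 rows.
SELECT count(*) AS num_rows FROM demo_mtbl;

-- Look up the weight of the transition from label 1 to label 2.
SELECT value FROM demo_mtbl WHERE prev_label = 1 AND label = 2;
```

With the full 45-label space from the example, the same construction would yield 45 × 45 = 2025 edge rows.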
- Parameters
-
segmenttbl | Name of table containing all the tokenized testing sentences. |
dictionary | Name of table containing the dictionary. |
labeltbl | Name of table containing the label space used in POS or other NLP tasks. |
regextbl | Name of table containing all the regular expressions to capture regex features. |
featuretbl | Name of table containing features. |
viterbi_mtbl | Name of table to store the m factors. |
viterbi_rtbl | Name of table to store the r factors. |
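A call might look like the sketch below. All table names are hypothetical placeholders, and the madlib schema qualifier is an assumption about where the function is installed; adjust both to match the local setup.

```sql
-- Hypothetical table names; the madlib schema qualifier is an assumption.
SELECT madlib.crf_test_fgen(
    'test_segmenttbl',   -- segmenttbl: tokenized testing sentences
    'crf_dictionary',    -- dictionary: dictionary table
    'crf_label',         -- labeltbl: label space
    'crf_regex',         -- regextbl: regular expressions for regex features
    'crf_feature',       -- featuretbl: features
    'viterbi_mtbl',      -- viterbi_mtbl: table to store the m factors
    'viterbi_rtbl'       -- viterbi_rtbl: table to store the r factors
);
```

Note that dictionary comes before regextbl in this function, which is the reverse of the order used by crf_train_fgen.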
void crf_train_fgen (text segmenttbl, text regextbl, text dictionary, text featuretbl, text featureset)
- Parameters
-
segmenttbl | Name of table containing all the tokenized training sentences. |
regextbl | Name of table containing all the regular expressions to capture regex features. |
dictionary | Name of table containing the dictionary. |
featuretbl | Name of table containing the features generated from the training dataset. |
featureset | Name of table containing the unique feature set generated from the training dataset. |
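A call might look like the sketch below. All table names are hypothetical placeholders, and the madlib schema qualifier is an assumption about where the function is installed.

```sql
-- Hypothetical table names; the madlib schema qualifier is an assumption.
SELECT madlib.crf_train_fgen(
    'train_segmenttbl',  -- segmenttbl: tokenized training sentences
    'crf_regex',         -- regextbl: regular expressions for regex features
    'crf_dictionary',    -- dictionary: dictionary table
    'train_featuretbl',  -- featuretbl: features from the training dataset
    'train_featureset'   -- featureset: unique feature set from the training dataset
);
```

Here regextbl precedes dictionary, unlike in crf_test_fgen, so the two calls cannot share a positional argument list unchanged.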