SQL functions for POS/NER feature extraction.

Functions
void	crf_train_fgen (text segmenttbl, text regextbl, text dictionary, text featuretbl, text featureset)
	This function extracts POS/NER features from the training data.

void	crf_test_fgen (text segmenttbl, text dictionary, text labeltbl, text regextbl, text featuretbl, text viterbi_mtbl, text viterbi_rtbl)
	This function extracts POS/NER features from the testing data.

Detailed Description

Date: February 2012

See Also: For an introduction to POS/NER feature extraction, see the module description for Conditional Random Field.

Definition in file crf_feature_gen.sql_in.

Function Documentation

void crf_test_fgen	(	text	segmenttbl,
		text	dictionary,
		text	labeltbl,
		text	regextbl,
		text	featuretbl,
		text	viterbi_mtbl,
		text	viterbi_rtbl
	)

This feature extraction function produces two factor tables, the "m table" (viterbi_mtbl) and the "r table" (viterbi_rtbl). Both tables are used to calculate the best label sequence for each sentence.

The viterbi_mtbl table encodes the edge features, which depend solely on the current label and the previous label (the previous y value). The m table has three columns: prev_label, label, and value. If the number of labels is \( n \), then the m factor table has \( n^2 \) rows. Each row encodes the transition feature weight from the previous label to the current label.

startFeature is treated as a special edge feature that runs from the beginning to the first token. Likewise, endFeature is treated as a special edge feature that runs from the last token to the very end. The m table therefore encodes edgeFeature, startFeature, and endFeature. If the label space contains 45 labels, numbered 0 to 44, then the m factor array is as follows:

```
                 0  1  2  3  4  5...44
startFeature -1  a  a  a  a  a  a...a
edgeFeature   0  a  a  a  a  a  a...a
edgeFeature   1  a  a  a  a  a  a...a
...
edgeFeature  44  a  a  a  a  a  a...a
endFeature   45  a  a  a  a  a  a...a
```
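
The layout above can be illustrated with a minimal Python sketch that arranges (prev_label, label, value) rows into an array with one startFeature row, one row per previous label, and one endFeature row. The helper name `m_rows_to_array`, the sample weights, and the convention that prev_label -1 marks startFeature and prev_label n marks endFeature are assumptions read off the layout, not taken from crf_feature_gen.sql_in.

```python
def m_rows_to_array(rows, n_labels):
    """Arrange (prev_label, label, value) rows into an (n_labels + 2) x n_labels array.

    Assumed convention (inferred from the layout above, not from the MADlib source):
    prev_label == -1 marks startFeature, prev_label == n_labels marks endFeature.
    """
    # Row 0 holds startFeature, rows 1..n_labels hold edgeFeature,
    # and the last row holds endFeature.
    array = [[0.0] * n_labels for _ in range(n_labels + 2)]
    for prev_label, label, value in rows:
        array[prev_label + 1][label] = value
    return array


# Hypothetical weights for a 2-label space (labels 0 and 1).
rows = [
    (-1, 0, 0.5), (-1, 1, 0.1),   # startFeature
    (0, 0, 1.0), (0, 1, 2.0),     # edgeFeature from label 0
    (1, 0, 3.0), (1, 1, 4.0),     # edgeFeature from label 1
    (2, 0, 0.2), (2, 1, 0.7),     # endFeature
]
m_array = m_rows_to_array(rows, n_labels=2)
```

Shifting the row index by one keeps the -1 start row addressable as an ordinary list index; any other offset scheme would work equally well.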

The viterbi_rtbl table is tied to specific tokens. It encodes the single-state features, e.g., wordFeature and RegexFeature, for all tokens. The r table is represented as follows:
```
       0  1  2  3  4...44
token1 a  a  a  a  a...a
token2 a  a  a  a  a...a
```
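
To show how the two factor tables could be combined to find the best label sequence for a sentence, here is a minimal Viterbi sketch in Python. It is not the MADlib implementation. It assumes that scores are additive weights, that the m array has the startFeature row first and the endFeature row last, and that the r array has one row per token. The function name `viterbi_decode` and all weights are hypothetical.

```python
def viterbi_decode(m_array, r_array):
    """Return (best_labels, best_score) for one sentence.

    m_array: (n + 2) x n list; row 0 = startFeature, rows 1..n = edgeFeature
             (row prev + 1, column cur), row n + 1 = endFeature.
    r_array: one row of n single-state scores per token.
    Scores are assumed to be additive weights.
    """
    n = len(m_array[0])
    start, end = m_array[0], m_array[n + 1]
    edge = m_array[1:n + 1]

    # best[y] = best score of any label path ending in label y at the current token
    best = [start[y] + r_array[0][y] for y in range(n)]
    backptrs = []
    for scores in r_array[1:]:
        step, ptr = [], []
        for y in range(n):
            # Pick the previous label that maximizes the path score into y.
            prev = max(range(n), key=lambda p: best[p] + edge[p][y])
            ptr.append(prev)
            step.append(best[prev] + edge[prev][y] + scores[y])
        best = step
        backptrs.append(ptr)

    # Add endFeature, then follow back-pointers from the best final label.
    final = [best[y] + end[y] for y in range(n)]
    last = max(range(n), key=lambda y: final[y])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, final[last]


# Hypothetical 2-label example with 3 tokens.
m_array = [
    [0.0, 0.0],   # startFeature
    [0.0, 1.0],   # edgeFeature from label 0
    [1.0, 0.0],   # edgeFeature from label 1
    [0.0, 0.0],   # endFeature
]
r_array = [
    [2.0, 0.0],   # token1
    [0.0, 2.0],   # token2
    [2.0, 0.0],   # token3
]
labels, score = viterbi_decode(m_array, r_array)
```

With these weights the decoder returns the label sequence [0, 1, 0] with a total score of 8.0: each token takes its highest-scoring label, and both label switches earn the transition weight of 1.0.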

Parameters

segmenttbl	Name of table containing all the tokenized testing sentences.
dictionary	Name of table containing the dictionary.
labeltbl	Name of table containing the label space used in POS or other NLP tasks.
regextbl	Name of table containing all the regular expressions to capture regex features.
viterbi_mtbl	Name of table to store the m factors.
viterbi_rtbl	Name of table to store the r factors.

Definition at line 231 of file crf_feature_gen.sql_in.

void crf_train_fgen	(	text	segmenttbl,
		text	regextbl,
		text	dictionary,
		text	featuretbl,
		text	featureset
	)

Parameters

segmenttbl	Name of table containing all the tokenized training sentences.
regextbl	Name of table containing all the regular expressions to capture regex features.
dictionary	Name of table containing the dictionary.
featuretbl	Features generated from the training dataset.
featureset	Unique feature set generated from the training dataset.

Definition at line 46 of file crf_feature_gen.sql_in.
