tf(t,d)
is to the raw frequency of a word/term in a document, i.e. the number of times that word/term t
occurs in document d
. For this function, 'word' and 'term' are used interchangeably. Note: the term frequency is not normalized by the document length. term_frequency(input_table, doc_id_col, word_col, output_table, compute_vocab)
Arguments:
TEXT. The name of the table storing the documents. Each row is in the form <doc_id, word_vector> where doc_id
is an id, unique to each document, and word_vector
is a text array containing the words in the document. The word_vector
should contain multiple entries of a word if the document contains multiple occurrence of that word.
TEXT. The name of the column containing the document id.
TEXT. The name of the column containing the vector of words/terms in the document. This column should of type that can be cast to TEXT[].
TEXT. The name of the table to store the term frequency output. The output table contains the following columns:
id_col:
This the document id column (same as the one provided as input).word:
A word/term present in a document. This is either the original word present in word_col
or an id representing the word (depending on the value of compute_vocab below).count:
The number of times this word is found in the document. wordid:
An id assignment for each wordword:
The word/term DROP TABLE IF EXISTS documents; CREATE TABLE documents(docid INTEGER, doc_contents TEXT); INSERT INTO documents VALUES (1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'), (2, 'Chinchillas and kittens are cute.'), (3, 'My sister adopted two kittens yesterday'), (4, 'Look at this cute hamster munching on a piece of broccoli');
ALTER TABLE documents ADD COLUMN words TEXT[]; UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\s+\\.]');
DROP TABLE IF EXISTS documents_tf; SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf'); SELECT * FROM documents_tf order by docid;
docid | word | count -------+------------+------- 1 | ate | 1 1 | like | 1 1 | breakfast | 1 1 | to | 1 1 | broccoli | 1 1 | spinach | 1 1 | i | 2 1 | and | 2 1 | a | 1 1 | | 2 1 | smoothie | 1 1 | eat | 1 1 | banana | 2 1 | for | 1 2 | cute | 1 2 | are | 1 2 | kitten | 1 2 | and | 1 2 | chinchilla | 1 3 | kitten | 1 3 | my | 1 3 | a | 1 3 | sister | 1 3 | adopted | 1 3 | yesterday | 1 4 | at | 1 4 | of | 1 4 | piece | 1 4 | this | 1 4 | a | 1 4 | broccoli | 1 4 | hamster | 1 4 | munching | 1 4 | cute | 1 4 | look | 1 (35 rows)
DROP TABLE IF EXISTS documents_tf; DROP TABLE IF EXISTS documents_tf_vocabulary; SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf', TRUE); -- Output with wordid instead of the actual words SELECT * FROM documents_tf order by docid;
docid | wordid | count -------+--------+------- 1 | 0 | 2 1 | 1 | 1 1 | 3 | 2 1 | 6 | 1 1 | 7 | 2 1 | 8 | 1 1 | 9 | 1 1 | 12 | 1 1 | 13 | 1 1 | 15 | 2 1 | 17 | 1 1 | 24 | 1 1 | 25 | 1 1 | 27 | 1 2 | 16 | 1 2 | 3 | 1 2 | 4 | 1 2 | 10 | 1 2 | 11 | 1 3 | 1 | 1 3 | 16 | 1 3 | 28 | 1 3 | 23 | 1 3 | 2 | 1 3 | 20 | 1 4 | 9 | 1 4 | 11 | 1 4 | 22 | 1 4 | 14 | 1 4 | 26 | 1 4 | 1 | 1 4 | 5 | 1 4 | 18 | 1 4 | 19 | 1 4 | 21 | 1 (35 rows)
-- Vocabulary SELECT * FROM documents_tf_vocabulary order by wordid;
wordid | word --------+------------ 0 | 1 | a 2 | adopted 3 | and 4 | are 5 | at 6 | ate 7 | banana 8 | breakfast 9 | broccoli 10 | chinchilla 11 | cute 12 | eat 13 | for 14 | hamster 15 | i 16 | kitten 17 | like 18 | look 19 | munching 20 | my 21 | of 22 | piece 23 | sister 24 | smoothie 25 | spinach 26 | this 27 | to 28 | yesterday (29 rows)
File text_utilities.sql_in documenting the SQL functions. File utilities.sql_in documenting the utility functions for DB administration.