This module provides a basic stemming operation for text input. It is a support module for several machine learning algorithms that require a stemmer. Currently, it only supports English words.
The functions in this module are a SQL interface to an implementation of the Porter Stemming Algorithm. The original stemming algorithm was written and is maintained by Martin Porter.
The functions described in this module operate on either a text value or a text array.
Several of the functions require TEXT values and return NULL for a NULL input. See the descriptions of the individual functions for details.
| Function | Description |
|---|---|
| stem_token() | Returns the stem of the token. Returns NULL if the input is NULL. |
| stem_token_arr() | Returns an array of stems, one for each token in the input array. The stem is NULL for a corresponding NULL token. |
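To make the NULL handling described above concrete, here is a minimal sketch. It assumes MADlib is installed in a schema named `madlib` (as in the other examples in this section) and that the two functions take a `text` and a `text[]` argument respectively; the literal tokens are purely illustrative.

```sql
-- A NULL token should yield a NULL stem, per the description of stem_token().
SELECT madlib.stem_token(NULL::text) AS stem;

-- A NULL element inside the array should yield a NULL at the same position
-- in the result, per the description of stem_token_arr().
SELECT madlib.stem_token_arr(ARRAY['knitting', NULL, 'knots']::text[]) AS stems;
```

The explicit casts are only there to make the argument types unambiguous.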
CREATE TABLE token_tbl (
    id integer,
    word text
);
INSERT INTO token_tbl VALUES
    (1, 'kneel'), (2, 'kneeled'), (3, 'kneeling'), (4, 'kneels'), (5, 'knees'),
    (6, 'knell'), (7, 'knelt'), (8, 'knew'), (9, 'knick'), (10, 'knif'),
    (11, 'knife'), (12, 'knight'), (13, 'knightly'), (14, 'knights'), (15, 'knit'),
    (16, 'knits'), (17, 'knitted'), (18, 'knitting'), (19, 'knives'), (20, 'knob'),
    (21, 'knobs'), (22, 'knock'), (23, 'knocked'), (24, 'knocker'), (25, 'knockers'),
    (26, 'knocking'), (27, 'knocks'), (28, 'knopp'), (29, 'knot'), (30, 'knots');
SELECT madlib.stem_token(word) FROM token_tbl;
 stem_token
------------
 kneel
 kneel
 kneel
 kneel
 knee
 knell
 knelt
 knew
 knick
 knif
 knife
 knight
 knight
 knight
 knit
 knit
 knit
 knit
 knive
 knob
 knob
 knock
 knock
 knocker
 knocker
 knock
 knock
 knopp
 knot
 knot
(30 rows)
SELECT madlib.stem_token_arr(array_agg(word order by word)) FROM token_tbl;
stem_token_arr ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {kneel,kneel,kneel,kneel,knee,knell,knelt,knew,knick,knif,knife,knight,knight,knight,knit,knit,knit,knit,knive,knob,knob,knock,knock,knocker,knocker,knock,knock,knopp,knot,knot} (1 row)
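Because inflected forms collapse to a common stem, one typical use is to group tokens by their stem. The query below is a sketch of that idea using only `stem_token()` and the `token_tbl` table created above; the column aliases `stem` and `num_tokens` are illustrative names, not part of the module.

```sql
-- Count how many tokens in token_tbl share each stem.
SELECT madlib.stem_token(word) AS stem,
       count(*)               AS num_tokens
FROM token_tbl
GROUP BY madlib.stem_token(word)
ORDER BY stem;
```

Given the stems listed above, 'kneel', 'kneeled', 'kneeling' and 'kneels' would be expected to fall into a single group.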
See the file porter_stemmer.sql_in for the list of functions and their usage.