2.1.0
User Documentation for Apache MADlib

This module provides a basic stemming operation for text input. It is a support module for several machine learning algorithms that require a stemmer. Currently, it only supports English words.

The functions in this module are SQL interfaces to an implementation of the Porter Stemming Algorithm. The original stemming algorithm was written and is maintained by Martin Porter.

Implementation Notes

Each function described in this module works on either a text value or a text array.

Several of the functions require text values and return NULL for a NULL input. See the descriptions of the individual functions for details.

Stemmer Operations
stem_token()

Returns the stem of the token. Returns NULL if input is NULL.
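
As a minimal sketch of the NULL handling described above: the query below passes a NULL token and checks the result. The explicit cast to text is an assumption made here so that the untyped NULL resolves to the text variant of the function, and the madlib schema name follows the examples in this document.

```sql
-- A NULL token should yield a NULL stem, so is_null is expected to be true.
SELECT madlib.stem_token(NULL::text) IS NULL AS is_null;
```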

stem_token_arr()

Returns an array containing the stem of each token in the input array. A NULL token yields a NULL stem at the corresponding position.
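
The position-wise NULL behavior can be sketched with a small literal array. The three-element array is made up for illustration, and the cast to text[] is an assumption added so that the array type is unambiguous.

```sql
-- Stem each element of a text array. Per the description above, the NULL
-- element is expected to remain NULL at the same position in the result.
SELECT madlib.stem_token_arr(ARRAY['knitting', NULL, 'knives']::text[]) AS stems;
```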

Examples
  1. Create a table with some words to be stemmed.
    CREATE TABLE token_tbl ( id integer,
                             word text
                           );
    INSERT INTO token_tbl VALUES
     (1, 'kneel'),
     (2, 'kneeled'),
     (3, 'kneeling'),
     (4, 'kneels'),
     (5, 'knees'),
     (6, 'knell'),
     (7, 'knelt'),
     (8, 'knew'),
     (9, 'knick'),
     (10, 'knif'),
     (11, 'knife'),
     (12, 'knight'),
     (13, 'knightly'),
     (14, 'knights'),
     (15, 'knit'),
     (16, 'knits'),
     (17, 'knitted'),
     (18, 'knitting'),
     (19, 'knives'),
     (20, 'knob'),
     (21, 'knobs'),
     (22, 'knock'),
     (23, 'knocked'),
     (24, 'knocker'),
     (25, 'knockers'),
     (26, 'knocking'),
     (27, 'knocks'),
     (28, 'knopp'),
     (29, 'knot'),
     (30, 'knots');
    
  2. Return the stem of each word, ordered by id so that the rows line up with the input.
    SELECT madlib.stem_token(word) FROM token_tbl ORDER BY id;
    
     stem_token
     ------------
     kneel
     kneel
     kneel
     kneel
     knee
     knell
     knelt
     knew
     knick
     knif
     knife
     knight
     knight
     knight
     knit
     knit
     knit
     knit
     knive
     knob
     knob
     knock
     knock
     knocker
     knocker
     knock
     knock
     knopp
     knot
     knot
    (30 rows)
    
  3. The input can also be processed as an array.
    SELECT madlib.stem_token_arr(array_agg(word ORDER BY word)) FROM token_tbl;
    
      stem_token_arr
     -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     {kneel,kneel,kneel,kneel,knee,knell,knelt,knew,knick,knif,knife,knight,knight,knight,knit,knit,knit,knit,knive,knob,knob,knock,knock,knocker,knocker,knock,knock,knopp,knot,knot}
    (1 row)
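
Because several surface forms map to the same stem, a common follow-up is to group the words by their stem. The query below is a sketch of that idea using the token_tbl table from step 1; the column aliases (stem, num_words, words) are names chosen here for illustration.

```sql
-- Group the example words by stem and list the original words in each group.
SELECT madlib.stem_token(word)     AS stem,
       count(*)                    AS num_words,
       array_agg(word ORDER BY id) AS words
FROM token_tbl
GROUP BY 1   -- group on the first select-list column, i.e. the stem
ORDER BY 1;
```

Judging by the output in step 2, words such as kneel, kneeled, kneeling, and kneels would fall into a single group under the stem kneel.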
    

Related Topics

See the file porter_stemmer.sql_in for the list of functions and their usage.