2.1.0
User Documentation for Apache MADlib
Stratified Sampling

Stratified sampling is a method for independently sampling subpopulations (strata). It is commonly used to reduce sampling error by ensuring that subgroups are adequately represented in the sample.

Stratified Sampling
stratified_sample(  source_table,
                    output_table,
                    proportion,
                    grouping_cols,
                    target_cols,
                    with_replacement
                  )

Arguments

source_table

TEXT. Name of the table containing the input data.

output_table

TEXT. Name of output table that contains the sampled data. The output table contains all columns present in the source table unless otherwise specified in the 'target_cols' parameter below.

proportion

FLOAT8 in the range (0,1). Each stratum is sampled independently.

grouping_cols (optional)

TEXT, default: NULL. A single column or a list of comma-separated columns that defines the strata. When this parameter is NULL, no grouping is used so the sampling is non-stratified, that is, the whole table is treated as a single group.

target_cols (optional)

TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'. If NULL or '*', all columns from the 'source_table' will appear in the 'output_table'.

Note
Do not include 'grouping_cols' in the parameter 'target_cols', because they are always included in the 'output_table'.
with_replacement (optional)
BOOLEAN, default FALSE. Determines whether to sample with replacement or without replacement (default). With replacement means that it is possible that the same row may appear in the sample set more than once. Without replacement means a given row can be selected only once.

Examples

Please note that due to the random nature of sampling, your results may look different from those below.

  1. Create an input table:
    DROP TABLE IF EXISTS test;
    CREATE TABLE test(
        id1 INTEGER,
        id2 INTEGER,
        gr1 INTEGER,
        gr2 INTEGER
    );
    INSERT INTO test VALUES
    (1,0,1,1),
    (2,0,1,1),
    (3,0,1,1),
    (4,0,1,1),
    (5,0,1,1),
    (6,0,1,1),
    (7,0,1,1),
    (8,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (0,1,1,2),
    (0,2,1,2),
    (0,3,1,2),
    (0,4,1,2),
    (0,5,1,2),
    (0,6,1,2),
    (10,10,2,2),
    (20,20,2,2),
    (30,30,2,2),
    (40,40,2,2),
    (50,50,2,2),
    (60,60,2,2),
    (70,70,2,2);
    
  2. Sample without replacement:
    DROP TABLE IF EXISTS out;
    SELECT madlib.stratified_sample(
                                    'test',    -- Source table
                                    'out',     -- Output table
                                    0.5,       -- Sample proportion
                                    'gr1,gr2', -- Strata definition
                                    'id1,id2', -- Columns to output
                                    FALSE);    -- Sample without replacement
    SELECT * FROM out ORDER BY gr1,gr2,id1,id2;
    
     gr1 | gr2 | id1 | id2
    -----+-----+-----+-----
       1 |   1 |   2 |   0
       1 |   1 |   4 |   0
       1 |   1 |   7 |   0
       1 |   1 |   8 |   0
       1 |   1 |   9 |   0
       1 |   1 |   9 |   0
       1 |   2 |   0 |   2
       1 |   2 |   0 |   3
       1 |   2 |   0 |   4
       2 |   2 |  20 |  20
       2 |   2 |  30 |  30
       2 |   2 |  40 |  40
       2 |   2 |  60 |  60
    (13 rows)
    
  3. Sample with replacement:
    DROP TABLE IF EXISTS out;
    SELECT madlib.stratified_sample(
                                    'test',    -- Source table
                                    'out',     -- Output table
                                    0.5,       -- Sample proportion
                                    'gr1,gr2', -- Strata definition
                                    'id1,id2', -- Columns to output
                                    TRUE);     -- Sample with replacement
    SELECT * FROM out ORDER BY gr1,gr2,id1,id2;
    
     gr1 | gr2 | id1 | id2
    ----—+----—+----—+----—
       1 |   1 |   3 |   0
       1 |   1 |   6 |   0
       1 |   1 |   6 |   0
       1 |   1 |   7 |   0
       1 |   1 |   7 |   0
       1 |   1 |   9 |   0
       1 |   2 |   0 |   1
       1 |   2 |   0 |   2
       1 |   2 |   0 |   6
       2 |   2 |  20 |  20
       2 |   2 |  30 |  30
       2 |   2 |  50 |  50
       2 |   2 |  50 |  50