2.1.0
User Documentation for Apache MADlib
Warning
This MADlib method is still in early stage development. Interface and implementation are subject to change.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm designed to discover clusters of arbitrary shape [1,2]. It requires minimal domain knowledge to determine its input parameters and performs efficiently on large databases.

Given a set of points, DBSCAN groups together points that are closely packed with many nearby neighbors (core points), and marks points that lie alone in low-density regions with few nearby neighbors (border points). Points that are neither core nor border are considered noise. This method tends to work well for data that contains clusters of similar density.

Currently only a brute force approach is implemented, suitable for small datasets. Other approaches for larger datasets will be implemented in a future release.
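As a rough illustration of what the brute force approach computes, here is a minimal Python sketch of DBSCAN. This is not MADlib's implementation (which runs in-database); it only shows the logic: a point is core if it has at least min_samples points, itself included, within eps; clusters grow outward from core points, and points unreachable from any core point remain noise.

```python
# Minimal brute-force DBSCAN sketch (illustrative only, not MADlib's code).
import math

def dbscan(points, eps, min_samples):
    n = len(points)
    # Neighborhoods by Euclidean distance, like the 'dist_norm2' metric.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1       # provisionally noise
            continue
        cluster += 1             # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:  # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand
    return labels
```

With eps=1.75 and min_samples=4, the four tightly packed points below form one cluster while the isolated point stays noise, mirroring the SQL example further down.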

Clustering Function
dbscan( source_table,
        output_table,
        id_column,
        expr_point,
        eps,
        min_samples,
        metric,
        algorithm,
        max_segmentation_depth
      )

Arguments

source_table

TEXT. Name of the table containing the input data points.

output_table

TEXT. Name of the table containing the clustering results.

id_column

TEXT. Name of a column or expression containing a unique integer id for each training point.

expr_point

TEXT. Name of the column with point coordinates in array form, or an expression that evaluates to an array of point coordinates.

eps

FLOAT8. Maximum distance between two samples for one to be considered in the neighborhood of the other. (Note this is not a maximum bound on the distances of points within a cluster.) This is an important parameter to choose appropriately; take into account both the nature of the data set and the distance function.

min_samples (optional)

INTEGER, default: 5. Number of samples in a neighborhood for a point to be considered a core point. This includes the point itself. This parameter controls how tolerant the algorithm is towards noise, so on noisy data sets it may be useful to increase this value.

Note
The parameters 'eps' and 'min_samples' together define the density of a cluster. A core point is one with at least 'min_samples' points (including itself) within a distance of 'eps'; those points are its neighbors. A higher value of 'min_samples' or a lower value of 'eps' indicates that higher density is needed to form a cluster.
metric (optional)
TEXT, default: 'squared_dist_norm2'. The name of the function used to calculate the distance between data points. The following distance functions can be used:
  • dist_norm1: 1-norm/Manhattan distance
  • dist_norm2: 2-norm/Euclidean distance
  • squared_dist_norm2: squared Euclidean distance
  • dist_angle: angle between the two points viewed as vectors
  • dist_tanimoto: Tanimoto distance
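For reference, the Python sketch below shows the standard definitions these metric names correspond to. The authoritative implementations are MADlib's in-database functions; treat these as illustrative only.

```python
# Standard definitions of the five distance metrics (illustrative sketches).
import math

def dist_norm1(x, y):            # 1-norm / Manhattan
    return sum(abs(a - b) for a, b in zip(x, y))

def dist_norm2(x, y):            # 2-norm / Euclidean
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_dist_norm2(x, y):    # squared Euclidean (no sqrt, so cheaper)
    return sum((a - b) ** 2 for a, b in zip(x, y))

def dist_angle(x, y):            # angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

def dist_tanimoto(x, y):         # Tanimoto distance
    dot = sum(a * b for a, b in zip(x, y))
    sx = sum(a * a for a in x)
    sy = sum(b * b for b in y)
    return 1.0 - dot / (sx + sy - dot)
```

Note that squared Euclidean distance preserves the ordering of Euclidean distances while skipping the square root, which is why it is the default.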

algorithm (optional)

TEXT, default: 'optimized'. The name of the algorithm used to compute clusters.

  • brute_force: Brute force can be slow and should not be used for large datasets. However, it has less initial setup overhead than the parallel-optimized algorithm, which may make it faster in some cases involving small datasets. You can also use a short form such as "b" or "brute" to select brute force.

  • optimized: This uses an r-tree index to accelerate range queries while performing the cluster detection. It is designed with Greenplum Database in mind, in which case it intelligently partitions the space into a number of connected regions, runs DBSCAN in parallel on each region, and then merges the results together. By default, the maximum number of regions it will consider is equal to the number of segments in the database; if you suspect it is spending too much time on segmentation, this can be limited further by setting the max_segmentation_depth parameter to a lower value. That decreases the segmentation overhead, but also decreases the amount of parallelism. If the dataset is relatively small, another option to consider is brute force, which has very little overhead but does not scale as well.

    For large enough datasets and an appropriate choice of eps, this algorithm should run in O((N/S) log N) time, where N is the number of rows (points) in the input and S is the number of regions (often equal to the number of segments, but possibly fewer). The worst case for brute force is O(N^2). Choose eps based on the density of the dataset; DBSCAN works best when the clusters have roughly uniform density. If eps is chosen too large, the runtime will explode and nearly every point will end up connected to every other point in one large cluster. Best practice is to start with a relatively small eps, so the first result returns faster; if there appear to be too many small clusters, increase eps and let it run longer.

max_segmentation_depth (optional)

INTEGER, default: <number of segments>. The maximum number of regions into which to segment the data. See the 'optimized' algorithm above for usage.

Output
The output table for the DBSCAN module has the following columns:

id SMALLINT|INTEGER|BIGINT. The unique id of each training point, as specified by the id column or expression. Must evaluate to an integer type.
cluster_id INTEGER. The cluster id of each classified data point.
is_core_point BOOLEAN. Indicates whether the training data point is a core point. If a point is listed in the output table but is not a core point, it is a border point. Noise points are not shown in this table.
point TEXT. The coordinates of each classified point.

Cluster Assignment

After clustering, the cluster assignment for each data point can be computed:

dbscan_predict( dbscan_table,
                source_table,
                id,
                point,
                output_table
                )

Arguments

dbscan_table

TEXT. Name of the table created by running DBSCAN.

source_table

TEXT. Name of the table containing the input data points.

id

TEXT. Name of a column or expression containing a unique integer id for each point in source_table.

point

TEXT. Name of the column with point coordinates in array form, or an expression that evaluates to an array of point coordinates.

output_table

TEXT. Name of the table containing the cluster assignment results.

Output
The output is a table with the following columns:

id BIGINT. The unique id of each row, as it was passed in.
cluster_id INTEGER. Cluster assignment (zero-based, i.e., 0,1,2...).
distance DOUBLE PRECISION. Distance to the nearest point in the assigned cluster.

Examples
  1. Create input data:
    DROP TABLE IF EXISTS dbscan_train_data;
    CREATE TABLE dbscan_train_data (pid int, points double precision[]);
    INSERT INTO dbscan_train_data VALUES
    (1,  '{1, 1}'),
    (2,  '{2, 1}'),
    (3,  '{1, 2}'),
    (4,  '{2, 2}'),
    (5,  '{3, 5}'),
    (6,  '{3, 9}'),
    (7,  '{3, 10}'),
    (8,  '{4, 10}'),
    (9,  '{4, 11}'),
    (10,  '{5, 10}'),
    (11,  '{7, 10}'),
    (12,  '{10, 9}'),
    (13,  '{10, 6}'),
    (14,  '{9, 5}'),
    (15,  '{10, 5}'),
    (16,  '{11, 5}'),
    (17,  '{9, 4}'),
    (18,  '{10, 4}'),
    (19,  '{11, 4}'),
    (20,  '{10, 3}');
    CREATE TABLE dbscan_test_data (pid int, points double precision[]);
    INSERT INTO dbscan_test_data VALUES
    (1,  '{1, 2}'),
    (2,  '{2, 2}'),
    (3,  '{1, 3}'),
    (4,  '{2, 2}'),
    (10,  '{5, 11}'),
    (11,  '{7, 10}'),
    (12,  '{10, 9}'),
    (13,  '{10, 6}'),
    (14,  '{9, 5}'),
    (15,  '{10, 6}');
    
  2. Run DBSCAN using the brute force method with a Euclidean distance function:
    DROP TABLE IF EXISTS dbscan_result, dbscan_result_summary;
    SELECT madlib.dbscan(
                    'dbscan_train_data',    -- source table
                    'dbscan_result',        -- output table
                    'pid',                  -- point id column
                    'points',               -- data point
                     1.75,                  -- epsilon
                     4,                     -- min samples
                    'dist_norm2',           -- metric
                    'brute_force');         -- algorithm
    SELECT * FROM dbscan_result ORDER BY pid;
    
     pid | cluster_id | is_core_point | __points__
    -----+------------+---------------+------------
       1 |          0 | t             | {1,1}
       2 |          0 | t             | {2,1}
       3 |          0 | t             | {1,2}
       4 |          0 | t             | {2,2}
       6 |          1 | f             | {3,9}
       7 |          1 | t             | {3,10}
       8 |          1 | t             | {4,10}
       9 |          1 | t             | {4,11}
      10 |          1 | f             | {5,10}
      13 |          2 | t             | {10,6}
      14 |          2 | t             | {9,5}
      15 |          2 | t             | {10,5}
      16 |          2 | t             | {11,5}
      17 |          2 | t             | {9,4}
      18 |          2 | t             | {10,4}
      19 |          2 | t             | {11,4}
      20 |          2 | t             | {10,3}
    (17 rows)
    
    Three clusters are created. All points are core points except for 6 and 10, which are border points. Noise points do not appear in the output table. To see the noise points, you can use a query like:
    SELECT l.* FROM dbscan_train_data l WHERE NOT EXISTS
        (SELECT NULL FROM dbscan_result r WHERE r.pid = l.pid)
        ORDER BY l.pid;
    
     pid | points
    -----+--------
       5 | {3,5}
      11 | {7,10}
      12 | {10,9}
    
    The summary table lists the 'eps' value and the distance metric used:
    SELECT * FROM dbscan_result_summary;
    
     id_column | eps  |   metric
    -----------+------+------------
     pid       | 1.75 | dist_norm2
    
  3. Find the cluster assignment for the test data points:
    SELECT madlib.dbscan_predict(
                            'dbscan_result',        -- from DBSCAN run
                            'dbscan_test_data',     -- test dataset
                            'pid',                  -- point id column
                            'points',               -- data point
                            'dbscan_predict_out'    -- output table
                            );
    SELECT * FROM dbscan_predict_out ORDER BY pid;
    
     pid | cluster_id | distance 
    -----+------------+----------
       1 |          0 |        0
       2 |          0 |        0
       3 |          0 |        1
       4 |          0 |        0
      10 |          1 |        1
      13 |          2 |        0
      14 |          2 |        0
      15 |          2 |        0
    (8 rows)
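    The assignment rule suggested by this example output can be sketched in Python as follows. This is an assumption inferred from the example, not a statement of MADlib's internals: a test point appears to join the cluster of its nearest classified point, provided that point lies within eps, and the reported distance is the distance to that point. The function name dbscan_predict below is an illustrative stand-in, not MADlib's SQL interface.

```python
# Hypothetical sketch of the cluster-assignment rule used at predict time.
import math

def dbscan_predict(cluster_points, eps, test_points):
    # cluster_points: list of (coords, cluster_id) from the DBSCAN output
    # test_points:    list of (id, coords) to be assigned
    out = []
    for pid, p in test_points:
        best = min(cluster_points, key=lambda c: math.dist(p, c[0]))
        d = math.dist(p, best[0])
        if d <= eps:                        # too-distant points stay noise
            out.append((pid, best[1], d))   # (id, cluster_id, distance)
    return out
```

For instance, test point {1,3} sits at distance 1 from cluster 0's point {1,2}, matching the distance of 1 reported for pid 3 above, while points farther than eps from every cluster (such as pids 11 and 12) are omitted from the output.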
    

Literature

[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", KDD-96 Proceedings, https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf

[2] Erich Schubert, Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, July 2017, Article No. 19, https://dl.acm.org/doi/10.1145/3068335