User Documentation
 All Files Functions Groups
kmeans.sql_in File Reference

Set of functions for k-means clustering. More...

Go to the source code of this file.

Functions

kmeans_result kmeans (varchar rel_source, varchar expr_point, float8[][] initial_centroids, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
 Perform Lloyd's k-means local-search heuristic. More...
 
float8[][] kmeanspp_seeding (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", float8[][] initial_centroids=NULL)
 k-Means++ Seeding More...
 
kmeans_result kmeanspp (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
 Run k-Means++. More...
 
float8[][] kmeans_random_seeding (varchar rel_source, varchar expr_point, integer k, float8[][] initial_centroids=NULL)
 k-Means Random Seeding More...
 
kmeans_result kmeans_random (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
 Run k-Means with random seeding. More...
 
kmeans_result kmeans (varchar rel_source, varchar expr_point, varchar rel_initial_centroids, varchar expr_centroid, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
 Perform Lloyd's k-means local-search heuristic, but with initial centroids stored in a table. More...
 
float8 simple_silhouette (varchar rel_source, varchar expr_point, float8[][] centroids, varchar fn_dist="dist_norm2")
 Compute a simplified version of the silhouette coefficient. More...
 

Detailed Description

See Also
For a brief introduction to k-means clustering, see the module description k-Means Clustering.

Definition in file kmeans.sql_in.

Function Documentation

kmeans_result kmeans ( varchar  rel_source,
varchar  expr_point,
float8  initial_centroids[][],
varchar  fn_dist = "squared_dist_norm2",
varchar  agg_centroid = "avg",
integer  max_num_iterations = 20,
float8  min_frac_reassigned = 0.001 
)
Parameters
rel_sourceName of the relation containing input points
expr_pointExpression evaluating to point coordinates for each tuple
initial_centroidsMatrix containing the initial centroids as columns
fn_distName of a function with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION that returns the distance between two points. The default is the squared Euclidean distance.
agg_centroidName of an aggregate function with signature DOUBLE PRECISION[] -> DOUBLE PRECISION[] that, for each group of points, returns a centroid. In order for Lloyd's local-search heuristic to provably converge and to return a local minimum, this centroid should minimize the sum of distances between each point in the group and the centroid. The default is the average (mean/barycenter in Euclidean space), which satisfies this property if fn_dist = 'squared_dist_norm2'.
max_num_iterationsMaximum number of iterations
min_frac_reassignedFraction of reassigned points below which convergence is assumed and the algorithm terminates
Returns
A composite value:
  • centroids - Matrix with \( k \) centroids as columns.
  • frac_reassigned - Fraction of points that were assigned a different centroid in the last iteration.
  • num_iterations - The number of iterations before the algorithm terminated

Definition at line 348 of file kmeans.sql_in.

kmeans_result kmeans ( varchar  rel_source,
varchar  expr_point,
varchar  rel_initial_centroids,
varchar  expr_centroid,
varchar  fn_dist = "squared_dist_norm2",
varchar  agg_centroid = "avg",
integer  max_num_iterations = 20,
float8  min_frac_reassigned = 0.001 
)

This is a shortcut for running k-means with initial centroids stored in a table (as opposed to an array of centroids). It is equivalent to

SELECT kmeans(
    rel_source,
    expr_point,
    (SELECT matrix_agg($expr_centroid) FROM $rel_initial_centroids),
    fn_dist,
    agg_centroid,
    max_num_iterations,
    min_frac_reassigned
)

where $expr_centroid and $rel_initial_centroids denote textual substituions.

Definition at line 1093 of file kmeans.sql_in.

kmeans_result kmeans_random ( varchar  rel_source,
varchar  expr_point,
integer  k,
varchar  fn_dist = "squared_dist_norm2",
varchar  agg_centroid = "avg",
integer  max_num_iterations = 20,
float8  min_frac_reassigned = 0.001 
)

This is a shortcut for running k-means with random seeding. It is equivalent to

SELECT kmeans(
    rel_source,
    expr_point,
    kmeans_random_seeding(
        rel_source,
        expr_point,
        k
    ),
    fn_dist,
    agg_centroid,
    max_num_iterations,
    min_frac_reassigned
)

Definition at line 910 of file kmeans.sql_in.

float8 [][] kmeans_random_seeding ( varchar  rel_source,
varchar  expr_point,
integer  k,
float8  initial_centroids[][] = NULL 
)
Parameters
rel_sourceName of the relation containing input points
expr_pointExpression evaluating to point coordinates for each tuple
kNumber of centroids
initial_centroidsA matrix containing up to \( k \) columns as columns. kmeanspp_seeding() proceeds exactly as if these centroids had already been generated in previous iterations. This parameter may be NULL in which all \( k \) centroids will be generated.
Returns
A matrix containing \( k \) centroids as columns

Definition at line 804 of file kmeans.sql_in.

kmeans_result kmeanspp ( varchar  rel_source,
varchar  expr_point,
integer  k,
varchar  fn_dist = "squared_dist_norm2",
varchar  agg_centroid = "avg",
integer  max_num_iterations = 20,
float8  min_frac_reassigned = 0.001 
)

This is a shortcut for running k-means++. It is equivalent to

SELECT kmeans(
    rel_source,
    expr_point,
    kmeanspp_seeding(
        rel_source,
        expr_point,
        k,
        fn_dist
    ),
    fn_dist,
    agg_centroid,
    max_num_iterations,
    min_frac_reassigned
)

Definition at line 668 of file kmeans.sql_in.

float8 [][] kmeanspp_seeding ( varchar  rel_source,
varchar  expr_point,
integer  k,
varchar  fn_dist = "squared_dist_norm2",
float8  initial_centroids[][] = NULL 
)
Parameters
rel_sourceName of the relation containing input points
expr_pointExpression evaluating to point coordinates for each tuple
kNumber of centroids
fn_distName of a function with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION that returns the distance between two points
initial_centroidsA matrix containing up to \( k \) columns as columns. kmeanspp_seeding() proceeds exactly as if these centroids had already been generated in previous iterations. This parameter may be NULL in which all \( k \) centroids will be generated.
Returns
A matrix containing \( k \) centroids as columns

Definition at line 544 of file kmeans.sql_in.

float8 simple_silhouette ( varchar  rel_source,
varchar  expr_point,
float8  centroids[][],
varchar  fn_dist = "dist_norm2" 
)
Parameters
rel_sourceName of the relation containing input points
expr_pointExpression evaluating to point coordinates \( x_i \) for each tuple
centroidsMatrix \( M = (\vec{m_0} \dots \vec{m_{k-1}}) \in \mathbb{R}^{d \times k} \) with \( k \) columns, where column \( i \) contains the position of centroid \( i \).
fn_distName of a function with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION that returns the distance between two points
Returns
For each point \( x_i \), let \( d_1( x_i ) \) and \( d_2( x_i ) \) be the distance to the closest and 2nd-closest centroid, respectively. If there is more than one closest centroids then \( d_1( x_i ) = d_2( x_i )\). The return value is the average, over all points \( x_i \), of

\[ \frac{d_2( x_i ) - d_1(x_i)}{d_2(x_i)}, \]

where 0/0 is interpreted as 0. Clearly, the simplified silhouette coefficient assumes values in \( [0,1] \).

Definition at line 1230 of file kmeans.sql_in.