Set of functions for k-means clustering. More...

Functions
kmeans_result	kmeans (varchar rel_source, varchar expr_point, float8[][] initial_centroids, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
	Perform Lloyd's k-means local-search heuristic. More...

float8[][]	kmeanspp_seeding (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", float8[][] initial_centroids=NULL)
	k-Means++ Seeding More...

kmeans_result	kmeanspp (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
	Run k-Means++. More...

float8[][]	kmeans_random_seeding (varchar rel_source, varchar expr_point, integer k, float8[][] initial_centroids=NULL)
	k-Means Random Seeding More...

kmeans_result	kmeans_random (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
	Run k-Means with random seeding. More...

kmeans_result	kmeans (varchar rel_source, varchar expr_point, varchar rel_initial_centroids, varchar expr_centroid, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001)
	Perform Lloyd's k-means local-search heuristic, but with initial centroids stored in a table. More...

float8	simple_silhouette (varchar rel_source, varchar expr_point, float8[][] centroids, varchar fn_dist="dist_norm2")
	Compute a simplified version of the silhouette coefficient. More...

Detailed Description

See Also: For a brief introduction to k-means clustering, see the module description k-Means Clustering.

Definition in file kmeans.sql_in.

Function Documentation

kmeans_result kmeans	(	varchar	rel_source,
		varchar	expr_point,
		float8	initial_centroids[][],
		varchar	fn_dist = `"squared_dist_norm2"`,
		varchar	agg_centroid = `"avg"`,
		integer	max_num_iterations = `20`,
		float8	min_frac_reassigned = `0.001`
	)

Parameters

rel_source	Name of the relation containing input points
expr_point	Expression evaluating to point coordinates for each tuple
initial_centroids	Matrix containing the initial centroids as columns
fn_dist	Name of a function with signature `DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION` that returns the distance between two points. The default is the squared Euclidean distance.
agg_centroid	Name of an aggregate function with signature `DOUBLE PRECISION[] -> DOUBLE PRECISION[]` that, for each group of points, returns a centroid. In order for Lloyd's local-search heuristic to provably converge and to return a local minimum, this centroid should minimize the sum of distances between each point in the group and the centroid. The default is the average (mean/barycenter in Euclidean space), which satisfies this property if `fn_dist = 'squared_dist_norm2'`.
max_num_iterations	Maximum number of iterations
min_frac_reassigned	Fraction of reassigned points below which convergence is assumed and the algorithm terminates

Returns

A composite value:

centroids - Matrix with $ k $ centroids as columns.
frac_reassigned - Fraction of points that were assigned a different centroid in the last iteration.
num_iterations - The number of iterations before the algorithm terminated

Definition at line 348 of file kmeans.sql_in.

kmeans_result kmeans	(	varchar	rel_source,
		varchar	expr_point,
		varchar	rel_initial_centroids,
		varchar	expr_centroid,
		varchar	fn_dist = `"squared_dist_norm2"`,
		varchar	agg_centroid = `"avg"`,
		integer	max_num_iterations = `20`,
		float8	min_frac_reassigned = `0.001`
	)

This is a shortcut for running k-means with initial centroids stored in a table (as opposed to an array of centroids). It is equivalent to

SELECT kmeans(
    rel_source,
    expr_point,
    (SELECT matrix_agg($expr_centroid) FROM $rel_initial_centroids),
    fn_dist,
    agg_centroid,
    max_num_iterations,
    min_frac_reassigned
)

where $expr_centroid and $rel_initial_centroids denote textual substituions.

Definition at line 1093 of file kmeans.sql_in.

kmeans_result kmeans_random	(	varchar	rel_source,
		varchar	expr_point,
		integer	k,
		varchar	fn_dist = `"squared_dist_norm2"`,
		varchar	agg_centroid = `"avg"`,
		integer	max_num_iterations = `20`,
		float8	min_frac_reassigned = `0.001`
	)

This is a shortcut for running k-means with random seeding. It is equivalent to

SELECT kmeans(
    rel_source,
    expr_point,
    kmeans_random_seeding(
        rel_source,
        expr_point,
        k
    ),
    fn_dist,
    agg_centroid,
    max_num_iterations,
    min_frac_reassigned
)

Definition at line 910 of file kmeans.sql_in.

float8 [][] kmeans_random_seeding	(	varchar	rel_source,
		varchar	expr_point,
		integer	k,
		float8	initial_centroids[][] = `NULL`
	)

Parameters

rel_source	Name of the relation containing input points
expr_point	Expression evaluating to point coordinates for each tuple
k	Number of centroids
initial_centroids	A matrix containing up to $ k $ columns as columns. kmeanspp_seeding() proceeds exactly as if these centroids had already been generated in previous iterations. This parameter may be NULL in which all $ k $ centroids will be generated.

Returns: A matrix containing $ k $ centroids as columns

Definition at line 804 of file kmeans.sql_in.

kmeans_result kmeanspp	(	varchar	rel_source,
		varchar	expr_point,
		integer	k,
		varchar	fn_dist = `"squared_dist_norm2"`,
		varchar	agg_centroid = `"avg"`,
		integer	max_num_iterations = `20`,
		float8	min_frac_reassigned = `0.001`
	)

This is a shortcut for running k-means++. It is equivalent to

SELECT kmeans(
    rel_source,
    expr_point,
    kmeanspp_seeding(
        rel_source,
        expr_point,
        k,
        fn_dist
    ),
    fn_dist,
    agg_centroid,
    max_num_iterations,
    min_frac_reassigned
)

Definition at line 668 of file kmeans.sql_in.

float8 [][] kmeanspp_seeding	(	varchar	rel_source,
		varchar	expr_point,
		integer	k,
		varchar	fn_dist = `"squared_dist_norm2"`,
		float8	initial_centroids[][] = `NULL`
	)

Parameters

rel_source	Name of the relation containing input points
expr_point	Expression evaluating to point coordinates for each tuple
k	Number of centroids
fn_dist	Name of a function with signature `DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION` that returns the distance between two points
initial_centroids	A matrix containing up to $ k $ columns as columns. kmeanspp_seeding() proceeds exactly as if these centroids had already been generated in previous iterations. This parameter may be NULL in which all $ k $ centroids will be generated.

Returns: A matrix containing $ k $ centroids as columns

Definition at line 544 of file kmeans.sql_in.

float8 simple_silhouette	(	varchar	rel_source,
		varchar	expr_point,
		float8	centroids[][],
		varchar	fn_dist = `"dist_norm2"`
	)

Parameters

rel_source	Name of the relation containing input points
expr_point	Expression evaluating to point coordinates $ x_i $ for each tuple
centroids	Matrix $ M = (\vec{m_0} \dots \vec{m_{k-1}}) \in \mathbb{R}^{d \times k} $ with $ k $ columns, where column $ i $ contains the position of centroid $ i $.
fn_dist	Name of a function with signature `DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION` that returns the distance between two points

Returns: For each point $ x_i $, let $ d_1( x_i ) $ and $ d_2( x_i ) $ be the distance to the closest and 2nd-closest centroid, respectively. If there is more than one closest centroids then $ d_1( x_i ) = d_2( x_i )$. The return value is the average, over all points $ x_i $, of
\[ \frac{d_2( x_i ) - d_1(x_i)}{d_2(x_i)}, \]
where 0/0 is interpreted as 0. Clearly, the simplified silhouette coefficient assumes values in $ [0,1] $.

Definition at line 1230 of file kmeans.sql_in.

Functions

Detailed Description

Function Documentation