MADlib
1.1 A newer version is available
User Documentation
|
Set of functions for k-means clustering. More...
Go to the source code of this file.
Functions | |
kmeans_result | kmeans (varchar rel_source, varchar expr_point, float8[][] initial_centroids, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001) |
Perform Lloyd's k-means local-search heuristic. More... | |
float8[][] | kmeanspp_seeding (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", float8[][] initial_centroids=NULL) |
k-Means++ Seeding More... | |
kmeans_result | kmeanspp (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001) |
Run k-Means++. More... | |
float8[][] | kmeans_random_seeding (varchar rel_source, varchar expr_point, integer k, float8[][] initial_centroids=NULL) |
k-Means Random Seeding More... | |
kmeans_result | kmeans_random (varchar rel_source, varchar expr_point, integer k, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001) |
Run k-Means with random seeding. More... | |
kmeans_result | kmeans (varchar rel_source, varchar expr_point, varchar rel_initial_centroids, varchar expr_centroid, varchar fn_dist="squared_dist_norm2", varchar agg_centroid="avg", integer max_num_iterations=20, float8 min_frac_reassigned=0.001) |
Perform Lloyd's k-means local-search heuristic, but with initial centroids stored in a table. More... | |
float8 | simple_silhouette (varchar rel_source, varchar expr_point, float8[][] centroids, varchar fn_dist="dist_norm2") |
Compute a simplified version of the silhouette coefficient. More... | |
Definition in file kmeans.sql_in.
kmeans_result kmeans | ( | varchar | rel_source, |
varchar | expr_point, | ||
float8 | initial_centroids[][], | ||
varchar | fn_dist = "squared_dist_norm2" , |
||
varchar | agg_centroid = "avg" , |
||
integer | max_num_iterations = 20 , |
||
float8 | min_frac_reassigned = 0.001 |
||
) |
rel_source | Name of the relation containing input points |
expr_point | Expression evaluating to point coordinates for each tuple |
initial_centroids | Matrix containing the initial centroids as columns |
fn_dist | Name of a function with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION that returns the distance between two points. The default is the squared Euclidean distance. |
agg_centroid | Name of an aggregate function with signature DOUBLE PRECISION[] -> DOUBLE PRECISION[] that, for each group of points, returns a centroid. In order for Lloyd's local-search heuristic to provably converge and to return a local minimum, this centroid should minimize the sum of distances between each point in the group and the centroid. The default is the average (mean/barycenter in Euclidean space), which satisfies this property if fn_dist = 'squared_dist_norm2' . |
max_num_iterations | Maximum number of iterations |
min_frac_reassigned | Fraction of reassigned points below which convergence is assumed and the algorithm terminates |
centroids
- Matrix with \( k \) centroids as columns.frac_reassigned
- Fraction of points that were assigned a different centroid in the last iteration.num_iterations
- The number of iterations before the algorithm terminated Definition at line 348 of file kmeans.sql_in.
kmeans_result kmeans | ( | varchar | rel_source, |
varchar | expr_point, | ||
varchar | rel_initial_centroids, | ||
varchar | expr_centroid, | ||
varchar | fn_dist = "squared_dist_norm2" , |
||
varchar | agg_centroid = "avg" , |
||
integer | max_num_iterations = 20 , |
||
float8 | min_frac_reassigned = 0.001 |
||
) |
This is a shortcut for running k-means with initial centroids stored in a table (as opposed to an array of centroids). It is equivalent to
SELECT kmeans( rel_source, expr_point, (SELECT matrix_agg($expr_centroid) FROM $rel_initial_centroids), fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
where $expr_centroid
and $rel_initial_centroids
denote textual substituions.
Definition at line 1093 of file kmeans.sql_in.
kmeans_result kmeans_random | ( | varchar | rel_source, |
varchar | expr_point, | ||
integer | k, | ||
varchar | fn_dist = "squared_dist_norm2" , |
||
varchar | agg_centroid = "avg" , |
||
integer | max_num_iterations = 20 , |
||
float8 | min_frac_reassigned = 0.001 |
||
) |
This is a shortcut for running k-means with random seeding. It is equivalent to
SELECT kmeans( rel_source, expr_point, kmeans_random_seeding( rel_source, expr_point, k ), fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
Definition at line 910 of file kmeans.sql_in.
float8 [][] kmeans_random_seeding | ( | varchar | rel_source, |
varchar | expr_point, | ||
integer | k, | ||
float8 | initial_centroids[][] = NULL |
||
) |
rel_source | Name of the relation containing input points |
expr_point | Expression evaluating to point coordinates for each tuple |
k | Number of centroids |
initial_centroids | A matrix containing up to \( k \) columns as columns. kmeanspp_seeding() proceeds exactly as if these centroids had already been generated in previous iterations. This parameter may be NULL in which all \( k \) centroids will be generated. |
Definition at line 804 of file kmeans.sql_in.
kmeans_result kmeanspp | ( | varchar | rel_source, |
varchar | expr_point, | ||
integer | k, | ||
varchar | fn_dist = "squared_dist_norm2" , |
||
varchar | agg_centroid = "avg" , |
||
integer | max_num_iterations = 20 , |
||
float8 | min_frac_reassigned = 0.001 |
||
) |
This is a shortcut for running k-means++. It is equivalent to
SELECT kmeans( rel_source, expr_point, kmeanspp_seeding( rel_source, expr_point, k, fn_dist ), fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned )
Definition at line 668 of file kmeans.sql_in.
float8 [][] kmeanspp_seeding | ( | varchar | rel_source, |
varchar | expr_point, | ||
integer | k, | ||
varchar | fn_dist = "squared_dist_norm2" , |
||
float8 | initial_centroids[][] = NULL |
||
) |
rel_source | Name of the relation containing input points |
expr_point | Expression evaluating to point coordinates for each tuple |
k | Number of centroids |
fn_dist | Name of a function with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION that returns the distance between two points |
initial_centroids | A matrix containing up to \( k \) columns as columns. kmeanspp_seeding() proceeds exactly as if these centroids had already been generated in previous iterations. This parameter may be NULL in which all \( k \) centroids will be generated. |
Definition at line 544 of file kmeans.sql_in.
float8 simple_silhouette | ( | varchar | rel_source, |
varchar | expr_point, | ||
float8 | centroids[][], | ||
varchar | fn_dist = "dist_norm2" |
||
) |
rel_source | Name of the relation containing input points |
expr_point | Expression evaluating to point coordinates \( x_i \) for each tuple |
centroids | Matrix \( M = (\vec{m_0} \dots \vec{m_{k-1}}) \in \mathbb{R}^{d \times k} \) with \( k \) columns, where column \( i \) contains the position of centroid \( i \). |
fn_dist | Name of a function with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION that returns the distance between two points |
\[ \frac{d_2( x_i ) - d_1(x_i)}{d_2(x_i)}, \]
where 0/0 is interpreted as 0. Clearly, the simplified silhouette coefficient assumes values in \( [0,1] \).Definition at line 1230 of file kmeans.sql_in.