MADlib
1.0 A newer version is available
User Documentation
|
Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In the k-means variant, one is given \( n \) points \( x_1, \dots, x_n \in \mathbb R^d \), and the goal is to position \( k \) centroids \( c_1, \dots, c_k \in \mathbb R^d \) so that the sum of distances between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest. Formally, we wish to minimize the following objective function:
\[ (c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j) \]
In the most common case, \( \operatorname{dist} \) is the square of the Euclidean distance.
This problem is computationally difficult (NP-hard), yet the local-search heuristic proposed by Lloyd [4] performs reasonably well in practice. In fact, it is so ubiquitous today that it is often referred to as the standard algorithm or even just the k-means algorithm [1]. It works as follows:
Since the objective function decreases in every step, this algorithm is guaranteed to converge to a local optimum.
Data points and predefined centroids (if used) are expected to be stored row-wise, in a column of type SVEC
(or any type convertible to SVEC
, like FLOAT[]
or INTEGER[]
). Data points with non-finite values (NULL, NaN, infinity) in any component will be skipped during analysis.
The following methods are available for the centroid seeding:
The following distance functions can be used (computation of barycenter/mean in parentheses):
The following aggregate functions for determining centroids can be used:
The algorithm stops when one of the following conditions is met:
A popular method to assess the quality of the clustering is the silhouette coefficient, a simplified version of which is provided as part of the k-means module. Note that for large data sets, this computation is expensive.
{TABLE|VIEW} rel_source ( ... expr_points FLOAT8[], ... )where:
svec
or INTEGER[]
are implicitly converted to FLOAT8[]
.If kmeans is called with a set of initial centroids, the centroid relation is expected to be of the following form:
{TABLE|VIEW} rel_initial_centroids ( ... expr_centroid DOUBLE PRECISION[], ... )
where:
SELECT * FROM kmeans_random( 'rel_source', 'expr_point', k, [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ] );
SELECT * FROM kmeanspp( 'rel_source', 'expr_point', k, [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ] );
SELECT * FROM kmeans( 'rel_source', 'expr_point', 'rel_initial_centroids', 'expr_centroid', [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ] );---------— OR ------------—
SELECT * FROM kmeans( 'rel_source', 'expr_point', initial_centroids, [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ] );where:
DOUBLE PRECISION[][]
.The output of the k-means module is a table that includes the final centroid positions (DOUBLE PRECISION[][]), the objective function, the fraction of reassigned points in the last iteration, and the number of total iterations:
centroids | objective_fn | frac_reassigned | num_iterations ----------------------------------+------------------+-----------------+---------------- ...
SVEC
.[1] Wikipedia, K-means Clustering, http://en.wikipedia.org/wiki/K-means_clustering
[2] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07), pp. 1027-1035, http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf
[3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering Gene-Expression Data: A Hybrid Approach that Iterates Between k-Means and Evolutionary Search. In: Studies in Computational Intelligence - Hybrid Evolutionary Algorithms. pp. 313-335. Springer. 2007.
[4] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell Laboratories. Published much later in: IEEE Transactions on Information Theory 28(2), pp. 128-137. 1982.
[5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis. In: Computational Statistics and Data Analysis, 51(2). pp. 526-544. 2006.