User Documentation
 All Files Functions Groups
kmeans.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file kmeans.sql_in
4  *
5  * @brief Set of functions for k-means clustering.
6  *
7  * @sa For a brief introduction to k-means clustering, see the module
8  * description \ref grp_kmeans.
9  *
10  *//* ----------------------------------------------------------------------- */
11 
12 m4_include(`SQLCommon.m4')
13 
14 /**
15 @addtogroup grp_kmeans
16 
17 @about
18 
19 Clustering refers to the problem of partitioning a set of objects according to
20 some problem-dependent measure of <em>similarity</em>. In the k-means variant,
21 one is given \f$ n \f$ points \f$ x_1, \dots, x_n \in \mathbb R^d \f$, and the
22 goal is to position \f$ k \f$ centroids \f$ c_1, \dots, c_k \in \mathbb R^d \f$
23 so that the sum of \em distances between each point and its closest centroid is
24 minimized. Each centroid represents a cluster that consists of all points to
25 which this centroid is closest. Formally, we wish to minimize the following
26 objective function:
27 \f[
28  (c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j)
29 \f]
30 In the most common case, \f$ \operatorname{dist} \f$ is the square of the
31 Euclidean distance.
32 
33 This problem is computationally difficult (NP-hard), yet the
34 local-search heuristic proposed by Lloyd [4] performs reasonably well in
35 practice. In fact, it is so ubiquitous today that it is
36 often referred to as the <em>standard algorithm</em> or even just the
37 <em>k-means algorithm</em> [1]. It works as follows:
38 
39 -# Seed the \f$ k \f$ centroids (see below)
40 -# Repeat until convergence:
41  -# Assign each point to its closest centroid
42  -# Move each centroid to a position that minimizes the sum of distances in this
43  cluster
44 -# Convergence is achieved when no points change their assignments during step
45  2a.
46 
47 Since the objective function decreases in every step, this algorithm is
48 guaranteed to converge to a local optimum.
49 
50 @implementation
51 
52 Data points and predefined centroids (if used) are expected to be stored row-wise,
53 in a column of type <tt>\ref grp_svec "SVEC"</tt> (or any type convertible to
54 <tt>\ref grp_svec "SVEC"</tt>, like <tt>FLOAT[]</tt> or <tt>INTEGER[]</tt>).
55 Data points with non-finite values (NULL, NaN, infinity) in any component
56 will be skipped during analysis.
57 
58 The following methods are available for the centroid seeding:
59  - <strong>random selection</strong>:
60  Select \f$ k \f$ centroids randomly among the input points.
61  - <strong>kmeans++</strong> [2]:
62  Start with a single centroid chosen randomly among the input points. Then
63  iteratively choose new
64  centroids from the input points until there is a total of \f$ k \f$
65  centroids. The probability for picking a particular point is proportional to
66  its minimum distance to any existing centroid.
67  \n
68  Intuitively, kmeans++ favors seedings where centroids are spread out over the
69  whole range of the input points, while at the same time not being too
70  susceptible to outliers [2].
71  - <strong>user-specified set of initial centroids</strong>:
72  See below for a description of the expected format of the set of initial
73  centroids.
74 
75 The following distance functions can be used (computation of barycenter/mean in parentheses):
76  - <strong>\ref dist_norm1</strong>: 1-norm/Manhattan (element-wise median [Note that MADlib does not provide a median
77  aggregate function for support and performance reasons.])
78  - <strong>\ref dist_norm2</strong>: 2-norm/Euclidean (element-wise mean)
79  - <strong>\ref squared_dist_norm2</strong>: squared Euclidean distance (element-wise mean)
80  - <strong>\ref dist_angle</strong>: angle (element-wise mean of normalized points)
81  - <strong>\ref dist_tanimoto</strong>: tanimoto (element-wise mean of normalized points [5])
82  - <strong>user defined function</strong> with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION
83 
84 The following aggregate functions for determining centroids can be used:
85  - <strong>\ref avg</strong>: average
86  - <strong>\ref normalized_avg</strong>: normalized average
87 
88 The algorithm stops when one of the following conditions is met:
89  - The fraction of updated points is smaller than the convergence threshold (default: 0.001).
90  - The algorithm reaches the maximum number of allowed iterations (default: 20).
91 
92 A popular method to assess the quality of the clustering is the
93 <em>silhouette coefficient</em>, a simplified version of which is provided as part of the k-means module.
94 Note that for large data sets, this computation is expensive.
95 
96 @input
97 The <strong>source relation</strong> is expected to be of the following form (or
98 to be implicitly convertible into the following form):
99 <pre>{TABLE|VIEW} <em>rel_source</em> (
100  ...
101  <em>expr_points</em> FLOAT8[],
102  ...
103 )</pre>
104 where:
105  - <em>expr_points</em> is the name of a column with point coordinates.
106  Types such as \c svec or <tt>INTEGER[]</tt> are implicitly converted to
107  <tt>FLOAT8[]</tt>.
108 
109 If kmeans is called with a set of initial centroids, the centroid relation is
110 expected to be of the following form:
111 <pre>{TABLE|VIEW} <em>rel_initial_centroids</em> (
112  ...
113  <em>expr_centroid</em> DOUBLE PRECISION[],
114  ...
115 )</pre>
116 where:
117  - <em>expr_centroid</em> is the name of a column with coordinates.
118 
119 @usage
120 The k-means algorithm can be invoked in four possible ways:
121 
122 - using <em>random</em> centroid seeding method for a
123 provided \f$ k \f$:
124 <pre>SELECT * FROM \ref kmeans_random(
125  '<em>rel_source</em>', '<em>expr_point</em>', k,
126  [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
127  <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
128 );</pre>
129 
130 - using <em>kmeans++</em> centroid seeding method for a
131 provided \f$ k \f$:
132 <pre>SELECT * FROM \ref kmeanspp(
133  '<em>rel_source</em>', '<em>expr_point</em>', k,
134  [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
135  <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
136 );</pre>
137 
138 - with a provided centroid set:
139 <pre>SELECT * FROM \ref kmeans(
140  '<em>rel_source</em>', '<em>expr_point</em>',
141  '<em>rel_initial_centroids</em>', '<em>expr_centroid</em>',
142  [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
143  <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
144 );</pre>
145 ------------ OR ---------------
146 <pre>SELECT * FROM \ref kmeans(
147  '<em>rel_source</em>', '<em>expr_point</em>',
148  initial_centroids,
149  [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
150  <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
151 );</pre>
152 where:
153  - <em>initial_centroids</em> is of type <tt>DOUBLE PRECISION[][]</tt>.
154 
155 The output of the k-means module is a table that includes the final
156 centroid positions (DOUBLE PRECISION[][]), the objective function,
157 the fraction of reassigned points in the last iteration, and
158 the number of total iterations:
159 <pre>
160  centroids | objective_fn | frac_reassigned | num_iterations
161 ----------------------------------+------------------+-----------------+----------------
162  ...
163 </pre>
164 
165 @examp
166 
167 -# Prepare some input data.
168 \code
169 sql> SELECT * FROM public.km_sample LIMIT 5;
170  points
171 -------------------------------------------
172  {1,1}:{15.8822241332382,105.945462542586}
173  {1,1}:{34.5065216883086,72.3126099305227}
174  {1,1}:{22.5074400822632,95.3209559689276}
175  {1,1}:{70.2589857042767,68.7395178806037}
176  {1,1}:{30.9844257542863,25.3213323024102}
177 (5 rows)
178 \endcode
179 Note: the example <em>points</em> is type <tt>\ref grp_svec "SVEC"</tt>.
180 -# Run k-means clustering using kmeans++ for centroid seeding:
181 \code
182 sql> SELECT * FROM madlib.kmeanspp('km_sample', 'points', 2, 'madlib.squared_dist_norm2', 'madlib.avg', 20, 0.001);
183 );
184  centroids | objective_fn | frac_reassigned | num_iterations
185 -------------------------------------------------------------------------+------------------+-----------------+----------------
186  {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | 5
187 \endcode
188 -# Calculate the simplified silhouette coefficient:
189 \code
190 sql> SELECT * from madlib.simple_silhouette('km_test_svec','points',
191  (select centroids from madlib.kmeanspp('km_test_svec','points',2,'madlib.squared_dist_norm2','madlib.avg',20,0.001)),
192  'madlib.dist_norm2');
193  simple_silhouette
194 -------------------
195  0.611022970398174
196 \endcode
197 
198 @literature
199 
200 [1] Wikipedia, K-means Clustering,
201  http://en.wikipedia.org/wiki/K-means_clustering
202 
203 [2] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful
204  seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete
205  Algorithms (SODA'07), pp. 1027-1035,
206  http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf
207 
208 [3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering
209  Gene-Expression Data: A Hybrid Approach that Iterates Between k-Means and
210  Evolutionary Search. In: Studies in Computational Intelligence - Hybrid
211  Evolutionary Algorithms. pp. 313-335. Springer. 2007.
212 
213 [4] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell
214  Laboratories. Published much later in: IEEE Transactions on Information
215  Theory 28(2), pp. 128-137. 1982.
216 
217 [5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis. In: Computational
218  Statistics and Data Analysis, 51(2). pp. 526-544. 2006.
219 
220 @sa File kmeans.sql_in documenting the SQL functions.
221 
222 @internal
223 @sa namespace kmeans (documenting the implementation in Python)
224 @endinternal
225 */
226 
227 /*
228  * @brief k-Means return type
229  *
230  * A composite value:
231  * - <tt>centroids</tt> - Matrix containing the new \f$ l \leq k \f$
232  * repositioned centroids as columns. If this matrix has \f$ l < k \f$
233  * columns, one or more old centroids no longer were closest to any point.
234  * - <tt>old_centroid_its</tt> - The order of the centroids in
235  * <tt>centroid</tt> is not guaranteed to be consitent across iterations.
236  * In particular, if a centroid is no longer closest to any point it can be
237  * dropped and a new centroid is added afterwards. We therefore need to map
238  * positions in <tt>centroids</tt> to the respective positions in the
239  * previous iteration.
240  * - <tt>objective_fn</tt> - Value of the objective function, i.e.,
241  * \f$ \sum_{x \in P} \dist(x, C)^2 \f$ where
242  * \f$ P \f$ is the set of points, \f$ C \f$ is the set of centroids, and
243  * \f$ \dist(x, C) := \min_{c \in C} \operatorname{dist}(x, c) \f$.
244  * - <tt>frac_reassigned</tt> - Fraction of points that was assigned a
245  * different centroid in the current iteration.
246  * - <tt>num_iterations</tt> - Number of iterations performed (so far).
247  */
248 CREATE TYPE MADLIB_SCHEMA.kmeans_result AS (
249  centroids DOUBLE PRECISION[][],
250  objective_fn DOUBLE PRECISION,
251  frac_reassigned DOUBLE PRECISION,
252  num_iterations INTEGER
253 );
254 
255 /*
256  * @brief k-Means inter-iteration state type
257  *
258  * A composite value like \ref{kmeans_result}. Additional fields:
259  * - <tt>old_centroid_its</tt> - The order of the centroids in
260  * <tt>centroid</tt> is not guaranteed to be consitent across iterations.
261  * In particular, if a centroid is no longer closest to any point it can be
262  * dropped and a new centroid is added afterwards. We therefore need to map
263  * positions in <tt>centroids</tt> to the respective positions in the
264  * previous iteration.
265  * - <tt>num_iterations</tt> - Number of iterations performed (so far).
266  */
267 CREATE TYPE MADLIB_SCHEMA.kmeans_state AS (
268  centroids DOUBLE PRECISION[][],
269  old_centroid_ids INTEGER[],
270  objective_fn DOUBLE PRECISION,
271  frac_reassigned DOUBLE PRECISION
272 );
273 
274 /**
275  * @internal
276  * @brief Execute a SQL command where $1, ..., $4 are substituted with the
277  * given arguments.
278  */
279 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_args(
280  sql VARCHAR, DOUBLE PRECISION[][], REGPROC, INTEGER, DOUBLE PRECISION
281 ) RETURNS VOID
282 VOLATILE
283 CALLED ON NULL INPUT
284 LANGUAGE c
285 AS 'MODULE_PATHNAME', 'exec_sql_using';
286 
287 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans(
288  rel_args VARCHAR,
289  rel_state VARCHAR,
290  rel_source VARCHAR,
291  expr_point VARCHAR,
292  agg_centroid VARCHAR)
293 RETURNS INTEGER
294 VOLATILE
295 LANGUAGE plpythonu
296 AS $$PythonFunction(kmeans, kmeans, compute_kmeans)$$;
297 
298 /**
299  * @brief Perform Lloyd's k-means local-search heuristic
300  *
301  * @param rel_source Name of the relation containing input points
302  * @param expr_point Expression evaluating to point coordinates for each tuple
303  * @param initial_centroids Matrix containing the initial centroids as columns
304  * @param fn_dist Name of a function with signature
305  * <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that
306  * returns the distance between two points. The default is the
307  * \ref squared_dist_norm2(float8[],float8[]) "squared Euclidean distance".
308  * @param agg_centroid Name of an aggregate function with signature
309  * <tt>DOUBLE PRECISION[] -> DOUBLE PRECISION[]</tt> that, for each group
310  * of points, returns a centroid. In order for Lloyd's local-search
311  * heuristic to provably converge and to return a local minimum, this
312  * centroid should minimize the sum of distances between each point in the
313  * group and the centroid. The default is the
314  * \ref avg(float8[]) "average (mean/barycenter in Euclidean space)",
315  * which satisfies this property if <tt>fn_dist = 'squared_dist_norm2'</tt>.
316  * @param max_num_iterations Maximum number of iterations
317  * @param min_frac_reassigned Fraction of reassigned points below which
318  * convergence is assumed and the algorithm terminates
319  * @returns A composite value:
320  * - <tt>centroids</tt> - Matrix with \f$ k \f$ centroids as columns.
321  * - <tt>frac_reassigned</tt> - Fraction of points that were assigned a
322  * different centroid in the last iteration.
323  * - <tt>num_iterations</tt> - The number of iterations before the
324  * algorithm terminated
325  */
326 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
327  rel_source VARCHAR,
328  expr_point VARCHAR,
329  initial_centroids DOUBLE PRECISION[][],
330  fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
331  agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
332  max_num_iterations INTEGER /*+ DEFAULT 20 */,
333  min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
334 ) RETURNS MADLIB_SCHEMA.kmeans_result AS $$
335 DECLARE
336  theIteration INTEGER;
337  theResult MADLIB_SCHEMA.kmeans_result;
338  oldClientMinMessages VARCHAR;
339  class_rel_source REGCLASS;
340  proc_fn_dist REGPROCEDURE;
341  proc_agg_centroid REGPROCEDURE;
342  rel_filtered VARCHAR;
343  num_points INTEGER;
344  k INTEGER;
345  centroids FLOAT8[];
346 BEGIN
347  IF (array_upper(initial_centroids,1) IS NULL) THEN
348  RAISE EXCEPTION 'No valid initial centroids given.';
349  END IF;
350 
351  centroids := ARRAY(SELECT unnest(initial_centroids));
352  IF (SELECT MADLIB_SCHEMA.svec_elsum(centroids)) >= 'Infinity'::float THEN
353  RAISE EXCEPTION 'At least one initial centroid has non-finite values.';
354  END IF;
355 
356  rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
357  class_rel_source := rel_filtered;
358  proc_fn_dist := fn_dist
359  || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
360  IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
361  FROM pg_proc WHERE oid = proc_fn_dist) THEN
362  RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
363  END IF;
364  proc_agg_centroid := agg_centroid || '(DOUBLE PRECISION[])';
365  IF (SELECT prorettype != 'DOUBLE PRECISION[]'::regtype OR proisagg = FALSE
366  FROM pg_proc WHERE oid = proc_agg_centroid) THEN
367  RAISE EXCEPTION 'Mean aggregate has wrong signature or is not an aggregate.';
368  END IF;
369  IF (min_frac_reassigned < 0) OR (min_frac_reassigned > 1) THEN
370  RAISE EXCEPTION 'Convergence threshold is not a valid value (must be a fraction between 0 and 1).';
371  END IF;
372  IF (max_num_iterations < 0) THEN
373  RAISE EXCEPTION 'Number of iterations must be a non-negative integer.';
374  END IF;
375 
376  -- Extra parameter check added so that ERROR output is more user-readable (doesn't include Python traceback)
377  k := array_upper(initial_centroids,1);
378  IF (k <= 0) THEN
379  RAISE EXCEPTION 'Number of clusters k must be a positive integer.';
380  END IF;
381  IF (k > 32767) THEN
382  RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
383  END IF;
384  EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points ;
385  IF (num_points < k) THEN
386  RAISE EXCEPTION 'Number of centroids is greater than number of points.';
387  END IF;
388 
389  -- We first setup the argument table. Rationale: We want to avoid all data
390  -- conversion between native types and Python code. Instead, we use Python
391  -- as a pure driver layer.
392  PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
393  oldClientMinMessages :=
394  (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
395  EXECUTE 'SET client_min_messages TO warning';
396 
397  -- Unfortunately, the EXECUTE USING syntax is only available starting
398  -- PostgreSQL 8.4:
399  -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
400  -- We therefore have to emulate.
401  PERFORM MADLIB_SCHEMA.internal_execute_using_kmeans_args($sql$
402  DROP TABLE IF EXISTS pg_temp._madlib_kmeans_args;
403  CREATE TABLE pg_temp._madlib_kmeans_args AS
404  SELECT
405  $1 AS initial_centroids, array_upper($1, 1) AS k,
406  $2 AS fn_dist, $3 AS max_num_iterations,
407  $4 AS min_frac_reassigned;
408  $sql$,
409  initial_centroids, proc_fn_dist, max_num_iterations,
410  min_frac_reassigned);
411  EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
412 
413  -- Perform acutal computation.
414  -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion
415  -- operators from regclass to varchar/text.
416  theIteration := MADLIB_SCHEMA.internal_compute_kmeans('_madlib_kmeans_args',
417  '_madlib_kmeans_state',
418  textin(regclassout(class_rel_source)), expr_point,
419  textin(regprocout(proc_agg_centroid)));
420 
421  -- Retrieve result from state table and return it
422  EXECUTE
423  $sql$
424  SELECT (_state).centroids, (_state).objective_fn,
425  (_state).frac_reassigned, NULL
426  FROM _madlib_kmeans_state
427  WHERE _iteration = $sql$ || theIteration || $sql$
428  $sql$
429  INTO theResult;
430  -- The number of iterations are not updated in the C++ code. We do it here.
431  IF NOT (theResult IS NULL) THEN
432  theResult.num_iterations = theIteration;
433  END IF;
434  RETURN theResult;
435 END;
436 $$ LANGUAGE plpgsql VOLATILE;
437 
438 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
439  rel_source VARCHAR,
440  expr_point VARCHAR,
441  initial_centroids DOUBLE PRECISION[][],
442  fn_dist VARCHAR,
443  agg_centroid VARCHAR,
444  max_num_iterations INTEGER
445 ) RETURNS MADLIB_SCHEMA.kmeans_result
446 VOLATILE
447 STRICT
448 LANGUAGE sql AS $$
449  SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, $5, $6, 0.001)
450 $$;
451 
452 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
453  rel_source VARCHAR,
454  expr_point VARCHAR,
455  initial_centroids DOUBLE PRECISION[][],
456  fn_dist VARCHAR,
457  agg_centroid VARCHAR
458 ) RETURNS MADLIB_SCHEMA.kmeans_result
459 VOLATILE
460 STRICT
461 LANGUAGE sql AS $$
462  SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, $5, 20, 0.001)
463 $$;
464 
465 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
466  rel_source VARCHAR,
467  expr_point VARCHAR,
468  initial_centroids DOUBLE PRECISION[][],
469  fn_dist VARCHAR
470 ) RETURNS MADLIB_SCHEMA.kmeans_result
471 VOLATILE
472 STRICT
473 LANGUAGE sql AS $$
474  SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, 'MADLIB_SCHEMA.avg', 20,
475  0.001)
476 $$;
477 
478 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
479  rel_source VARCHAR,
480  expr_point VARCHAR,
481  initial_centroids DOUBLE PRECISION[][]
482 ) RETURNS MADLIB_SCHEMA.kmeans_result
483 VOLATILE
484 STRICT
485 LANGUAGE sql AS $$
486  SELECT MADLIB_SCHEMA.kmeans($1, $2, $3,
487  'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001)
488 $$;
489 
490 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeanspp_seeding_args(
491  sql VARCHAR, INTEGER, REGPROC, DOUBLE PRECISION[][]
492 ) RETURNS VOID
493 VOLATILE
494 CALLED ON NULL INPUT
495 LANGUAGE c
496 AS 'MODULE_PATHNAME', 'exec_sql_using';
497 
498 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeanspp_seeding(
499  rel_args VARCHAR,
500  rel_state VARCHAR,
501  rel_source VARCHAR,
502  expr_point VARCHAR)
503 RETURNS INTEGER
504 AS $$PythonFunction(kmeans, kmeans, compute_kmeanspp_seeding)$$
505 LANGUAGE plpythonu VOLATILE;
506 
507 /**
508  * @brief k-Means++ Seeding
509  *
510  * @param rel_source Name of the relation containing input points
511  * @param expr_point Expression evaluating to point coordinates for each tuple
512  * @param k Number of centroids
513  * @param fn_dist Name of a function with signature
514  * <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that
515  * returns the distance between two points
516  * @param initial_centroids A matrix containing up to \f$ k \f$ columns as
517  * columns. kmeanspp_seeding() proceeds exactly as if these centroids had
518  * already been generated in previous iterations. This parameter may be
519  * NULL in which all \f$ k \f$ centroids will be generated.
520  * @returns A matrix containing \f$ k \f$ centroids as columns
521  */
522 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding(
523  rel_source VARCHAR,
524  expr_point VARCHAR,
525  k INTEGER,
526  fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
527  initial_centroids DOUBLE PRECISION[][] /*+ DEFAULT NULL */
528 ) RETURNS DOUBLE PRECISION[][] AS $$
529 DECLARE
530  theIteration INTEGER;
531  theResult DOUBLE PRECISION[][];
532  oldClientMinMessages VARCHAR;
533  class_rel_source REGCLASS;
534  proc_fn_dist REGPROCEDURE;
535  num_points INTEGER;
536  num_centroids INTEGER;
537  rel_filtered VARCHAR;
538 BEGIN
539  rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
540  class_rel_source := rel_filtered;
541 
542  IF (initial_centroids IS NOT NULL) THEN
543  num_centroids := array_upper(initial_centroids,1);
544  ELSE
545  num_centroids := k;
546  END IF;
547 
548  proc_fn_dist := fn_dist
549  || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
550  IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
551  FROM pg_proc WHERE oid = proc_fn_dist) THEN
552  RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
553  END IF;
554  IF (k <= 0) THEN
555  RAISE EXCEPTION 'Number of clusters k must be a positive integer.';
556  END IF;
557  IF (k > 32767) THEN
558  RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
559  END IF;
560  EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points ;
561  IF (num_points < k OR num_points < num_centroids) THEN
562  RAISE EXCEPTION 'Number of centroids is greater than number of points.';
563  END IF;
564  IF (k < num_centroids) THEN
565  RAISE WARNING 'Number of clusters k is less than number of supplied initial centroids. Number of final clusters will equal number of supplied initial centroids.';
566  END IF;
567 
568  -- We first setup the argument table. Rationale: We want to avoid all data
569  -- conversion between native types and Python code. Instead, we use Python
570  -- as a pure driver layer.
571  oldClientMinMessages :=
572  (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
573  EXECUTE 'SET client_min_messages TO warning';
574  PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
575  -- Unfortunately, the EXECUTE USING syntax is only available starting
576  -- PostgreSQL 8.4:
577  -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
578  -- We therefore have to emulate.
579  PERFORM MADLIB_SCHEMA.internal_execute_using_kmeanspp_seeding_args($sql$
580  DROP TABLE IF EXISTS pg_temp._madlib_kmeanspp_args;
581  CREATE TEMPORARY TABLE _madlib_kmeanspp_args AS
582  SELECT $1 AS k, $2 AS fn_dist, $3 AS initial_centroids;
583  $sql$,
584  k, proc_fn_dist, initial_centroids);
585  EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
586 
587  -- Perform acutal computation.
588  -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion
589  -- operators from regclass to varchar/text.
590  theIteration := (
591  SELECT MADLIB_SCHEMA.internal_compute_kmeanspp_seeding(
592  '_madlib_kmeanspp_args', '_madlib_kmeanspp_state',
593  textin(regclassout(class_rel_source)), expr_point)
594  );
595 
596  -- Retrieve result from state table and return it
597  EXECUTE
598  $sql$
599  SELECT _state FROM _madlib_kmeanspp_state
600  WHERE _iteration = $sql$ || theIteration || $sql$
601  $sql$
602  INTO theResult;
603  RETURN theResult;
604 END;
605 $$ LANGUAGE plpgsql VOLATILE;
606 
607 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding(
608  rel_source VARCHAR,
609  expr_point VARCHAR,
610  k INTEGER,
611  fn_dist VARCHAR
612 ) RETURNS DOUBLE PRECISION[][]
613 LANGUAGE sql AS $$
614  SELECT MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4, NULL)
615 $$;
616 
617 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding(
618  rel_source VARCHAR,
619  expr_point VARCHAR,
620  k INTEGER
621 ) RETURNS DOUBLE PRECISION[][]
622 LANGUAGE sql AS $$
623  SELECT MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3,
624  'MADLIB_SCHEMA.squared_dist_norm2', NULL)
625 $$;
626 
627 /**
628  * @brief Run k-Means++.
629  *
630  * This is a shortcut for running k-means++. It is equivalent to
631  * <pre>SELECT \ref kmeans(
632  rel_source,
633  expr_point,
634  \ref kmeanspp_seeding(
635  rel_source,
636  expr_point,
637  k,
638  fn_dist
639  ),
640  fn_dist,
641  agg_centroid,
642  max_num_iterations,
643  min_frac_reassigned
644 )</pre>
645  */
646 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
647  rel_source VARCHAR,
648  expr_point VARCHAR,
649  k INTEGER,
650  fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
651  agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
652  max_num_iterations INTEGER /*+ DEFAULT 20 */,
653  min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
654 ) RETURNS MADLIB_SCHEMA.kmeans_result
655 VOLATILE
656 STRICT
657 LANGUAGE plpgsql
658 AS $$
659 DECLARE
660  ret MADLIB_SCHEMA.kmeans_result;
661 BEGIN
662  ret = MADLIB_SCHEMA.kmeans(
663  $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
664  $4, $5, $6, $7);
665  RETURN ret;
666 END
667 $$;
669 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
670  rel_source VARCHAR,
671  expr_point VARCHAR,
672  k INTEGER,
673  fn_dist VARCHAR,
674  agg_centroid VARCHAR,
675  max_num_iterations INTEGER
676 ) RETURNS MADLIB_SCHEMA.kmeans_result
677 VOLATILE
678 STRICT
679 LANGUAGE plpgsql
680 AS $$
681 DECLARE
682  ret MADLIB_SCHEMA.kmeans_result;
683 BEGIN
684  ret = MADLIB_SCHEMA.kmeans(
685  $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
686  $4, $5, $6, 0.001);
687  RETURN ret;
688 END
689 $$;
690 
691 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
692  rel_source VARCHAR,
693  expr_point VARCHAR,
694  k INTEGER,
695  fn_dist VARCHAR,
696  agg_centroid VARCHAR
697 ) RETURNS MADLIB_SCHEMA.kmeans_result
698 VOLATILE
699 STRICT
700 LANGUAGE plpgsql
701 AS $$
702 DECLARE
703  ret MADLIB_SCHEMA.kmeans_result;
704 BEGIN
705  ret = MADLIB_SCHEMA.kmeans(
706  $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
707  $4, $5, 20, 0.001);
708  RETURN ret;
709 END
710 $$;
711 
712 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
713  rel_source VARCHAR,
714  expr_point VARCHAR,
715  k INTEGER,
716  fn_dist VARCHAR
717 ) RETURNS MADLIB_SCHEMA.kmeans_result
718 VOLATILE
719 STRICT
720 LANGUAGE plpgsql
721 AS $$
722 DECLARE
723  ret MADLIB_SCHEMA.kmeans_result;
724 BEGIN
725  ret = MADLIB_SCHEMA.kmeans(
726  $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
727  $4, 'MADLIB_SCHEMA.avg', 20, 0.001);
728  RETURN ret;
729 END
730 $$;
731 
732 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
733  rel_source VARCHAR,
734  expr_point VARCHAR,
735  k INTEGER
736 ) RETURNS MADLIB_SCHEMA.kmeans_result
737 VOLATILE
738 STRICT
739 LANGUAGE plpgsql
740 AS $$
741 DECLARE
742  ret MADLIB_SCHEMA.kmeans_result;
743 BEGIN
744  ret = MADLIB_SCHEMA.kmeans(
745  $1, $2,
746  MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3,
747  'MADLIB_SCHEMA.squared_dist_norm2'),
748  'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001);
749  RETURN ret;
750 END
751 $$;
752 
753 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_random_seeding_args(
754  sql VARCHAR, INTEGER, DOUBLE PRECISION[][]
755 ) RETURNS VOID
756 VOLATILE
757 CALLED ON NULL INPUT
758 LANGUAGE c
759 AS 'MODULE_PATHNAME', 'exec_sql_using';
760 
761 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans_random_seeding(
762  rel_args VARCHAR,
763  rel_state VARCHAR,
764  rel_source VARCHAR,
765  expr_point VARCHAR)
766 RETURNS INTEGER
767 AS $$PythonFunction(kmeans, kmeans, compute_kmeans_random_seeding)$$
768 LANGUAGE plpythonu VOLATILE;
769 
770 /**
771  * @brief k-Means Random Seeding
772  *
773  * @param rel_source Name of the relation containing input points
774  * @param expr_point Expression evaluating to point coordinates for each tuple
775  * @param k Number of centroids
776  * @param initial_centroids A matrix containing up to \f$ k \f$ columns as
777  * columns. kmeanspp_seeding() proceeds exactly as if these centroids had
778  * already been generated in previous iterations. This parameter may be
779  * NULL in which all \f$ k \f$ centroids will be generated.
780  * @returns A matrix containing \f$ k \f$ centroids as columns
781  */
782 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random_seeding(
783  rel_source VARCHAR,
784  expr_point VARCHAR,
785  k INTEGER,
786  initial_centroids DOUBLE PRECISION[][] /*+ DEFAULT NULL */
787 ) RETURNS DOUBLE PRECISION[][] AS $$
788 DECLARE
789  theIteration INTEGER;
790  theResult DOUBLE PRECISION[][];
791  oldClientMinMessages VARCHAR;
792  class_rel_source REGCLASS;
793  proc_fn_dist REGPROCEDURE;
794  num_points INTEGER;
795  num_centroids INTEGER;
796  rel_filtered VARCHAR;
797 BEGIN
798  rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
799  class_rel_source := rel_filtered;
800 
801  IF (initial_centroids IS NOT NULL) THEN
802  num_centroids := array_upper(initial_centroids,1);
803  ELSE
804  num_centroids := k;
805  END IF;
806 
807  IF (k <= 0) THEN
808  RAISE EXCEPTION 'Number of clusters k must be a positive integer.';
809  END IF;
810  IF (k > 32767) THEN
811  RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
812  END IF;
813  EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points;
814  IF (num_points < k OR num_points < num_centroids) THEN
815  RAISE EXCEPTION 'Number of centroids is greater than number of points.';
816  END IF;
817  IF (k < num_centroids) THEN
818  RAISE WARNING 'Number of clusters k is less than number of supplied initial centroids. Number of final clusters will equal number of supplied initial centroids.';
819  END IF;
820 
821  -- We first setup the argument table. Rationale: We want to avoid all data
822  -- conversion between native types and Python code. Instead, we use Python
823  -- as a pure driver layer.
824  oldClientMinMessages :=
825  (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
826  EXECUTE 'SET client_min_messages TO warning';
827  PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
828  -- Unfortunately, the EXECUTE USING syntax is only available starting
829  -- PostgreSQL 8.4:
830  -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
831  -- We therefore have to emulate.
832  PERFORM MADLIB_SCHEMA.internal_execute_using_kmeans_random_seeding_args($sql$
833  DROP TABLE IF EXISTS pg_temp._madlib_kmeans_random_args;
834  CREATE TEMPORARY TABLE _madlib_kmeans_random_args AS
835  SELECT $1 AS k, $2 AS initial_centroids;
836  $sql$,
837  k, initial_centroids);
838  EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
839 
840  -- Perform acutal computation.
841  -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion
842  -- operators from regclass to varchar/text.
843  theIteration := (
844  SELECT MADLIB_SCHEMA.internal_compute_kmeans_random_seeding(
845  '_madlib_kmeans_random_args', '_madlib_kmeans_random_state',
846  textin(regclassout(class_rel_source)), expr_point)
847  );
848 
849  -- Retrieve result from state table and return it
850  EXECUTE
851  $sql$
852  SELECT _state FROM _madlib_kmeans_random_state
853  WHERE _iteration = $sql$ || theIteration || $sql$
854  $sql$
855  INTO theResult;
856  RETURN theResult;
857 END;
858 $$ LANGUAGE plpgsql VOLATILE;
859 
860 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random_seeding(
861  rel_source VARCHAR,
862  expr_point VARCHAR,
863  k INTEGER
864 ) RETURNS DOUBLE PRECISION[][]
865 LANGUAGE sql AS $$
866  SELECT MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3, NULL)
867 $$;
868 
869 /**
870  * @brief Run k-Means with random seeding.
871  *
872  * This is a shortcut for running k-means with random seeding. It is equivalent
873  * to
874  * <pre>SELECT \ref kmeans(
875  rel_source,
876  expr_point,
877  \ref kmeans_random_seeding(
878  rel_source,
879  expr_point,
880  k
881  ),
882  fn_dist,
883  agg_centroid,
884  max_num_iterations,
885  min_frac_reassigned
886 )</pre>
887  */
888 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
889  rel_source VARCHAR,
890  expr_point VARCHAR,
891  k INTEGER,
892  fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
893  agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
894  max_num_iterations INTEGER /*+ DEFAULT 20 */,
895  min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
896 ) RETURNS MADLIB_SCHEMA.kmeans_result
897 VOLATILE
898 STRICT
899 LANGUAGE plpgsql
900 AS $$
901 DECLARE
902  ret MADLIB_SCHEMA.kmeans_result;
903 BEGIN
904  ret = MADLIB_SCHEMA.kmeans(
905  $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
906  $4, $5, $6, $7);
907  RETURN ret;
908 END
909 $$;
911 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
912  rel_source VARCHAR,
913  expr_point VARCHAR,
914  k INTEGER,
915  fn_dist VARCHAR,
916  agg_centroid VARCHAR,
917  max_num_iterations INTEGER
918 ) RETURNS MADLIB_SCHEMA.kmeans_result
919 VOLATILE
920 STRICT
921 LANGUAGE plpgsql
922 AS $$
923 DECLARE
924  ret MADLIB_SCHEMA.kmeans_result;
925 BEGIN
926  ret = MADLIB_SCHEMA.kmeans(
927  $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
928  $4, $5, $6, 0.001);
929  RETURN ret;
930 END
931 $$;
932 
933 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
934  rel_source VARCHAR,
935  expr_point VARCHAR,
936  k INTEGER,
937  fn_dist VARCHAR,
938  agg_centroid VARCHAR
939 ) RETURNS MADLIB_SCHEMA.kmeans_result
940 VOLATILE
941 STRICT
942 LANGUAGE plpgsql
943 AS $$
944 DECLARE
945  ret MADLIB_SCHEMA.kmeans_result;
946 BEGIN
947  ret = MADLIB_SCHEMA.kmeans(
948  $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
949  $4, $5, 20, 0.001);
950  RETURN ret;
951 END
952 $$;
953 
954 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
955  rel_source VARCHAR,
956  expr_point VARCHAR,
957  k INTEGER,
958  fn_dist VARCHAR
959 ) RETURNS MADLIB_SCHEMA.kmeans_result
960 VOLATILE
961 STRICT
962 LANGUAGE plpgsql
963 AS $$
964 DECLARE
965  ret MADLIB_SCHEMA.kmeans_result;
966 BEGIN
967  ret = MADLIB_SCHEMA.kmeans(
968  $1, $2,
969  MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
970  $4, 'MADLIB_SCHEMA.avg', 20, 0.001);
971  RETURN ret;
972 END
973 $$;
974 
975 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
976  rel_source VARCHAR,
977  expr_point VARCHAR,
978  k INTEGER
979 ) RETURNS MADLIB_SCHEMA.kmeans_result
980 VOLATILE
981 STRICT
982 LANGUAGE plpgsql
983 AS $$
984 DECLARE
985  ret MADLIB_SCHEMA.kmeans_result;
986 BEGIN
987  ret = MADLIB_SCHEMA.kmeans(
988  $1, $2,
989  MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
990  'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001);
991  RETURN ret;
992 END
993 $$;
994 
995 /**
996  * @internal
997  * @brief Execute a SQL command where $1, ..., $6 are substituted with the
998  * given arguments.
999  */
1000 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_args(
1001  sql VARCHAR, rel_source VARCHAR, expr_point VARCHAR,
1002  fn_dist VARCHAR, agg_centroid VARCHAR, max_num_iterations INTEGER,
1003  min_frac_reassigned DOUBLE PRECISION
1004 ) RETURNS MADLIB_SCHEMA.kmeans_result
1005 VOLATILE
1006 CALLED ON NULL INPUT
1007 LANGUAGE c
1008 AS 'MODULE_PATHNAME', 'exec_sql_using';
1009 
1010 /**
1011  * @internal
1012  * @brief Filter out the invalid data points in the original input relation
1013  */
1014 CREATE FUNCTION MADLIB_SCHEMA.__filter_input_relation(
1015  rel_source VARCHAR, expr_point VARCHAR)
1016 RETURNS VARCHAR
1017 AS $$
1018 DECLARE
1019  oldClientMinMessages VARCHAR;
1020  rel_source_filtered VARCHAR;
1021 BEGIN
1022  IF (SELECT position('.' in rel_source)) > 0 THEN
1023  rel_source_filtered := '_madlib_' || split_part(rel_source, '.', 2) || '_filtered';
1024  ELSE
1025  rel_source_filtered := '_madlib_' || rel_source || '_filtered';
1026  END IF;
1027 
1028  oldClientMinMessages :=
1029  (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
1030  EXECUTE 'SET client_min_messages TO warning';
1031  EXECUTE 'DROP VIEW IF EXISTS _madlib_'||rel_source_filtered||'_filtered';
1032  EXECUTE 'DROP VIEW IF EXISTS '||rel_source_filtered;
1033  EXECUTE 'CREATE TEMP VIEW '||rel_source_filtered||'
1034  AS SELECT * FROM '||rel_source||'
1035  WHERE abs(
1036  coalesce(
1037  MADLIB_SCHEMA.svec_elsum('||expr_point||'),
1038  ''Infinity''::FLOAT8
1039  )
1040  ) < ''Infinity''::FLOAT8';
1041  EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
1042  RETURN rel_source_filtered;
1043  EXCEPTION
1044  WHEN undefined_function THEN
1045  RAISE EXCEPTION 'Point coordinates (%) are not a valid type
1046  (SVEC, FLOAT[], or INTEGER[]).', expr_point;
1047 END
1048 $$
1049 VOLATILE STRICT
1050 LANGUAGE PLPGSQL;
1051 
1052 /**
1053  * @brief Perform Lloyd's k-means local-search heuristic, but with initial
1054  * centroids stored in a table
1055  *
1056  * This is a shortcut for running k-means with initial centroids stored in a
1057  * table (as opposed to an array of centroids). It is equivalent
1058  * to
1059  * <pre>SELECT \ref kmeans(
1060  rel_source,
1061  expr_point,
1062  (SELECT \ref matrix_agg($expr_centroid) FROM $rel_initial_centroids),
1063  fn_dist,
1064  agg_centroid,
1065  max_num_iterations,
1066  min_frac_reassigned
1067 )</pre>
1068  * where <tt>$expr_centroid</tt> and <tt>$rel_initial_centroids</tt> denote
1069  * textual substituions.
1070  */
1071 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
1072  rel_source VARCHAR,
1073  expr_point VARCHAR,
1074  rel_initial_centroids VARCHAR,
1075  expr_centroid VARCHAR,
1076  fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
1077  agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
1078  max_num_iterations INTEGER /*+ DEFAULT 20 */,
1079  min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
1080 ) RETURNS MADLIB_SCHEMA.kmeans_result
1081 VOLATILE
1082 STRICT
1083 LANGUAGE plpgsql
1084 AS $$
1085 DECLARE
1086  class_rel_initial_centroids REGCLASS;
1087  theResult MADLIB_SCHEMA.kmeans_result;
1088 BEGIN
1089  class_rel_initial_centroids := rel_initial_centroids;
1090  SELECT * FROM MADLIB_SCHEMA.internal_execute_using_kmeans_args($sql$
1091  SELECT MADLIB_SCHEMA.kmeans(
1092  $1, $2,
1093  (
1094  SELECT MADLIB_SCHEMA.matrix_agg(($sql$ || expr_centroid || $sql$)::FLOAT8[])
1095  FROM $sql$ || textin(regclassout(class_rel_initial_centroids))
1096  || $sql$
1097  ),
1098  $3, $4, $5, $6)
1099  $sql$,
1100  rel_source, expr_point,
1101  fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned)
1102  INTO theResult;
1103  RETURN theResult;
1104 END;
1105 $$;
1106 
1107 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
1108  rel_source VARCHAR,
1109  expr_point VARCHAR,
1110  rel_initial_centroids VARCHAR,
1111  expr_centroid VARCHAR,
1112  fn_dist VARCHAR,
1113  agg_centroid VARCHAR,
1114  max_num_iterations INTEGER
1115 ) RETURNS MADLIB_SCHEMA.kmeans_result
1116 VOLATILE
1117 STRICT
1118 LANGUAGE sql AS $$
1119  SELECT MADLIB_SCHEMA.kmeans(
1120  $1, $2,
1121  $3, $4, $5, $6, $7, 0.001)
1122 $$;
1123 
1124 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
1125  rel_source VARCHAR,
1126  expr_point VARCHAR,
1127  rel_initial_centroids VARCHAR,
1128  expr_centroid VARCHAR,
1129  fn_dist VARCHAR,
1130  agg_centroid VARCHAR
1131 ) RETURNS MADLIB_SCHEMA.kmeans_result
1132 VOLATILE
1133 STRICT
1134 LANGUAGE sql AS $$
1135  SELECT MADLIB_SCHEMA.kmeans(
1136  $1, $2,
1137  $3, $4, $5, $6, 20, 0.001)
1138 $$;
1139 
1140 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
1141  rel_source VARCHAR,
1142  expr_point VARCHAR,
1143  rel_initial_centroids VARCHAR,
1144  expr_centroid VARCHAR,
1145  fn_dist VARCHAR
1146 ) RETURNS MADLIB_SCHEMA.kmeans_result
1147 VOLATILE
1148 STRICT
1149 LANGUAGE sql AS $$
1150  SELECT MADLIB_SCHEMA.kmeans(
1151  $1, $2,
1152  $3, $4, $5, 'MADLIB_SCHEMA.avg', 20, 0.001)
1153 $$;
1154 
1155 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
1156  rel_source VARCHAR,
1157  expr_point VARCHAR,
1158  rel_initial_centroids VARCHAR,
1159  expr_centroid VARCHAR
1160 ) RETURNS MADLIB_SCHEMA.kmeans_result
1161 VOLATILE
1162 STRICT
1163 LANGUAGE sql AS $$
1164  SELECT MADLIB_SCHEMA.kmeans(
1165  $1, $2,
1166  $3, $4,
1167  'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001)
1168 $$;
1169 
1170 
1171 /**
1172  * @internal
1173  * @brief Execute a SQL command where $1, ..., $3 are substituted with the
1174  * given arguments.
1175  */
1176 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_silhouette_args(
1177  sql VARCHAR, centroids DOUBLE PRECISION[][], fn_dist REGPROC
1178 ) RETURNS DOUBLE PRECISION
1179 STABLE
1180 CALLED ON NULL INPUT
1181 LANGUAGE c
1182 AS 'MODULE_PATHNAME', 'exec_sql_using';
1183 
1184 /**
1185  * @brief Compute a simplified version of the silhouette coefficient
1186  *
1187  * @param rel_source Name of the relation containing input points
1188  * @param expr_point Expression evaluating to point coordinates \f$ x_i \f$ for
1189  * each tuple
1190  * @param centroids Matrix \f$ M = (\vec{m_0} \dots \vec{m_{k-1}})
1191  * \in \mathbb{R}^{d \times k} \f$ with \f$ k \f$ columns, where column
1192  * \f$ i \f$ contains the position of centroid \f$ i \f$.
1193  * @param fn_dist Name of a function with signature
1194  * <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that
1195  * returns the distance between two points
1196  * @return For each point \f$ x_i \f$, let
1197  * \f$ d_1( x_i ) \f$ and \f$ d_2( x_i ) \f$ be the distance to the closest
1198  * and 2nd-closest centroid, respectively. If there is more than one
1199  * closest centroids then \f$ d_1( x_i ) = d_2( x_i )\f$.
1200  * The return value is the average, over all points \f$ x_i \f$, of
1201  * \f[
1202  * \frac{d_2( x_i ) - d_1(x_i)}{d_2(x_i)},
1203  * \f]
1204  * where 0/0 is interpreted as 0.
1205  * Clearly, the simplified silhouette coefficient assumes values in
1206  * \f$ [0,1] \f$.
1207  */
1208 CREATE FUNCTION MADLIB_SCHEMA.simple_silhouette(
1209  rel_source VARCHAR,
1210  expr_point VARCHAR,
1211  centroids DOUBLE PRECISION[][],
1212  fn_dist VARCHAR /*+ DEFAULT 'dist_norm2' */
1213 ) RETURNS DOUBLE PRECISION
1214 STABLE
1215 STRICT
1216 LANGUAGE plpgsql
1217 AS $$
1218 DECLARE
1219  class_rel_source REGCLASS;
1220  proc_fn_dist REGPROCEDURE;
1221  rel_filtered VARCHAR;
1222 BEGIN
1223  IF (array_upper(centroids,1) IS NULL) THEN
1224  RAISE EXCEPTION 'No valid centroids given.';
1225  END IF;
1226 
1227  rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
1228  class_rel_source := rel_filtered;
1229  proc_fn_dist := fn_dist
1230  || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
1231  IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
1232  FROM pg_proc WHERE oid = proc_fn_dist) THEN
1233  RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
1234  END IF;
1235 
1236  RETURN MADLIB_SCHEMA.internal_execute_using_silhouette_args($sql$
1237  SELECT
1238  avg(CASE
1239  WHEN distances[2] = 0 THEN 0
1240  ELSE (distances[2] - distances[1]) / distances[2]
1241  END)
1242  FROM (
1243  SELECT
1244  (MADLIB_SCHEMA.closest_columns(
1245  $1,
1246  ($sql$ || expr_point || $sql$)::FLOAT8[],
1247  2::INTEGER,
1248  $2
1249  )).distances
1250  FROM
1251  $sql$ || textin(regclassout(class_rel_source)) || $sql$
1252  ) AS two_shortest_distances
1253  $sql$,
1254  centroids, proc_fn_dist);
1255 END;
1256 $$;
1257 
1258 CREATE FUNCTION MADLIB_SCHEMA.simple_silhouette(
1259  rel_source VARCHAR,
1260  expr_point VARCHAR,
1261  centroids DOUBLE PRECISION[][]
1262 ) RETURNS DOUBLE PRECISION
1263 STABLE
1264 STRICT
1265 LANGUAGE sql
1266 AS $$
1267  SELECT MADLIB_SCHEMA.simple_silhouette($1, $2, $3,
1268  'MADLIB_SCHEMA.dist_norm2')
1269 $$;