User Documentation
kmeans.sql_in
Go to the documentation of this file.
00001 /* ----------------------------------------------------------------------- *//**
00002  *
00003  * @file kmeans.sql_in
00004  *
00005  * @brief Set of functions for k-means clustering.
00006  *
00007  * @sa For a brief introduction to k-means clustering, see the module
00008  *     description \ref grp_kmeans.
00009  *
00010  *//* ----------------------------------------------------------------------- */
00011 
00012 m4_include(`SQLCommon.m4')
00013 
00014 /**
00015 @addtogroup grp_kmeans
00016 
00017 @about
00018 
00019 Clustering refers to the problem of partitioning a set of objects according to
00020 some problem-dependent measure of <em>similarity</em>. In the k-means variant,
00021 one is given \f$ n \f$ points \f$ x_1, \dots, x_n \in \mathbb R^d \f$, and the
00022 goal is to position \f$ k \f$ centroids \f$ c_1, \dots, c_k \in \mathbb R^d \f$
00023 so that the sum of \em distances between each point and its closest centroid is
00024 minimized. Each centroid represents a cluster that consists of all points to
00025 which this centroid is closest. Formally, we wish to minimize the following
00026 objective function:
00027 \f[
00028     (c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j)
00029 \f]
00030 In the most common case, \f$ \operatorname{dist} \f$ is the square of the
00031 Euclidean distance.
00032 
00033 This problem is computationally difficult (NP-hard), yet the
00034 local-search heuristic proposed by Lloyd [4] performs reasonably well in
00035 practice. In fact, it is so ubiquitous today that it is
00036 often referred to as the <em>standard algorithm</em> or even just the
00037 <em>k-means algorithm</em> [1]. It works as follows:
00038 
00039 -# Seed the \f$ k \f$ centroids (see below)
00040 -# Repeat until convergence:
00041  -# Assign each point to its closest centroid
00042  -# Move each centroid to a position that minimizes the sum of distances in this
00043     cluster
00044 -# Convergence is achieved when no points change their assignments during step
00045    2a.
00046 
00047 Since the objective function decreases in every step, this algorithm is
00048 guaranteed to converge to a local optimum.
00049 
00050 @implementation
00051 
00052 Data points and predefined centroids (if used) are expected to be stored row-wise,
00053 in a column of type <tt>\ref grp_svec "SVEC"</tt> (or any type convertible to
00054 <tt>\ref grp_svec "SVEC"</tt>, like <tt>FLOAT[]</tt> or <tt>INTEGER[]</tt>).
00055 Data points with non-finite values (NULL, NaN, infinity) in any component
00056 will be skipped during analysis.
00057 
00058 The following methods are available for the centroid seeding:
00059  - <strong>random selection</strong>:
00060    Select \f$ k \f$ centroids randomly among the input points.
00061  - <strong>kmeans++</strong> [2]:
00062    Start with a single centroid chosen randomly among the input points. Then
00063    iteratively choose new
00064    centroids from the input points until there is a total of \f$ k \f$
00065    centroids. The probability for picking a particular point is proportional to
00066    its minimum distance to any existing centroid.
00067    \n
00068    Intuitively, kmeans++ favors seedings where centroids are spread out over the
00069    whole range of the input points, while at the same time not being too
00070    susceptible to outliers [2].
00071  - <strong>user-specified set of initial centroids</strong>:
00072    See below for a description of the expected format of the set of initial
00073    centroids.
00074 
00075 The following distance functions can be used (computation of barycenter/mean in parentheses):
00076  - <strong>\ref dist_norm1</strong>:  1-norm/Manhattan (element-wise median [Note that MADlib does not provide a median
00077    aggregate function for support and performance reasons.])
00078  - <strong>\ref dist_norm2</strong>: 2-norm/Euclidean (element-wise mean)
00079  - <strong>\ref squared_dist_norm2</strong>: squared Euclidean distance (element-wise mean)
00080  - <strong>\ref dist_angle</strong>: angle (element-wise mean of normalized points)
00081  - <strong>\ref dist_tanimoto</strong>: tanimoto (element-wise mean of normalized points [5])
00082  - <strong>user defined function</strong> with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION
00083 
00084 The following aggregate functions for determining centroids can be used:
00085  - <strong>\ref avg</strong>: average
00086  - <strong>\ref normalized_avg</strong>: normalized average
00087 
00088 The algorithm stops when one of the following conditions is met:
00089  - The fraction of updated points is smaller than the convergence threshold (default: 0.001).
00090  - The algorithm reaches the maximum number of allowed iterations (default: 20).
00091 
00092 A popular method to assess the quality of the clustering is the
00093 <em>silhouette coefficient</em>, a simplified version of which is provided as part of the k-means module.
00094 Note that for large data sets, this computation is expensive.
00095 
00096 @input
00097 The <strong>source relation</strong> is expected to be of the following form (or
00098 to be implicitly convertible into the following form):
00099 <pre>{TABLE|VIEW} <em>rel_source</em> (
00100     ...
00101     <em>expr_points</em> FLOAT8[],
00102     ...
00103 )</pre>
00104 where:
00105  - <em>expr_points</em> is the name of a column with point coordinates.
00106    Types such as \c svec or <tt>INTEGER[]</tt> are implicitly converted to
00107    <tt>FLOAT8[]</tt>.
00108 
00109 If kmeans is called with a set of initial centroids, the centroid relation is
00110 expected to be of the following form:
00111 <pre>{TABLE|VIEW} <em>rel_initial_centroids</em> (
00112     ...
00113     <em>expr_centroid</em> DOUBLE PRECISION[],
00114     ...
00115 )</pre>
00116 where:
00117  - <em>expr_centroid</em> is the name of a column with coordinates.
00118 
00119 @usage
00120 The k-means algorithm can be invoked in four possible ways:
00121 
00122 - using <em>random</em> centroid seeding method for a
00123 provided \f$ k \f$:
00124 <pre>SELECT * FROM \ref kmeans_random(
00125   '<em>rel_source</em>', '<em>expr_point</em>', k,
00126   [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
00127   <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
00128 );</pre>
00129 
00130 - using <em>kmeans++</em> centroid seeding method for a
00131 provided \f$ k \f$:
00132 <pre>SELECT * FROM \ref kmeanspp(
00133   '<em>rel_source</em>', '<em>expr_point</em>', k,
00134   [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
00135   <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
00136 );</pre>
00137 
00138 - with a provided centroid set:
00139 <pre>SELECT * FROM \ref kmeans(
00140   '<em>rel_source</em>', '<em>expr_point</em>',
00141   '<em>rel_initial_centroids</em>', '<em>expr_centroid</em>',
00142   [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
00143   <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
00144 );</pre>
00145 ------------ OR ---------------
00146 <pre>SELECT * FROM \ref kmeans(
00147   '<em>rel_source</em>', '<em>expr_point</em>',
00148   initial_centroids,
00149   [ '<em>fn_dist</em>', '<em>agg_centroid</em>',
00150   <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ]
00151 );</pre>
00152 where:
00153  - <em>initial_centroids</em> is of type <tt>DOUBLE PRECISION[][]</tt>.
00154 
00155 The output of the k-means module is a table that includes the final
00156 centroid positions (DOUBLE PRECISION[][]), the objective function,
00157 the fraction of reassigned points in the last iteration, and
00158 the number of total iterations:
00159 <pre>
00160             centroids             |   objective_fn   | frac_reassigned | num_iterations
00161 ----------------------------------+------------------+-----------------+----------------
00162              ...
00163 </pre>
00164 
00165 @examp
00166 
00167 -#  Prepare some input data.
00168 \code
00169 sql> SELECT * FROM public.km_sample LIMIT 5;
00170                   points
00171 -------------------------------------------
00172  {1,1}:{15.8822241332382,105.945462542586}
00173  {1,1}:{34.5065216883086,72.3126099305227}
00174  {1,1}:{22.5074400822632,95.3209559689276}
00175  {1,1}:{70.2589857042767,68.7395178806037}
00176  {1,1}:{30.9844257542863,25.3213323024102}
00177 (5 rows)
00178 \endcode
00179 Note: the example <em>points</em> is type <tt>\ref grp_svec "SVEC"</tt>.
00180 -#  Run k-means clustering using kmeans++ for centroid seeding:
00181 \code
00182 sql> SELECT * FROM madlib.kmeanspp('km_sample', 'points', 2, 'madlib.squared_dist_norm2', 'madlib.avg', 20, 0.001);
00183 );
00184                                 centroids                                |   objective_fn   | frac_reassigned | num_iterations
00185 -------------------------------------------------------------------------+------------------+-----------------+----------------
00186  {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 |           0.001 |              5
00187 \endcode
00188 -# Calculate the simplified silhouette coefficient:
00189 \code
00190 sql> SELECT * from madlib.simple_silhouette('km_test_svec','points',
00191      (select centroids from madlib.kmeanspp('km_test_svec','points',2,'madlib.squared_dist_norm2','madlib.avg',20,0.001)),
00192      'madlib.dist_norm2');
00193  simple_silhouette
00194 -------------------
00195  0.611022970398174
00196 \endcode
00197 
00198 @literature
00199 
00200 [1] Wikipedia, K-means Clustering,
00201     http://en.wikipedia.org/wiki/K-means_clustering
00202 
00203 [2] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful
00204     seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete
00205     Algorithms (SODA'07), pp. 1027-1035,
00206     http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf
00207 
00208 [3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering
00209     Gene-Expression Data: A Hybrid Approach that Iterates Between k-Means and
00210     Evolutionary Search. In: Studies in Computational Intelligence - Hybrid
00211     Evolutionary Algorithms. pp. 313-335. Springer. 2007.
00212 
00213 [4] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell
00214     Laboratories. Published much later in: IEEE Transactions on Information
00215     Theory 28(2), pp. 128-137. 1982.
00216 
00217 [5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis.  In: Computational
00218     Statistics and Data Analysis, 51(2). pp. 526-544. 2006.
00219 
00220 @sa File kmeans.sql_in documenting the SQL functions.
00221 
00222 @internal
00223 @sa namespace kmeans (documenting the implementation in Python)
00224 @endinternal
00225 */
00226 
00227 /*
00228  * @brief k-Means return type
00229  *
00230  * A composite value:
00231  *  - <tt>centroids</tt> - Matrix containing the new \f$ l \leq k \f$
00232  *    repositioned centroids as columns. If this matrix has \f$ l < k \f$
00233  *    columns, one or more old centroids no longer were closest to any point.
00234  *  - <tt>old_centroid_its</tt> - The order of the centroids in
00235  *    <tt>centroid</tt> is not guaranteed to be consitent across iterations.
00236  *    In particular, if a centroid is no longer closest to any point it can be
00237  *    dropped and a new centroid is added afterwards. We therefore need to map
00238  *    positions in <tt>centroids</tt> to the respective positions in the
00239  *    previous iteration.
00240  *  - <tt>objective_fn</tt> - Value of the objective function, i.e.,
00241  *    \f$ \sum_{x \in P} \dist(x, C)^2 \f$ where
00242  *    \f$ P \f$ is the set of points, \f$ C \f$ is the set of centroids, and
00243  *    \f$ \dist(x, C) := \min_{c \in C} \operatorname{dist}(x, c) \f$.
00244  *  - <tt>frac_reassigned</tt> - Fraction of points that was assigned a
00245  *    different centroid in the current iteration.
00246  *  - <tt>num_iterations</tt> - Number of iterations performed (so far).
00247  */
00248 CREATE TYPE MADLIB_SCHEMA.kmeans_result AS (
00249     centroids DOUBLE PRECISION[][],
00250     objective_fn DOUBLE PRECISION,
00251     frac_reassigned DOUBLE PRECISION,
00252     num_iterations INTEGER
00253 );
00254 
00255 /*
00256  * @brief k-Means inter-iteration state type
00257  *
00258  * A composite value like \ref{kmeans_result}. Additional fields:
00259  *  - <tt>old_centroid_its</tt> - The order of the centroids in
00260  *    <tt>centroid</tt> is not guaranteed to be consitent across iterations.
00261  *    In particular, if a centroid is no longer closest to any point it can be
00262  *    dropped and a new centroid is added afterwards. We therefore need to map
00263  *    positions in <tt>centroids</tt> to the respective positions in the
00264  *    previous iteration.
00265  *  - <tt>num_iterations</tt> - Number of iterations performed (so far).
00266  */
00267 CREATE TYPE MADLIB_SCHEMA.kmeans_state AS (
00268     centroids DOUBLE PRECISION[][],
00269     old_centroid_ids INTEGER[],
00270     objective_fn DOUBLE PRECISION,
00271     frac_reassigned DOUBLE PRECISION
00272 );
00273 
00274 /**
00275  * @internal
00276  * @brief Execute a SQL command where $1, ..., $4 are substituted with the
00277   *     given arguments.
00278  */
00279 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_args(
00280     sql VARCHAR, DOUBLE PRECISION[][], REGPROC, INTEGER, DOUBLE PRECISION
00281 ) RETURNS VOID
00282 VOLATILE
00283 CALLED ON NULL INPUT
00284 LANGUAGE c
00285 AS 'MODULE_PATHNAME', 'exec_sql_using';
00286 
00287 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans(
00288     rel_args VARCHAR,
00289     rel_state VARCHAR,
00290     rel_source VARCHAR,
00291     expr_point VARCHAR,
00292     agg_centroid VARCHAR)
00293 RETURNS INTEGER
00294 VOLATILE
00295 LANGUAGE plpythonu
00296 AS $$PythonFunction(kmeans, kmeans, compute_kmeans)$$;
00297 
00298 /**
00299  * @brief Perform Lloyd's k-means local-search heuristic
00300  *
00301  * @param rel_source Name of the relation containing input points
00302  * @param expr_point Expression evaluating to point coordinates for each tuple
00303  * @param initial_centroids Matrix containing the initial centroids as columns
00304  * @param fn_dist Name of a function with signature
00305  *     <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that
00306  *     returns the distance between two points. The default is the
00307  *     \ref squared_dist_norm2(float8[],float8[]) "squared Euclidean distance".
00308  * @param agg_centroid Name of an aggregate function with signature
00309  *     <tt>DOUBLE PRECISION[] -> DOUBLE PRECISION[]</tt> that, for each group
00310  *     of points, returns a centroid. In order for Lloyd's local-search
00311  *     heuristic to provably converge and to return a local minimum, this
00312  *     centroid should minimize the sum of distances between each point in the
00313  *     group and the centroid. The default is the
00314  *     \ref avg(float8[]) "average (mean/barycenter in Euclidean space)",
00315  *     which satisfies this property if <tt>fn_dist = 'squared_dist_norm2'</tt>.
00316  * @param max_num_iterations Maximum number of iterations
00317  * @param min_frac_reassigned Fraction of reassigned points below which
00318  *     convergence is assumed and the algorithm terminates
00319  * @returns A composite value:
00320  *  - <tt>centroids</tt> - Matrix with \f$ k \f$ centroids as columns.
00321  *  - <tt>frac_reassigned</tt> - Fraction of points that were assigned a
00322  *    different centroid in the last iteration.
00323  *  - <tt>num_iterations</tt> - The number of iterations before the
00324  *    algorithm terminated
00325  */
00326 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
00327     rel_source VARCHAR,
00328     expr_point VARCHAR,
00329     initial_centroids DOUBLE PRECISION[][],
00330     fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
00331     agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
00332     max_num_iterations INTEGER /*+ DEFAULT 20 */,
00333     min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
00334 ) RETURNS MADLIB_SCHEMA.kmeans_result AS $$
00335 DECLARE
00336     theIteration INTEGER;
00337     theResult MADLIB_SCHEMA.kmeans_result;
00338     oldClientMinMessages VARCHAR;
00339     class_rel_source REGCLASS;
00340     proc_fn_dist REGPROCEDURE;
00341     proc_agg_centroid REGPROCEDURE;
00342     rel_filtered VARCHAR;
00343     num_points INTEGER;
00344     k INTEGER;
00345     centroids FLOAT8[];
00346 BEGIN
00347     IF (array_upper(initial_centroids,1) IS NULL) THEN
00348     RAISE EXCEPTION 'No valid initial centroids given.';
00349     END IF;
00350 
00351     centroids := ARRAY(SELECT unnest(initial_centroids));
00352     IF (SELECT MADLIB_SCHEMA.svec_elsum(centroids)) >= 'Infinity'::float THEN
00353         RAISE EXCEPTION 'At least one initial centroid has non-finite values.';
00354     END IF;
00355 
00356     rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
00357     class_rel_source := rel_filtered;
00358     proc_fn_dist := fn_dist
00359         || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
00360     IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
00361         FROM pg_proc WHERE oid = proc_fn_dist) THEN
00362         RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
00363     END IF;
00364     proc_agg_centroid := agg_centroid || '(DOUBLE PRECISION[])';
00365     IF (SELECT prorettype != 'DOUBLE PRECISION[]'::regtype OR proisagg = FALSE
00366         FROM pg_proc WHERE oid = proc_agg_centroid) THEN
00367         RAISE EXCEPTION 'Mean aggregate has wrong signature or is not an aggregate.';
00368     END IF;
00369     IF (min_frac_reassigned < 0) OR (min_frac_reassigned > 1) THEN
00370         RAISE EXCEPTION 'Convergence threshold is not a valid value (must be a fraction between 0 and 1).';
00371     END IF;
00372     IF (max_num_iterations < 0) THEN
00373         RAISE EXCEPTION 'Number of iterations must be a non-negative integer.';
00374     END IF;
00375 
00376     -- Extra parameter check added so that ERROR output is more user-readable (doesn't include Python traceback)
00377     k := array_upper(initial_centroids,1);
00378     IF (k <= 0) THEN
00379         RAISE EXCEPTION 'Number of clusters k must be a positive integer.';
00380     END IF;
00381     IF (k > 32767) THEN
00382     RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
00383     END IF;
00384     EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points ;
00385     IF (num_points < k) THEN
00386     RAISE EXCEPTION 'Number of centroids is greater than number of points.';
00387     END IF;
00388 
00389     -- We first setup the argument table. Rationale: We want to avoid all data
00390     -- conversion between native types and Python code. Instead, we use Python
00391     -- as a pure driver layer.
00392     PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
00393     oldClientMinMessages :=
00394         (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
00395     EXECUTE 'SET client_min_messages TO warning';
00396 
00397     -- Unfortunately, the EXECUTE USING syntax is only available starting
00398     -- PostgreSQL 8.4:
00399     -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
00400     -- We therefore have to emulate.
00401     PERFORM MADLIB_SCHEMA.internal_execute_using_kmeans_args($sql$
00402         DROP TABLE IF EXISTS pg_temp._madlib_kmeans_args;
00403         CREATE TABLE pg_temp._madlib_kmeans_args AS
00404         SELECT
00405             $1 AS initial_centroids, array_upper($1, 1) AS k,
00406             $2 AS fn_dist, $3 AS max_num_iterations,
00407             $4 AS min_frac_reassigned;
00408         $sql$,
00409         initial_centroids, proc_fn_dist, max_num_iterations,
00410         min_frac_reassigned);
00411     EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
00412 
00413     -- Perform acutal computation.
00414     -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion
00415     -- operators from regclass to varchar/text.
00416     theIteration := MADLIB_SCHEMA.internal_compute_kmeans('_madlib_kmeans_args',
00417             '_madlib_kmeans_state',
00418             textin(regclassout(class_rel_source)), expr_point,
00419             textin(regprocout(proc_agg_centroid)));
00420 
00421     -- Retrieve result from state table and return it
00422     EXECUTE
00423         $sql$
00424         SELECT (_state).centroids, (_state).objective_fn,
00425             (_state).frac_reassigned, NULL
00426         FROM _madlib_kmeans_state
00427         WHERE _iteration = $sql$ || theIteration || $sql$
00428         $sql$
00429         INTO theResult;
00430     -- The number of iterations are not updated in the C++ code. We do it here.
00431     IF NOT (theResult IS NULL) THEN
00432         theResult.num_iterations = theIteration;
00433     END IF;
00434     RETURN theResult;
00435 END;
00436 $$ LANGUAGE plpgsql VOLATILE;
00437 
00438 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
00439     rel_source VARCHAR,
00440     expr_point VARCHAR,
00441     initial_centroids DOUBLE PRECISION[][],
00442     fn_dist VARCHAR,
00443     agg_centroid VARCHAR,
00444     max_num_iterations INTEGER
00445 ) RETURNS MADLIB_SCHEMA.kmeans_result
00446 VOLATILE
00447 STRICT
00448 LANGUAGE sql AS $$
00449     SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, $5, $6, 0.001)
00450 $$;
00451 
00452 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
00453     rel_source VARCHAR,
00454     expr_point VARCHAR,
00455     initial_centroids DOUBLE PRECISION[][],
00456     fn_dist VARCHAR,
00457     agg_centroid VARCHAR
00458 ) RETURNS MADLIB_SCHEMA.kmeans_result
00459 VOLATILE
00460 STRICT
00461 LANGUAGE sql AS $$
00462     SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, $5, 20, 0.001)
00463 $$;
00464 
00465 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
00466     rel_source VARCHAR,
00467     expr_point VARCHAR,
00468     initial_centroids DOUBLE PRECISION[][],
00469     fn_dist VARCHAR
00470 ) RETURNS MADLIB_SCHEMA.kmeans_result
00471 VOLATILE
00472 STRICT
00473 LANGUAGE sql AS $$
00474     SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, 'MADLIB_SCHEMA.avg', 20,
00475         0.001)
00476 $$;
00477 
00478 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
00479     rel_source VARCHAR,
00480     expr_point VARCHAR,
00481     initial_centroids DOUBLE PRECISION[][]
00482 ) RETURNS MADLIB_SCHEMA.kmeans_result
00483 VOLATILE
00484 STRICT
00485 LANGUAGE sql AS $$
00486     SELECT MADLIB_SCHEMA.kmeans($1, $2, $3,
00487         'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001)
00488 $$;
00489 
00490 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeanspp_seeding_args(
00491     sql VARCHAR, INTEGER, REGPROC, DOUBLE PRECISION[][]
00492 ) RETURNS VOID
00493 VOLATILE
00494 CALLED ON NULL INPUT
00495 LANGUAGE c
00496 AS 'MODULE_PATHNAME', 'exec_sql_using';
00497 
00498 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeanspp_seeding(
00499     rel_args VARCHAR,
00500     rel_state VARCHAR,
00501     rel_source VARCHAR,
00502     expr_point VARCHAR)
00503 RETURNS INTEGER
00504 AS $$PythonFunction(kmeans, kmeans, compute_kmeanspp_seeding)$$
00505 LANGUAGE plpythonu VOLATILE;
00506 
00507 /**
00508  * @brief k-Means++ Seeding
00509  *
00510  * @param rel_source Name of the relation containing input points
00511  * @param expr_point Expression evaluating to point coordinates for each tuple
00512  * @param k Number of centroids
00513  * @param fn_dist Name of a function with signature
00514  *     <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that
00515  *     returns the distance between two points
00516  * @param initial_centroids A matrix containing up to \f$ k \f$ columns as
00517  *     columns. kmeanspp_seeding() proceeds exactly as if these centroids had
00518  *     already been generated in previous iterations. This parameter may be
00519  *     NULL in which all \f$ k \f$ centroids will be generated.
00520  * @returns A matrix containing \f$ k \f$ centroids as columns
00521  */
00522 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding(
00523     rel_source VARCHAR,
00524     expr_point VARCHAR,
00525     k INTEGER,
00526     fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
00527     initial_centroids DOUBLE PRECISION[][] /*+ DEFAULT NULL */
00528 ) RETURNS DOUBLE PRECISION[][] AS $$
00529 DECLARE
00530     theIteration INTEGER;
00531     theResult DOUBLE PRECISION[][];
00532     oldClientMinMessages VARCHAR;
00533     class_rel_source REGCLASS;
00534     proc_fn_dist REGPROCEDURE;
00535     num_points INTEGER;
00536     num_centroids INTEGER;
00537     rel_filtered VARCHAR;
00538 BEGIN
00539     rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
00540     class_rel_source := rel_filtered;
00541 
00542     IF (initial_centroids IS NOT NULL) THEN
00543     num_centroids := array_upper(initial_centroids,1);
00544     ELSE
00545     num_centroids := k;
00546     END IF;
00547 
00548     proc_fn_dist := fn_dist
00549         || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
00550     IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
00551         FROM pg_proc WHERE oid = proc_fn_dist) THEN
00552         RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
00553     END IF;
00554     IF (k <= 0) THEN
00555         RAISE EXCEPTION 'Number of clusters k must be a positive integer.';
00556     END IF;
00557     IF (k > 32767) THEN
00558     RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
00559     END IF;
00560     EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points ;
00561     IF (num_points < k OR num_points < num_centroids) THEN
00562     RAISE EXCEPTION 'Number of centroids is greater than number of points.';
00563     END IF;
00564     IF (k < num_centroids) THEN
00565     RAISE WARNING 'Number of clusters k is less than number of supplied initial centroids. Number of final clusters will equal number of supplied initial centroids.';
00566     END IF;
00567 
00568     -- We first setup the argument table. Rationale: We want to avoid all data
00569     -- conversion between native types and Python code. Instead, we use Python
00570     -- as a pure driver layer.
00571     oldClientMinMessages :=
00572         (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
00573     EXECUTE 'SET client_min_messages TO warning';
00574     PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
00575     -- Unfortunately, the EXECUTE USING syntax is only available starting
00576     -- PostgreSQL 8.4:
00577     -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
00578     -- We therefore have to emulate.
00579     PERFORM MADLIB_SCHEMA.internal_execute_using_kmeanspp_seeding_args($sql$
00580         DROP TABLE IF EXISTS pg_temp._madlib_kmeanspp_args;
00581         CREATE TEMPORARY TABLE _madlib_kmeanspp_args AS
00582         SELECT $1 AS k, $2 AS fn_dist, $3 AS initial_centroids;
00583         $sql$,
00584         k, proc_fn_dist, initial_centroids);
00585     EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
00586 
00587     -- Perform acutal computation.
00588     -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion
00589     -- operators from regclass to varchar/text.
00590     theIteration := (
00591         SELECT MADLIB_SCHEMA.internal_compute_kmeanspp_seeding(
00592             '_madlib_kmeanspp_args', '_madlib_kmeanspp_state',
00593             textin(regclassout(class_rel_source)), expr_point)
00594     );
00595 
00596     -- Retrieve result from state table and return it
00597     EXECUTE
00598         $sql$
00599         SELECT _state FROM _madlib_kmeanspp_state
00600         WHERE _iteration = $sql$ || theIteration || $sql$
00601         $sql$
00602         INTO theResult;
00603     RETURN theResult;
00604 END;
00605 $$ LANGUAGE plpgsql VOLATILE;
00606 
00607 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding(
00608     rel_source VARCHAR,
00609     expr_point VARCHAR,
00610     k INTEGER,
00611     fn_dist VARCHAR
00612 ) RETURNS DOUBLE PRECISION[][]
00613 LANGUAGE sql AS $$
00614     SELECT MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4, NULL)
00615 $$;
00616 
00617 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding(
00618     rel_source VARCHAR,
00619     expr_point VARCHAR,
00620     k INTEGER
00621 ) RETURNS DOUBLE PRECISION[][]
00622 LANGUAGE sql AS $$
00623     SELECT MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3,
00624         'MADLIB_SCHEMA.squared_dist_norm2', NULL)
00625 $$;
00626 
00627 /**
00628  * @brief Run k-Means++.
00629  *
00630  * This is a shortcut for running k-means++. It is equivalent to
00631  * <pre>SELECT \ref kmeans(
00632     rel_source,
00633     expr_point,
00634     \ref kmeanspp_seeding(
00635         rel_source,
00636         expr_point,
00637         k,
00638         fn_dist
00639     ),
00640     fn_dist,
00641     agg_centroid,
00642     max_num_iterations,
00643     min_frac_reassigned
00644 )</pre>
00645  */
00646 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
00647     rel_source VARCHAR,
00648     expr_point VARCHAR,
00649     k INTEGER,
00650     fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
00651     agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
00652     max_num_iterations INTEGER /*+ DEFAULT 20 */,
00653     min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
00654 ) RETURNS MADLIB_SCHEMA.kmeans_result
00655 VOLATILE
00656 STRICT
00657 LANGUAGE plpgsql
00658 AS $$
00659 DECLARE
00660     ret MADLIB_SCHEMA.kmeans_result;
00661 BEGIN
00662     ret = MADLIB_SCHEMA.kmeans(
00663         $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
00664         $4, $5, $6, $7);
00665     RETURN ret;
00666 END
00667 $$;
00668 
00669 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
00670     rel_source VARCHAR,
00671     expr_point VARCHAR,
00672     k INTEGER,
00673     fn_dist VARCHAR,
00674     agg_centroid VARCHAR,
00675     max_num_iterations INTEGER
00676 ) RETURNS MADLIB_SCHEMA.kmeans_result
00677 VOLATILE
00678 STRICT
00679 LANGUAGE plpgsql
00680 AS $$
00681 DECLARE
00682     ret MADLIB_SCHEMA.kmeans_result;
00683 BEGIN
00684     ret = MADLIB_SCHEMA.kmeans(
00685         $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
00686         $4, $5, $6, 0.001);
00687     RETURN ret;
00688 END
00689 $$;
00690 
00691 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
00692     rel_source VARCHAR,
00693     expr_point VARCHAR,
00694     k INTEGER,
00695     fn_dist VARCHAR,
00696     agg_centroid VARCHAR
00697 ) RETURNS MADLIB_SCHEMA.kmeans_result
00698 VOLATILE
00699 STRICT
00700 LANGUAGE plpgsql
00701 AS $$
00702 DECLARE
00703     ret MADLIB_SCHEMA.kmeans_result;
00704 BEGIN
00705     ret = MADLIB_SCHEMA.kmeans(
00706         $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
00707         $4, $5, 20, 0.001);
00708     RETURN ret;
00709 END
00710 $$;
00711 
00712 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
00713     rel_source VARCHAR,
00714     expr_point VARCHAR,
00715     k INTEGER,
00716     fn_dist VARCHAR
00717 ) RETURNS MADLIB_SCHEMA.kmeans_result
00718 VOLATILE
00719 STRICT
00720 LANGUAGE plpgsql
00721 AS $$
00722 DECLARE
00723     ret MADLIB_SCHEMA.kmeans_result;
00724 BEGIN
00725     ret = MADLIB_SCHEMA.kmeans(
00726         $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4),
00727         $4, 'MADLIB_SCHEMA.avg', 20, 0.001);
00728     RETURN ret;
00729 END
00730 $$;
00731 
00732 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp(
00733     rel_source VARCHAR,
00734     expr_point VARCHAR,
00735     k INTEGER
00736 ) RETURNS MADLIB_SCHEMA.kmeans_result
00737 VOLATILE
00738 STRICT
00739 LANGUAGE plpgsql
00740 AS $$
00741 DECLARE
00742     ret MADLIB_SCHEMA.kmeans_result;
00743 BEGIN
00744     ret = MADLIB_SCHEMA.kmeans(
00745         $1, $2,
00746         MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3,
00747             'MADLIB_SCHEMA.squared_dist_norm2'),
00748         'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001);
00749     RETURN ret;
00750 END
00751 $$;
00752 
00753 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_random_seeding_args(
00754     sql VARCHAR, INTEGER, DOUBLE PRECISION[][]
00755 ) RETURNS VOID
00756 VOLATILE
00757 CALLED ON NULL INPUT
00758 LANGUAGE c
00759 AS 'MODULE_PATHNAME', 'exec_sql_using';
00760 
00761 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans_random_seeding(
00762     rel_args VARCHAR,
00763     rel_state VARCHAR,
00764     rel_source VARCHAR,
00765     expr_point VARCHAR)
00766 RETURNS INTEGER
00767 AS $$PythonFunction(kmeans, kmeans, compute_kmeans_random_seeding)$$
00768 LANGUAGE plpythonu VOLATILE;
00769 
00770 /**
00771  * @brief k-Means Random Seeding
00772  *
00773  * @param rel_source Name of the relation containing input points
00774  * @param expr_point Expression evaluating to point coordinates for each tuple
00775  * @param k Number of centroids
00776  * @param initial_centroids A matrix containing up to \f$ k \f$ columns as
00777  *     columns. kmeanspp_seeding() proceeds exactly as if these centroids had
00778  *     already been generated in previous iterations. This parameter may be
00779  *     NULL in which all \f$ k \f$ centroids will be generated.
00780  * @returns A matrix containing \f$ k \f$ centroids as columns
00781  */
00782 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random_seeding(
00783     rel_source VARCHAR,
00784     expr_point VARCHAR,
00785     k INTEGER,
00786     initial_centroids DOUBLE PRECISION[][] /*+ DEFAULT NULL */
00787 ) RETURNS DOUBLE PRECISION[][] AS $$
00788 DECLARE
00789     theIteration INTEGER;
00790     theResult DOUBLE PRECISION[][];
00791     oldClientMinMessages VARCHAR;
00792     class_rel_source REGCLASS;
00793     proc_fn_dist REGPROCEDURE;
00794     num_points INTEGER;
00795     num_centroids INTEGER;
00796     rel_filtered VARCHAR;
00797 BEGIN
00798     rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
00799     class_rel_source := rel_filtered;
00800 
00801     IF (initial_centroids IS NOT NULL) THEN
00802     num_centroids := array_upper(initial_centroids,1);
00803     ELSE
00804     num_centroids := k;
00805     END IF;
00806 
00807     IF (k <= 0) THEN
00808         RAISE EXCEPTION 'Number of clusters k must be a positive integer.';
00809     END IF;
00810     IF (k > 32767) THEN
00811     RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
00812     END IF;
00813     EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points;
00814     IF (num_points < k  OR num_points < num_centroids) THEN
00815     RAISE EXCEPTION 'Number of centroids is greater than number of points.';
00816     END IF;
00817     IF (k < num_centroids) THEN
00818     RAISE WARNING 'Number of clusters k is less than number of supplied initial centroids. Number of final clusters will equal number of supplied initial centroids.';
00819     END IF;
00820 
00821     -- We first setup the argument table. Rationale: We want to avoid all data
00822     -- conversion between native types and Python code. Instead, we use Python
00823     -- as a pure driver layer.
00824     oldClientMinMessages :=
00825         (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
00826     EXECUTE 'SET client_min_messages TO warning';
00827     PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
00828     -- Unfortunately, the EXECUTE USING syntax is only available starting
00829     -- PostgreSQL 8.4:
00830     -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
00831     -- We therefore have to emulate.
00832     PERFORM MADLIB_SCHEMA.internal_execute_using_kmeans_random_seeding_args($sql$
00833         DROP TABLE IF EXISTS pg_temp._madlib_kmeans_random_args;
00834         CREATE TEMPORARY TABLE _madlib_kmeans_random_args AS
00835         SELECT $1 AS k, $2 AS initial_centroids;
00836         $sql$,
00837         k, initial_centroids);
00838     EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
00839 
00840     -- Perform acutal computation.
00841     -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion
00842     -- operators from regclass to varchar/text.
00843     theIteration := (
00844         SELECT MADLIB_SCHEMA.internal_compute_kmeans_random_seeding(
00845             '_madlib_kmeans_random_args', '_madlib_kmeans_random_state',
00846             textin(regclassout(class_rel_source)), expr_point)
00847     );
00848 
00849     -- Retrieve result from state table and return it
00850     EXECUTE
00851         $sql$
00852         SELECT _state FROM _madlib_kmeans_random_state
00853         WHERE _iteration = $sql$ || theIteration || $sql$
00854         $sql$
00855         INTO theResult;
00856     RETURN theResult;
00857 END;
00858 $$ LANGUAGE plpgsql VOLATILE;
00859 
00860 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random_seeding(
00861     rel_source VARCHAR,
00862     expr_point VARCHAR,
00863     k INTEGER
00864 ) RETURNS DOUBLE PRECISION[][]
00865 LANGUAGE sql AS $$
00866     SELECT MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3, NULL)
00867 $$;
00868 
00869 /**
00870  * @brief Run k-Means with random seeding.
00871  *
00872  * This is a shortcut for running k-means with random seeding. It is equivalent
00873  * to
00874  * <pre>SELECT \ref kmeans(
00875     rel_source,
00876     expr_point,
00877     \ref kmeans_random_seeding(
00878         rel_source,
00879         expr_point,
00880         k
00881     ),
00882     fn_dist,
00883     agg_centroid,
00884     max_num_iterations,
00885     min_frac_reassigned
00886 )</pre>
00887  */
00888 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
00889     rel_source VARCHAR,
00890     expr_point VARCHAR,
00891     k INTEGER,
00892     fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
00893     agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
00894     max_num_iterations INTEGER /*+ DEFAULT 20 */,
00895     min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
00896 ) RETURNS MADLIB_SCHEMA.kmeans_result
00897 VOLATILE
00898 STRICT
00899 LANGUAGE plpgsql
00900 AS $$
00901 DECLARE
00902     ret MADLIB_SCHEMA.kmeans_result;
00903 BEGIN
00904     ret = MADLIB_SCHEMA.kmeans(
00905         $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
00906         $4, $5, $6, $7);
00907     RETURN ret;
00908 END
00909 $$;
00910 
00911 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
00912     rel_source VARCHAR,
00913     expr_point VARCHAR,
00914     k INTEGER,
00915     fn_dist VARCHAR,
00916     agg_centroid VARCHAR,
00917     max_num_iterations INTEGER
00918 ) RETURNS MADLIB_SCHEMA.kmeans_result
00919 VOLATILE
00920 STRICT
00921 LANGUAGE plpgsql
00922 AS $$
00923 DECLARE
00924     ret MADLIB_SCHEMA.kmeans_result;
00925 BEGIN
00926     ret = MADLIB_SCHEMA.kmeans(
00927         $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
00928         $4, $5, $6, 0.001);
00929     RETURN ret;
00930 END
00931 $$;
00932 
00933 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
00934     rel_source VARCHAR,
00935     expr_point VARCHAR,
00936     k INTEGER,
00937     fn_dist VARCHAR,
00938     agg_centroid VARCHAR
00939 ) RETURNS MADLIB_SCHEMA.kmeans_result
00940 VOLATILE
00941 STRICT
00942 LANGUAGE plpgsql
00943 AS $$
00944 DECLARE
00945     ret MADLIB_SCHEMA.kmeans_result;
00946 BEGIN
00947     ret = MADLIB_SCHEMA.kmeans(
00948         $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
00949         $4, $5, 20, 0.001);
00950     RETURN ret;
00951 END
00952 $$;
00953 
00954 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
00955     rel_source VARCHAR,
00956     expr_point VARCHAR,
00957     k INTEGER,
00958     fn_dist VARCHAR
00959 ) RETURNS MADLIB_SCHEMA.kmeans_result
00960 VOLATILE
00961 STRICT
00962 LANGUAGE plpgsql
00963 AS $$
00964 DECLARE
00965     ret MADLIB_SCHEMA.kmeans_result;
00966 BEGIN
00967     ret = MADLIB_SCHEMA.kmeans(
00968         $1, $2,
00969         MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
00970         $4, 'MADLIB_SCHEMA.avg', 20, 0.001);
00971     RETURN ret;
00972 END
00973 $$;
00974 
00975 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random(
00976     rel_source VARCHAR,
00977     expr_point VARCHAR,
00978     k INTEGER
00979 ) RETURNS MADLIB_SCHEMA.kmeans_result
00980 VOLATILE
00981 STRICT
00982 LANGUAGE plpgsql
00983 AS $$
00984 DECLARE
00985     ret MADLIB_SCHEMA.kmeans_result;
00986 BEGIN
00987     ret = MADLIB_SCHEMA.kmeans(
00988         $1, $2,
00989         MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3),
00990         'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001);
00991     RETURN ret;
00992 END
00993 $$;
00994 
00995 /**
00996  * @internal
00997  * @brief Execute a SQL command where $1, ..., $6 are substituted with the
00998  *     given arguments.
00999  */
01000 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_args(
01001     sql VARCHAR, rel_source VARCHAR, expr_point VARCHAR,
01002     fn_dist VARCHAR, agg_centroid VARCHAR, max_num_iterations INTEGER,
01003     min_frac_reassigned DOUBLE PRECISION
01004 ) RETURNS MADLIB_SCHEMA.kmeans_result
01005 VOLATILE
01006 CALLED ON NULL INPUT
01007 LANGUAGE c
01008 AS 'MODULE_PATHNAME', 'exec_sql_using';
01009 
01010 /**
01011  * @internal
01012  * @brief Filter out the invalid data points in the original input relation
01013  */
01014 CREATE FUNCTION MADLIB_SCHEMA.__filter_input_relation(
01015     rel_source VARCHAR, expr_point VARCHAR)
01016 RETURNS VARCHAR
01017 AS $$
01018 DECLARE
01019     oldClientMinMessages VARCHAR;
01020     rel_source_filtered VARCHAR;
01021 BEGIN
01022     IF (SELECT position('.' in rel_source)) > 0 THEN
01023         rel_source_filtered := '_madlib_' || split_part(rel_source, '.', 2) || '_filtered';
01024     ELSE
01025     rel_source_filtered := '_madlib_' || rel_source || '_filtered';
01026     END IF;
01027 
01028     oldClientMinMessages :=
01029         (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
01030     EXECUTE 'SET client_min_messages TO warning';
01031     EXECUTE 'DROP VIEW IF EXISTS _madlib_'||rel_source_filtered||'_filtered';
01032     EXECUTE 'DROP VIEW IF EXISTS '||rel_source_filtered;
01033     EXECUTE 'CREATE TEMP VIEW '||rel_source_filtered||'
01034              AS SELECT * FROM '||rel_source||'
01035                     WHERE abs(
01036                               coalesce(
01037                                  MADLIB_SCHEMA.svec_elsum('||expr_point||'),
01038                                  ''Infinity''::FLOAT8
01039                               )
01040                              ) < ''Infinity''::FLOAT8';
01041     EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages;
01042     RETURN rel_source_filtered;
01043     EXCEPTION
01044         WHEN undefined_function THEN
01045         RAISE EXCEPTION 'Point coordinates (%) are not a valid type
01046                         (SVEC, FLOAT[], or INTEGER[]).', expr_point;
01047 END
01048 $$
01049 VOLATILE STRICT
01050 LANGUAGE PLPGSQL;
01051 
01052 /**
01053  * @brief Perform Lloyd's k-means local-search heuristic, but with initial
01054  *     centroids stored in a table
01055  *
01056  * This is a shortcut for running k-means with initial centroids stored in a
01057  * table (as opposed to an array of centroids). It is equivalent
01058  * to
01059  * <pre>SELECT \ref kmeans(
01060     rel_source,
01061     expr_point,
01062     (SELECT \ref matrix_agg($expr_centroid) FROM $rel_initial_centroids),
01063     fn_dist,
01064     agg_centroid,
01065     max_num_iterations,
01066     min_frac_reassigned
01067 )</pre>
01068  * where <tt>$expr_centroid</tt> and <tt>$rel_initial_centroids</tt> denote
01069  * textual substituions.
01070  */
01071 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
01072     rel_source VARCHAR,
01073     expr_point VARCHAR,
01074     rel_initial_centroids VARCHAR,
01075     expr_centroid VARCHAR,
01076     fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */,
01077     agg_centroid VARCHAR /*+ DEFAULT 'avg' */,
01078     max_num_iterations INTEGER /*+ DEFAULT 20 */,
01079     min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */
01080 ) RETURNS MADLIB_SCHEMA.kmeans_result
01081 VOLATILE
01082 STRICT
01083 LANGUAGE plpgsql
01084 AS $$
01085 DECLARE
01086     class_rel_initial_centroids REGCLASS;
01087     theResult MADLIB_SCHEMA.kmeans_result;
01088 BEGIN
01089     class_rel_initial_centroids := rel_initial_centroids;
01090     SELECT * FROM MADLIB_SCHEMA.internal_execute_using_kmeans_args($sql$
01091         SELECT MADLIB_SCHEMA.kmeans(
01092             $1, $2,
01093             (
01094                 SELECT MADLIB_SCHEMA.matrix_agg(($sql$ || expr_centroid || $sql$)::FLOAT8[])
01095                 FROM $sql$ || textin(regclassout(class_rel_initial_centroids))
01096                     || $sql$
01097             ),
01098             $3, $4, $5, $6)
01099             $sql$,
01100         rel_source, expr_point,
01101         fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned)
01102         INTO theResult;
01103     RETURN theResult;
01104 END;
01105 $$;
01106 
01107 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
01108     rel_source VARCHAR,
01109     expr_point VARCHAR,
01110     rel_initial_centroids VARCHAR,
01111     expr_centroid VARCHAR,
01112     fn_dist VARCHAR,
01113     agg_centroid VARCHAR,
01114     max_num_iterations INTEGER
01115 ) RETURNS MADLIB_SCHEMA.kmeans_result
01116 VOLATILE
01117 STRICT
01118 LANGUAGE sql AS $$
01119     SELECT MADLIB_SCHEMA.kmeans(
01120         $1, $2,
01121         $3, $4, $5, $6, $7, 0.001)
01122 $$;
01123 
01124 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
01125     rel_source VARCHAR,
01126     expr_point VARCHAR,
01127     rel_initial_centroids VARCHAR,
01128     expr_centroid VARCHAR,
01129     fn_dist VARCHAR,
01130     agg_centroid VARCHAR
01131 ) RETURNS MADLIB_SCHEMA.kmeans_result
01132 VOLATILE
01133 STRICT
01134 LANGUAGE sql AS $$
01135     SELECT MADLIB_SCHEMA.kmeans(
01136         $1, $2,
01137         $3, $4, $5, $6, 20, 0.001)
01138 $$;
01139 
01140 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
01141     rel_source VARCHAR,
01142     expr_point VARCHAR,
01143     rel_initial_centroids VARCHAR,
01144     expr_centroid VARCHAR,
01145     fn_dist VARCHAR
01146 ) RETURNS MADLIB_SCHEMA.kmeans_result
01147 VOLATILE
01148 STRICT
01149 LANGUAGE sql AS $$
01150     SELECT MADLIB_SCHEMA.kmeans(
01151         $1, $2,
01152         $3, $4, $5, 'MADLIB_SCHEMA.avg', 20, 0.001)
01153 $$;
01154 
01155 CREATE FUNCTION MADLIB_SCHEMA.kmeans(
01156     rel_source VARCHAR,
01157     expr_point VARCHAR,
01158     rel_initial_centroids VARCHAR,
01159     expr_centroid VARCHAR
01160 ) RETURNS MADLIB_SCHEMA.kmeans_result
01161 VOLATILE
01162 STRICT
01163 LANGUAGE sql AS $$
01164     SELECT MADLIB_SCHEMA.kmeans(
01165         $1, $2,
01166         $3, $4,
01167         'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001)
01168 $$;
01169 
01170 
01171 /**
01172  * @internal
01173  * @brief Execute a SQL command where $1, ..., $3 are substituted with the
01174  *     given arguments.
01175  */
01176 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_silhouette_args(
01177     sql VARCHAR, centroids DOUBLE PRECISION[][], fn_dist REGPROC
01178 ) RETURNS DOUBLE PRECISION
01179 STABLE
01180 CALLED ON NULL INPUT
01181 LANGUAGE c
01182 AS 'MODULE_PATHNAME', 'exec_sql_using';
01183 
01184 /**
01185  * @brief Compute a simplified version of the silhouette coefficient
01186  *
01187  * @param rel_source Name of the relation containing input points
01188  * @param expr_point Expression evaluating to point coordinates \f$ x_i \f$ for
01189  *     each tuple
01190  * @param centroids Matrix \f$ M = (\vec{m_0} \dots \vec{m_{k-1}})
01191  *     \in \mathbb{R}^{d \times k} \f$ with \f$ k \f$ columns, where column
01192  *     \f$ i \f$ contains the position of centroid \f$ i \f$.
01193  * @param fn_dist Name of a function with signature
01194  *     <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that
01195  *     returns the distance between two points
01196  * @return For each point \f$ x_i \f$, let
01197  *     \f$ d_1( x_i ) \f$ and \f$ d_2( x_i ) \f$ be the distance to the closest
01198  *     and 2nd-closest centroid, respectively. If there is more than one
01199  *     closest centroids then \f$ d_1( x_i ) = d_2( x_i )\f$.
01200  *     The return value is the average, over all points \f$ x_i \f$, of
01201  *     \f[
01202  *         \frac{d_2( x_i ) - d_1(x_i)}{d_2(x_i)},
01203  *     \f]
01204  *     where 0/0 is interpreted as 0.
01205  *     Clearly, the simplified silhouette coefficient assumes values in
01206  *     \f$ [0,1] \f$.
01207  */
01208 CREATE FUNCTION MADLIB_SCHEMA.simple_silhouette(
01209     rel_source VARCHAR,
01210     expr_point VARCHAR,
01211     centroids DOUBLE PRECISION[][],
01212     fn_dist VARCHAR /*+ DEFAULT 'dist_norm2' */
01213 ) RETURNS DOUBLE PRECISION
01214 STABLE
01215 STRICT
01216 LANGUAGE plpgsql
01217 AS $$
01218 DECLARE
01219     class_rel_source REGCLASS;
01220     proc_fn_dist REGPROCEDURE;
01221     rel_filtered VARCHAR;
01222 BEGIN
01223     IF (array_upper(centroids,1) IS NULL) THEN
01224     RAISE EXCEPTION 'No valid centroids given.';
01225     END IF;
01226 
01227     rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
01228     class_rel_source := rel_filtered;
01229     proc_fn_dist := fn_dist
01230         || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
01231     IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
01232         FROM pg_proc WHERE oid = proc_fn_dist) THEN
01233         RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
01234     END IF;
01235 
01236     RETURN MADLIB_SCHEMA.internal_execute_using_silhouette_args($sql$
01237         SELECT
01238             avg(CASE
01239                     WHEN distances[2] = 0 THEN 0
01240                     ELSE (distances[2] - distances[1]) / distances[2]
01241                 END)
01242         FROM (
01243             SELECT
01244                 (MADLIB_SCHEMA.closest_columns(
01245                     $1,
01246                     ($sql$ || expr_point || $sql$)::FLOAT8[],
01247                     2::INTEGER,
01248                     $2
01249                 )).distances
01250             FROM
01251                 $sql$ || textin(regclassout(class_rel_source)) || $sql$
01252         ) AS two_shortest_distances
01253         $sql$,
01254         centroids, proc_fn_dist);
01255 END;
01256 $$;
01257 
01258 CREATE FUNCTION MADLIB_SCHEMA.simple_silhouette(
01259     rel_source VARCHAR,
01260     expr_point VARCHAR,
01261     centroids DOUBLE PRECISION[][]
01262 ) RETURNS DOUBLE PRECISION
01263 STABLE
01264 STRICT
01265 LANGUAGE sql
01266 AS $$
01267     SELECT MADLIB_SCHEMA.simple_silhouette($1, $2, $3,
01268         'MADLIB_SCHEMA.dist_norm2')
01269 $$;