MADlib
0.7 A newer version is available
User Documentation
|
00001 /* ----------------------------------------------------------------------- *//** 00002 * 00003 * @file kmeans.sql_in 00004 * 00005 * @brief Set of functions for k-means clustering. 00006 * 00007 * @sa For a brief introduction to k-means clustering, see the module 00008 * description \ref grp_kmeans. 00009 * 00010 *//* ----------------------------------------------------------------------- */ 00011 00012 m4_include(`SQLCommon.m4') 00013 00014 /** 00015 @addtogroup grp_kmeans 00016 00017 @about 00018 00019 Clustering refers to the problem of partitioning a set of objects according to 00020 some problem-dependent measure of <em>similarity</em>. In the k-means variant, 00021 one is given \f$ n \f$ points \f$ x_1, \dots, x_n \in \mathbb R^d \f$, and the 00022 goal is to position \f$ k \f$ centroids \f$ c_1, \dots, c_k \in \mathbb R^d \f$ 00023 so that the sum of \em distances between each point and its closest centroid is 00024 minimized. Each centroid represents a cluster that consists of all points to 00025 which this centroid is closest. Formally, we wish to minimize the following 00026 objective function: 00027 \f[ 00028 (c_1, \dots, c_k) \mapsto \sum_{i=1}^n \min_{j=1}^k \operatorname{dist}(x_i, c_j) 00029 \f] 00030 In the most common case, \f$ \operatorname{dist} \f$ is the square of the 00031 Euclidean distance. 00032 00033 This problem is computationally difficult (NP-hard), yet the 00034 local-search heuristic proposed by Lloyd [4] performs reasonably well in 00035 practice. In fact, it is so ubiquitous today that it is 00036 often referred to as the <em>standard algorithm</em> or even just the 00037 <em>k-means algorithm</em> [1]. It works as follows: 00038 00039 -# Seed the \f$ k \f$ centroids (see below) 00040 -# Repeat until convergence: 00041 -# Assign each point to its closest centroid 00042 -# Move each centroid to a position that minimizes the sum of distances in this 00043 cluster 00044 -# Convergence is achieved when no points change their assignments during step 00045 2a. 00046 00047 Since the objective function decreases in every step, this algorithm is 00048 guaranteed to converge to a local optimum. 00049 00050 @implementation 00051 00052 Data points and predefined centroids (if used) are expected to be stored row-wise, 00053 in a column of type <tt>\ref grp_svec "SVEC"</tt> (or any type convertible to 00054 <tt>\ref grp_svec "SVEC"</tt>, like <tt>FLOAT[]</tt> or <tt>INTEGER[]</tt>). 00055 Data points with non-finite values (NULL, NaN, infinity) in any component 00056 will be skipped during analysis. 00057 00058 The following methods are available for the centroid seeding: 00059 - <strong>random selection</strong>: 00060 Select \f$ k \f$ centroids randomly among the input points. 00061 - <strong>kmeans++</strong> [2]: 00062 Start with a single centroid chosen randomly among the input points. Then 00063 iteratively choose new 00064 centroids from the input points until there is a total of \f$ k \f$ 00065 centroids. The probability for picking a particular point is proportional to 00066 its minimum distance to any existing centroid. 00067 \n 00068 Intuitively, kmeans++ favors seedings where centroids are spread out over the 00069 whole range of the input points, while at the same time not being too 00070 susceptible to outliers [2]. 00071 - <strong>user-specified set of initial centroids</strong>: 00072 See below for a description of the expected format of the set of initial 00073 centroids. 00074 00075 The following distance functions can be used (computation of barycenter/mean in parentheses): 00076 - <strong>\ref dist_norm1</strong>: 1-norm/Manhattan (element-wise median [Note that MADlib does not provide a median 00077 aggregate function for support and performance reasons.]) 00078 - <strong>\ref dist_norm2</strong>: 2-norm/Euclidean (element-wise mean) 00079 - <strong>\ref squared_dist_norm2</strong>: squared Euclidean distance (element-wise mean) 00080 - <strong>\ref dist_angle</strong>: angle (element-wise mean of normalized points) 00081 - <strong>\ref dist_tanimoto</strong>: tanimoto (element-wise mean of normalized points [5]) 00082 - <strong>user defined function</strong> with signature DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION 00083 00084 The following aggregate functions for determining centroids can be used: 00085 - <strong>\ref avg</strong>: average 00086 - <strong>\ref normalized_avg</strong>: normalized average 00087 00088 The algorithm stops when one of the following conditions is met: 00089 - The fraction of updated points is smaller than the convergence threshold (default: 0.001). 00090 - The algorithm reaches the maximum number of allowed iterations (default: 20). 00091 00092 A popular method to assess the quality of the clustering is the 00093 <em>silhouette coefficient</em>, a simplified version of which is provided as part of the k-means module. 00094 Note that for large data sets, this computation is expensive. 00095 00096 @input 00097 The <strong>source relation</strong> is expected to be of the following form (or 00098 to be implicitly convertible into the following form): 00099 <pre>{TABLE|VIEW} <em>rel_source</em> ( 00100 ... 00101 <em>expr_points</em> FLOAT8[], 00102 ... 00103 )</pre> 00104 where: 00105 - <em>expr_points</em> is the name of a column with point coordinates. 00106 Types such as \c svec or <tt>INTEGER[]</tt> are implicitly converted to 00107 <tt>FLOAT8[]</tt>. 00108 00109 If kmeans is called with a set of initial centroids, the centroid relation is 00110 expected to be of the following form: 00111 <pre>{TABLE|VIEW} <em>rel_initial_centroids</em> ( 00112 ... 00113 <em>expr_centroid</em> DOUBLE PRECISION[], 00114 ... 00115 )</pre> 00116 where: 00117 - <em>expr_centroid</em> is the name of a column with coordinates. 00118 00119 @usage 00120 The k-means algorithm can be invoked in four possible ways: 00121 00122 - using <em>random</em> centroid seeding method for a 00123 provided \f$ k \f$: 00124 <pre>SELECT * FROM \ref kmeans_random( 00125 '<em>rel_source</em>', '<em>expr_point</em>', k, 00126 [ '<em>fn_dist</em>', '<em>agg_centroid</em>', 00127 <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ] 00128 );</pre> 00129 00130 - using <em>kmeans++</em> centroid seeding method for a 00131 provided \f$ k \f$: 00132 <pre>SELECT * FROM \ref kmeanspp( 00133 '<em>rel_source</em>', '<em>expr_point</em>', k, 00134 [ '<em>fn_dist</em>', '<em>agg_centroid</em>', 00135 <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ] 00136 );</pre> 00137 00138 - with a provided centroid set: 00139 <pre>SELECT * FROM \ref kmeans( 00140 '<em>rel_source</em>', '<em>expr_point</em>', 00141 '<em>rel_initial_centroids</em>', '<em>expr_centroid</em>', 00142 [ '<em>fn_dist</em>', '<em>agg_centroid</em>', 00143 <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ] 00144 );</pre> 00145 ------------ OR --------------- 00146 <pre>SELECT * FROM \ref kmeans( 00147 '<em>rel_source</em>', '<em>expr_point</em>', 00148 initial_centroids, 00149 [ '<em>fn_dist</em>', '<em>agg_centroid</em>', 00150 <em>max_num_iterations</em>, <em>min_frac_reassigned</em> ] 00151 );</pre> 00152 where: 00153 - <em>initial_centroids</em> is of type <tt>DOUBLE PRECISION[][]</tt>. 00154 00155 The output of the k-means module is a table that includes the final 00156 centroid positions (DOUBLE PRECISION[][]), the objective function, 00157 the fraction of reassigned points in the last iteration, and 00158 the number of total iterations: 00159 <pre> 00160 centroids | objective_fn | frac_reassigned | num_iterations 00161 ----------------------------------+------------------+-----------------+---------------- 00162 ... 00163 </pre> 00164 00165 @examp 00166 00167 -# Prepare some input data. 00168 \code 00169 sql> SELECT * FROM public.km_sample LIMIT 5; 00170 points 00171 ------------------------------------------- 00172 {1,1}:{15.8822241332382,105.945462542586} 00173 {1,1}:{34.5065216883086,72.3126099305227} 00174 {1,1}:{22.5074400822632,95.3209559689276} 00175 {1,1}:{70.2589857042767,68.7395178806037} 00176 {1,1}:{30.9844257542863,25.3213323024102} 00177 (5 rows) 00178 \endcode 00179 Note: the example <em>points</em> is type <tt>\ref grp_svec "SVEC"</tt>. 00180 -# Run k-means clustering using kmeans++ for centroid seeding: 00181 \code 00182 sql> SELECT * FROM madlib.kmeanspp('km_sample', 'points', 2, 'madlib.squared_dist_norm2', 'madlib.avg', 20, 0.001); 00183 ); 00184 centroids | objective_fn | frac_reassigned | num_iterations 00185 -------------------------------------------------------------------------+------------------+-----------------+---------------- 00186 {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | 5 00187 \endcode 00188 -# Calculate the simplified silhouette coefficient: 00189 \code 00190 sql> SELECT * from madlib.simple_silhouette('km_test_svec','points', 00191 (select centroids from madlib.kmeanspp('km_test_svec','points',2,'madlib.squared_dist_norm2','madlib.avg',20,0.001)), 00192 'madlib.dist_norm2'); 00193 simple_silhouette 00194 ------------------- 00195 0.611022970398174 00196 \endcode 00197 00198 @literature 00199 00200 [1] Wikipedia, K-means Clustering, 00201 http://en.wikipedia.org/wiki/K-means_clustering 00202 00203 [2] David Arthur, Sergei Vassilvitskii: k-means++: the advantages of careful 00204 seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete 00205 Algorithms (SODA'07), pp. 1027-1035, 00206 http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf 00207 00208 [3] E. R. Hruschka, L. N. C. Silva, R. J. G. B. Campello: Clustering 00209 Gene-Expression Data: A Hybrid Approach that Iterates Between k-Means and 00210 Evolutionary Search. In: Studies in Computational Intelligence - Hybrid 00211 Evolutionary Algorithms. pp. 313-335. Springer. 2007. 00212 00213 [4] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell 00214 Laboratories. Published much later in: IEEE Transactions on Information 00215 Theory 28(2), pp. 128-137. 1982. 00216 00217 [5] Leisch, Friedrich: A Toolbox for K-Centroids Cluster Analysis. In: Computational 00218 Statistics and Data Analysis, 51(2). pp. 526-544. 2006. 00219 00220 @sa File kmeans.sql_in documenting the SQL functions. 00221 00222 @internal 00223 @sa namespace kmeans (documenting the implementation in Python) 00224 @endinternal 00225 */ 00226 00227 /* 00228 * @brief k-Means return type 00229 * 00230 * A composite value: 00231 * - <tt>centroids</tt> - Matrix containing the new \f$ l \leq k \f$ 00232 * repositioned centroids as columns. If this matrix has \f$ l < k \f$ 00233 * columns, one or more old centroids no longer were closest to any point. 00234 * - <tt>old_centroid_its</tt> - The order of the centroids in 00235 * <tt>centroid</tt> is not guaranteed to be consitent across iterations. 00236 * In particular, if a centroid is no longer closest to any point it can be 00237 * dropped and a new centroid is added afterwards. We therefore need to map 00238 * positions in <tt>centroids</tt> to the respective positions in the 00239 * previous iteration. 00240 * - <tt>objective_fn</tt> - Value of the objective function, i.e., 00241 * \f$ \sum_{x \in P} \dist(x, C)^2 \f$ where 00242 * \f$ P \f$ is the set of points, \f$ C \f$ is the set of centroids, and 00243 * \f$ \dist(x, C) := \min_{c \in C} \operatorname{dist}(x, c) \f$. 00244 * - <tt>frac_reassigned</tt> - Fraction of points that was assigned a 00245 * different centroid in the current iteration. 00246 * - <tt>num_iterations</tt> - Number of iterations performed (so far). 00247 */ 00248 CREATE TYPE MADLIB_SCHEMA.kmeans_result AS ( 00249 centroids DOUBLE PRECISION[][], 00250 objective_fn DOUBLE PRECISION, 00251 frac_reassigned DOUBLE PRECISION, 00252 num_iterations INTEGER 00253 ); 00254 00255 /* 00256 * @brief k-Means inter-iteration state type 00257 * 00258 * A composite value like \ref{kmeans_result}. Additional fields: 00259 * - <tt>old_centroid_its</tt> - The order of the centroids in 00260 * <tt>centroid</tt> is not guaranteed to be consitent across iterations. 00261 * In particular, if a centroid is no longer closest to any point it can be 00262 * dropped and a new centroid is added afterwards. We therefore need to map 00263 * positions in <tt>centroids</tt> to the respective positions in the 00264 * previous iteration. 00265 * - <tt>num_iterations</tt> - Number of iterations performed (so far). 00266 */ 00267 CREATE TYPE MADLIB_SCHEMA.kmeans_state AS ( 00268 centroids DOUBLE PRECISION[][], 00269 old_centroid_ids INTEGER[], 00270 objective_fn DOUBLE PRECISION, 00271 frac_reassigned DOUBLE PRECISION 00272 ); 00273 00274 /** 00275 * @internal 00276 * @brief Execute a SQL command where $1, ..., $4 are substituted with the 00277 * given arguments. 00278 */ 00279 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_args( 00280 sql VARCHAR, DOUBLE PRECISION[][], REGPROC, INTEGER, DOUBLE PRECISION 00281 ) RETURNS VOID 00282 VOLATILE 00283 CALLED ON NULL INPUT 00284 LANGUAGE c 00285 AS 'MODULE_PATHNAME', 'exec_sql_using'; 00286 00287 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans( 00288 rel_args VARCHAR, 00289 rel_state VARCHAR, 00290 rel_source VARCHAR, 00291 expr_point VARCHAR, 00292 agg_centroid VARCHAR) 00293 RETURNS INTEGER 00294 VOLATILE 00295 LANGUAGE plpythonu 00296 AS $$PythonFunction(kmeans, kmeans, compute_kmeans)$$; 00297 00298 /** 00299 * @brief Perform Lloyd's k-means local-search heuristic 00300 * 00301 * @param rel_source Name of the relation containing input points 00302 * @param expr_point Expression evaluating to point coordinates for each tuple 00303 * @param initial_centroids Matrix containing the initial centroids as columns 00304 * @param fn_dist Name of a function with signature 00305 * <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that 00306 * returns the distance between two points. The default is the 00307 * \ref squared_dist_norm2(float8[],float8[]) "squared Euclidean distance". 00308 * @param agg_centroid Name of an aggregate function with signature 00309 * <tt>DOUBLE PRECISION[] -> DOUBLE PRECISION[]</tt> that, for each group 00310 * of points, returns a centroid. In order for Lloyd's local-search 00311 * heuristic to provably converge and to return a local minimum, this 00312 * centroid should minimize the sum of distances between each point in the 00313 * group and the centroid. The default is the 00314 * \ref avg(float8[]) "average (mean/barycenter in Euclidean space)", 00315 * which satisfies this property if <tt>fn_dist = 'squared_dist_norm2'</tt>. 00316 * @param max_num_iterations Maximum number of iterations 00317 * @param min_frac_reassigned Fraction of reassigned points below which 00318 * convergence is assumed and the algorithm terminates 00319 * @returns A composite value: 00320 * - <tt>centroids</tt> - Matrix with \f$ k \f$ centroids as columns. 00321 * - <tt>frac_reassigned</tt> - Fraction of points that were assigned a 00322 * different centroid in the last iteration. 00323 * - <tt>num_iterations</tt> - The number of iterations before the 00324 * algorithm terminated 00325 */ 00326 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 00327 rel_source VARCHAR, 00328 expr_point VARCHAR, 00329 initial_centroids DOUBLE PRECISION[][], 00330 fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */, 00331 agg_centroid VARCHAR /*+ DEFAULT 'avg' */, 00332 max_num_iterations INTEGER /*+ DEFAULT 20 */, 00333 min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */ 00334 ) RETURNS MADLIB_SCHEMA.kmeans_result AS $$ 00335 DECLARE 00336 theIteration INTEGER; 00337 theResult MADLIB_SCHEMA.kmeans_result; 00338 oldClientMinMessages VARCHAR; 00339 class_rel_source REGCLASS; 00340 proc_fn_dist REGPROCEDURE; 00341 proc_agg_centroid REGPROCEDURE; 00342 rel_filtered VARCHAR; 00343 num_points INTEGER; 00344 k INTEGER; 00345 centroids FLOAT8[]; 00346 BEGIN 00347 IF (array_upper(initial_centroids,1) IS NULL) THEN 00348 RAISE EXCEPTION 'No valid initial centroids given.'; 00349 END IF; 00350 00351 centroids := ARRAY(SELECT unnest(initial_centroids)); 00352 IF (SELECT MADLIB_SCHEMA.svec_elsum(centroids)) >= 'Infinity'::float THEN 00353 RAISE EXCEPTION 'At least one initial centroid has non-finite values.'; 00354 END IF; 00355 00356 rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point); 00357 class_rel_source := rel_filtered; 00358 proc_fn_dist := fn_dist 00359 || '(DOUBLE PRECISION[], DOUBLE PRECISION[])'; 00360 IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE 00361 FROM pg_proc WHERE oid = proc_fn_dist) THEN 00362 RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.'; 00363 END IF; 00364 proc_agg_centroid := agg_centroid || '(DOUBLE PRECISION[])'; 00365 IF (SELECT prorettype != 'DOUBLE PRECISION[]'::regtype OR proisagg = FALSE 00366 FROM pg_proc WHERE oid = proc_agg_centroid) THEN 00367 RAISE EXCEPTION 'Mean aggregate has wrong signature or is not an aggregate.'; 00368 END IF; 00369 IF (min_frac_reassigned < 0) OR (min_frac_reassigned > 1) THEN 00370 RAISE EXCEPTION 'Convergence threshold is not a valid value (must be a fraction between 0 and 1).'; 00371 END IF; 00372 IF (max_num_iterations < 0) THEN 00373 RAISE EXCEPTION 'Number of iterations must be a non-negative integer.'; 00374 END IF; 00375 00376 -- Extra parameter check added so that ERROR output is more user-readable (doesn't include Python traceback) 00377 k := array_upper(initial_centroids,1); 00378 IF (k <= 0) THEN 00379 RAISE EXCEPTION 'Number of clusters k must be a positive integer.'; 00380 END IF; 00381 IF (k > 32767) THEN 00382 RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).'; 00383 END IF; 00384 EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points ; 00385 IF (num_points < k) THEN 00386 RAISE EXCEPTION 'Number of centroids is greater than number of points.'; 00387 END IF; 00388 00389 -- We first setup the argument table. Rationale: We want to avoid all data 00390 -- conversion between native types and Python code. Instead, we use Python 00391 -- as a pure driver layer. 00392 PERFORM MADLIB_SCHEMA.create_schema_pg_temp(); 00393 oldClientMinMessages := 00394 (SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); 00395 EXECUTE 'SET client_min_messages TO warning'; 00396 00397 -- Unfortunately, the EXECUTE USING syntax is only available starting 00398 -- PostgreSQL 8.4: 00399 -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN 00400 -- We therefore have to emulate. 00401 PERFORM MADLIB_SCHEMA.internal_execute_using_kmeans_args($sql$ 00402 DROP TABLE IF EXISTS pg_temp._madlib_kmeans_args; 00403 CREATE TABLE pg_temp._madlib_kmeans_args AS 00404 SELECT 00405 $1 AS initial_centroids, array_upper($1, 1) AS k, 00406 $2 AS fn_dist, $3 AS max_num_iterations, 00407 $4 AS min_frac_reassigned; 00408 $sql$, 00409 initial_centroids, proc_fn_dist, max_num_iterations, 00410 min_frac_reassigned); 00411 EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages; 00412 00413 -- Perform acutal computation. 00414 -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion 00415 -- operators from regclass to varchar/text. 00416 theIteration := MADLIB_SCHEMA.internal_compute_kmeans('_madlib_kmeans_args', 00417 '_madlib_kmeans_state', 00418 textin(regclassout(class_rel_source)), expr_point, 00419 textin(regprocout(proc_agg_centroid))); 00420 00421 -- Retrieve result from state table and return it 00422 EXECUTE 00423 $sql$ 00424 SELECT (_state).centroids, (_state).objective_fn, 00425 (_state).frac_reassigned, NULL 00426 FROM _madlib_kmeans_state 00427 WHERE _iteration = $sql$ || theIteration || $sql$ 00428 $sql$ 00429 INTO theResult; 00430 -- The number of iterations are not updated in the C++ code. We do it here. 00431 IF NOT (theResult IS NULL) THEN 00432 theResult.num_iterations = theIteration; 00433 END IF; 00434 RETURN theResult; 00435 END; 00436 $$ LANGUAGE plpgsql VOLATILE; 00437 00438 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 00439 rel_source VARCHAR, 00440 expr_point VARCHAR, 00441 initial_centroids DOUBLE PRECISION[][], 00442 fn_dist VARCHAR, 00443 agg_centroid VARCHAR, 00444 max_num_iterations INTEGER 00445 ) RETURNS MADLIB_SCHEMA.kmeans_result 00446 VOLATILE 00447 STRICT 00448 LANGUAGE sql AS $$ 00449 SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, $5, $6, 0.001) 00450 $$; 00451 00452 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 00453 rel_source VARCHAR, 00454 expr_point VARCHAR, 00455 initial_centroids DOUBLE PRECISION[][], 00456 fn_dist VARCHAR, 00457 agg_centroid VARCHAR 00458 ) RETURNS MADLIB_SCHEMA.kmeans_result 00459 VOLATILE 00460 STRICT 00461 LANGUAGE sql AS $$ 00462 SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, $5, 20, 0.001) 00463 $$; 00464 00465 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 00466 rel_source VARCHAR, 00467 expr_point VARCHAR, 00468 initial_centroids DOUBLE PRECISION[][], 00469 fn_dist VARCHAR 00470 ) RETURNS MADLIB_SCHEMA.kmeans_result 00471 VOLATILE 00472 STRICT 00473 LANGUAGE sql AS $$ 00474 SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, $4, 'MADLIB_SCHEMA.avg', 20, 00475 0.001) 00476 $$; 00477 00478 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 00479 rel_source VARCHAR, 00480 expr_point VARCHAR, 00481 initial_centroids DOUBLE PRECISION[][] 00482 ) RETURNS MADLIB_SCHEMA.kmeans_result 00483 VOLATILE 00484 STRICT 00485 LANGUAGE sql AS $$ 00486 SELECT MADLIB_SCHEMA.kmeans($1, $2, $3, 00487 'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001) 00488 $$; 00489 00490 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeanspp_seeding_args( 00491 sql VARCHAR, INTEGER, REGPROC, DOUBLE PRECISION[][] 00492 ) RETURNS VOID 00493 VOLATILE 00494 CALLED ON NULL INPUT 00495 LANGUAGE c 00496 AS 'MODULE_PATHNAME', 'exec_sql_using'; 00497 00498 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeanspp_seeding( 00499 rel_args VARCHAR, 00500 rel_state VARCHAR, 00501 rel_source VARCHAR, 00502 expr_point VARCHAR) 00503 RETURNS INTEGER 00504 AS $$PythonFunction(kmeans, kmeans, compute_kmeanspp_seeding)$$ 00505 LANGUAGE plpythonu VOLATILE; 00506 00507 /** 00508 * @brief k-Means++ Seeding 00509 * 00510 * @param rel_source Name of the relation containing input points 00511 * @param expr_point Expression evaluating to point coordinates for each tuple 00512 * @param k Number of centroids 00513 * @param fn_dist Name of a function with signature 00514 * <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that 00515 * returns the distance between two points 00516 * @param initial_centroids A matrix containing up to \f$ k \f$ columns as 00517 * columns. kmeanspp_seeding() proceeds exactly as if these centroids had 00518 * already been generated in previous iterations. This parameter may be 00519 * NULL in which all \f$ k \f$ centroids will be generated. 00520 * @returns A matrix containing \f$ k \f$ centroids as columns 00521 */ 00522 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding( 00523 rel_source VARCHAR, 00524 expr_point VARCHAR, 00525 k INTEGER, 00526 fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */, 00527 initial_centroids DOUBLE PRECISION[][] /*+ DEFAULT NULL */ 00528 ) RETURNS DOUBLE PRECISION[][] AS $$ 00529 DECLARE 00530 theIteration INTEGER; 00531 theResult DOUBLE PRECISION[][]; 00532 oldClientMinMessages VARCHAR; 00533 class_rel_source REGCLASS; 00534 proc_fn_dist REGPROCEDURE; 00535 num_points INTEGER; 00536 num_centroids INTEGER; 00537 rel_filtered VARCHAR; 00538 BEGIN 00539 rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point); 00540 class_rel_source := rel_filtered; 00541 00542 IF (initial_centroids IS NOT NULL) THEN 00543 num_centroids := array_upper(initial_centroids,1); 00544 ELSE 00545 num_centroids := k; 00546 END IF; 00547 00548 proc_fn_dist := fn_dist 00549 || '(DOUBLE PRECISION[], DOUBLE PRECISION[])'; 00550 IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE 00551 FROM pg_proc WHERE oid = proc_fn_dist) THEN 00552 RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.'; 00553 END IF; 00554 IF (k <= 0) THEN 00555 RAISE EXCEPTION 'Number of clusters k must be a positive integer.'; 00556 END IF; 00557 IF (k > 32767) THEN 00558 RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).'; 00559 END IF; 00560 EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points ; 00561 IF (num_points < k OR num_points < num_centroids) THEN 00562 RAISE EXCEPTION 'Number of centroids is greater than number of points.'; 00563 END IF; 00564 IF (k < num_centroids) THEN 00565 RAISE WARNING 'Number of clusters k is less than number of supplied initial centroids. Number of final clusters will equal number of supplied initial centroids.'; 00566 END IF; 00567 00568 -- We first setup the argument table. Rationale: We want to avoid all data 00569 -- conversion between native types and Python code. Instead, we use Python 00570 -- as a pure driver layer. 00571 oldClientMinMessages := 00572 (SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); 00573 EXECUTE 'SET client_min_messages TO warning'; 00574 PERFORM MADLIB_SCHEMA.create_schema_pg_temp(); 00575 -- Unfortunately, the EXECUTE USING syntax is only available starting 00576 -- PostgreSQL 8.4: 00577 -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN 00578 -- We therefore have to emulate. 00579 PERFORM MADLIB_SCHEMA.internal_execute_using_kmeanspp_seeding_args($sql$ 00580 DROP TABLE IF EXISTS pg_temp._madlib_kmeanspp_args; 00581 CREATE TEMPORARY TABLE _madlib_kmeanspp_args AS 00582 SELECT $1 AS k, $2 AS fn_dist, $3 AS initial_centroids; 00583 $sql$, 00584 k, proc_fn_dist, initial_centroids); 00585 EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages; 00586 00587 -- Perform acutal computation. 00588 -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion 00589 -- operators from regclass to varchar/text. 00590 theIteration := ( 00591 SELECT MADLIB_SCHEMA.internal_compute_kmeanspp_seeding( 00592 '_madlib_kmeanspp_args', '_madlib_kmeanspp_state', 00593 textin(regclassout(class_rel_source)), expr_point) 00594 ); 00595 00596 -- Retrieve result from state table and return it 00597 EXECUTE 00598 $sql$ 00599 SELECT _state FROM _madlib_kmeanspp_state 00600 WHERE _iteration = $sql$ || theIteration || $sql$ 00601 $sql$ 00602 INTO theResult; 00603 RETURN theResult; 00604 END; 00605 $$ LANGUAGE plpgsql VOLATILE; 00606 00607 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding( 00608 rel_source VARCHAR, 00609 expr_point VARCHAR, 00610 k INTEGER, 00611 fn_dist VARCHAR 00612 ) RETURNS DOUBLE PRECISION[][] 00613 LANGUAGE sql AS $$ 00614 SELECT MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4, NULL) 00615 $$; 00616 00617 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp_seeding( 00618 rel_source VARCHAR, 00619 expr_point VARCHAR, 00620 k INTEGER 00621 ) RETURNS DOUBLE PRECISION[][] 00622 LANGUAGE sql AS $$ 00623 SELECT MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, 00624 'MADLIB_SCHEMA.squared_dist_norm2', NULL) 00625 $$; 00626 00627 /** 00628 * @brief Run k-Means++. 00629 * 00630 * This is a shortcut for running k-means++. It is equivalent to 00631 * <pre>SELECT \ref kmeans( 00632 rel_source, 00633 expr_point, 00634 \ref kmeanspp_seeding( 00635 rel_source, 00636 expr_point, 00637 k, 00638 fn_dist 00639 ), 00640 fn_dist, 00641 agg_centroid, 00642 max_num_iterations, 00643 min_frac_reassigned 00644 )</pre> 00645 */ 00646 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp( 00647 rel_source VARCHAR, 00648 expr_point VARCHAR, 00649 k INTEGER, 00650 fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */, 00651 agg_centroid VARCHAR /*+ DEFAULT 'avg' */, 00652 max_num_iterations INTEGER /*+ DEFAULT 20 */, 00653 min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */ 00654 ) RETURNS MADLIB_SCHEMA.kmeans_result 00655 VOLATILE 00656 STRICT 00657 LANGUAGE plpgsql 00658 AS $$ 00659 DECLARE 00660 ret MADLIB_SCHEMA.kmeans_result; 00661 BEGIN 00662 ret = MADLIB_SCHEMA.kmeans( 00663 $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4), 00664 $4, $5, $6, $7); 00665 RETURN ret; 00666 END 00667 $$; 00668 00669 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp( 00670 rel_source VARCHAR, 00671 expr_point VARCHAR, 00672 k INTEGER, 00673 fn_dist VARCHAR, 00674 agg_centroid VARCHAR, 00675 max_num_iterations INTEGER 00676 ) RETURNS MADLIB_SCHEMA.kmeans_result 00677 VOLATILE 00678 STRICT 00679 LANGUAGE plpgsql 00680 AS $$ 00681 DECLARE 00682 ret MADLIB_SCHEMA.kmeans_result; 00683 BEGIN 00684 ret = MADLIB_SCHEMA.kmeans( 00685 $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4), 00686 $4, $5, $6, 0.001); 00687 RETURN ret; 00688 END 00689 $$; 00690 00691 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp( 00692 rel_source VARCHAR, 00693 expr_point VARCHAR, 00694 k INTEGER, 00695 fn_dist VARCHAR, 00696 agg_centroid VARCHAR 00697 ) RETURNS MADLIB_SCHEMA.kmeans_result 00698 VOLATILE 00699 STRICT 00700 LANGUAGE plpgsql 00701 AS $$ 00702 DECLARE 00703 ret MADLIB_SCHEMA.kmeans_result; 00704 BEGIN 00705 ret = MADLIB_SCHEMA.kmeans( 00706 $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4), 00707 $4, $5, 20, 0.001); 00708 RETURN ret; 00709 END 00710 $$; 00711 00712 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp( 00713 rel_source VARCHAR, 00714 expr_point VARCHAR, 00715 k INTEGER, 00716 fn_dist VARCHAR 00717 ) RETURNS MADLIB_SCHEMA.kmeans_result 00718 VOLATILE 00719 STRICT 00720 LANGUAGE plpgsql 00721 AS $$ 00722 DECLARE 00723 ret MADLIB_SCHEMA.kmeans_result; 00724 BEGIN 00725 ret = MADLIB_SCHEMA.kmeans( 00726 $1, $2, MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, $4), 00727 $4, 'MADLIB_SCHEMA.avg', 20, 0.001); 00728 RETURN ret; 00729 END 00730 $$; 00731 00732 CREATE FUNCTION MADLIB_SCHEMA.kmeanspp( 00733 rel_source VARCHAR, 00734 expr_point VARCHAR, 00735 k INTEGER 00736 ) RETURNS MADLIB_SCHEMA.kmeans_result 00737 VOLATILE 00738 STRICT 00739 LANGUAGE plpgsql 00740 AS $$ 00741 DECLARE 00742 ret MADLIB_SCHEMA.kmeans_result; 00743 BEGIN 00744 ret = MADLIB_SCHEMA.kmeans( 00745 $1, $2, 00746 MADLIB_SCHEMA.kmeanspp_seeding($1, $2, $3, 00747 'MADLIB_SCHEMA.squared_dist_norm2'), 00748 'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001); 00749 RETURN ret; 00750 END 00751 $$; 00752 00753 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_random_seeding_args( 00754 sql VARCHAR, INTEGER, DOUBLE PRECISION[][] 00755 ) RETURNS VOID 00756 VOLATILE 00757 CALLED ON NULL INPUT 00758 LANGUAGE c 00759 AS 'MODULE_PATHNAME', 'exec_sql_using'; 00760 00761 CREATE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans_random_seeding( 00762 rel_args VARCHAR, 00763 rel_state VARCHAR, 00764 rel_source VARCHAR, 00765 expr_point VARCHAR) 00766 RETURNS INTEGER 00767 AS $$PythonFunction(kmeans, kmeans, compute_kmeans_random_seeding)$$ 00768 LANGUAGE plpythonu VOLATILE; 00769 00770 /** 00771 * @brief k-Means Random Seeding 00772 * 00773 * @param rel_source Name of the relation containing input points 00774 * @param expr_point Expression evaluating to point coordinates for each tuple 00775 * @param k Number of centroids 00776 * @param initial_centroids A matrix containing up to \f$ k \f$ columns as 00777 * columns. kmeanspp_seeding() proceeds exactly as if these centroids had 00778 * already been generated in previous iterations. This parameter may be 00779 * NULL in which all \f$ k \f$ centroids will be generated. 00780 * @returns A matrix containing \f$ k \f$ centroids as columns 00781 */ 00782 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random_seeding( 00783 rel_source VARCHAR, 00784 expr_point VARCHAR, 00785 k INTEGER, 00786 initial_centroids DOUBLE PRECISION[][] /*+ DEFAULT NULL */ 00787 ) RETURNS DOUBLE PRECISION[][] AS $$ 00788 DECLARE 00789 theIteration INTEGER; 00790 theResult DOUBLE PRECISION[][]; 00791 oldClientMinMessages VARCHAR; 00792 class_rel_source REGCLASS; 00793 proc_fn_dist REGPROCEDURE; 00794 num_points INTEGER; 00795 num_centroids INTEGER; 00796 rel_filtered VARCHAR; 00797 BEGIN 00798 rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point); 00799 class_rel_source := rel_filtered; 00800 00801 IF (initial_centroids IS NOT NULL) THEN 00802 num_centroids := array_upper(initial_centroids,1); 00803 ELSE 00804 num_centroids := k; 00805 END IF; 00806 00807 IF (k <= 0) THEN 00808 RAISE EXCEPTION 'Number of clusters k must be a positive integer.'; 00809 END IF; 00810 IF (k > 32767) THEN 00811 RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).'; 00812 END IF; 00813 EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points; 00814 IF (num_points < k OR num_points < num_centroids) THEN 00815 RAISE EXCEPTION 'Number of centroids is greater than number of points.'; 00816 END IF; 00817 IF (k < num_centroids) THEN 00818 RAISE WARNING 'Number of clusters k is less than number of supplied initial centroids. Number of final clusters will equal number of supplied initial centroids.'; 00819 END IF; 00820 00821 -- We first setup the argument table. Rationale: We want to avoid all data 00822 -- conversion between native types and Python code. Instead, we use Python 00823 -- as a pure driver layer. 00824 oldClientMinMessages := 00825 (SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); 00826 EXECUTE 'SET client_min_messages TO warning'; 00827 PERFORM MADLIB_SCHEMA.create_schema_pg_temp(); 00828 -- Unfortunately, the EXECUTE USING syntax is only available starting 00829 -- PostgreSQL 8.4: 00830 -- http://www.postgresql.org/docs/8.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN 00831 -- We therefore have to emulate. 00832 PERFORM MADLIB_SCHEMA.internal_execute_using_kmeans_random_seeding_args($sql$ 00833 DROP TABLE IF EXISTS pg_temp._madlib_kmeans_random_args; 00834 CREATE TEMPORARY TABLE _madlib_kmeans_random_args AS 00835 SELECT $1 AS k, $2 AS initial_centroids; 00836 $sql$, 00837 k, initial_centroids); 00838 EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages; 00839 00840 -- Perform acutal computation. 00841 -- Unfortunately, Greenplum and PostgreSQL <= 8.2 do not have conversion 00842 -- operators from regclass to varchar/text. 00843 theIteration := ( 00844 SELECT MADLIB_SCHEMA.internal_compute_kmeans_random_seeding( 00845 '_madlib_kmeans_random_args', '_madlib_kmeans_random_state', 00846 textin(regclassout(class_rel_source)), expr_point) 00847 ); 00848 00849 -- Retrieve result from state table and return it 00850 EXECUTE 00851 $sql$ 00852 SELECT _state FROM _madlib_kmeans_random_state 00853 WHERE _iteration = $sql$ || theIteration || $sql$ 00854 $sql$ 00855 INTO theResult; 00856 RETURN theResult; 00857 END; 00858 $$ LANGUAGE plpgsql VOLATILE; 00859 00860 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random_seeding( 00861 rel_source VARCHAR, 00862 expr_point VARCHAR, 00863 k INTEGER 00864 ) RETURNS DOUBLE PRECISION[][] 00865 LANGUAGE sql AS $$ 00866 SELECT MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3, NULL) 00867 $$; 00868 00869 /** 00870 * @brief Run k-Means with random seeding. 00871 * 00872 * This is a shortcut for running k-means with random seeding. It is equivalent 00873 * to 00874 * <pre>SELECT \ref kmeans( 00875 rel_source, 00876 expr_point, 00877 \ref kmeans_random_seeding( 00878 rel_source, 00879 expr_point, 00880 k 00881 ), 00882 fn_dist, 00883 agg_centroid, 00884 max_num_iterations, 00885 min_frac_reassigned 00886 )</pre> 00887 */ 00888 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random( 00889 rel_source VARCHAR, 00890 expr_point VARCHAR, 00891 k INTEGER, 00892 fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */, 00893 agg_centroid VARCHAR /*+ DEFAULT 'avg' */, 00894 max_num_iterations INTEGER /*+ DEFAULT 20 */, 00895 min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */ 00896 ) RETURNS MADLIB_SCHEMA.kmeans_result 00897 VOLATILE 00898 STRICT 00899 LANGUAGE plpgsql 00900 AS $$ 00901 DECLARE 00902 ret MADLIB_SCHEMA.kmeans_result; 00903 BEGIN 00904 ret = MADLIB_SCHEMA.kmeans( 00905 $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3), 00906 $4, $5, $6, $7); 00907 RETURN ret; 00908 END 00909 $$; 00910 00911 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random( 00912 rel_source VARCHAR, 00913 expr_point VARCHAR, 00914 k INTEGER, 00915 fn_dist VARCHAR, 00916 agg_centroid VARCHAR, 00917 max_num_iterations INTEGER 00918 ) RETURNS MADLIB_SCHEMA.kmeans_result 00919 VOLATILE 00920 STRICT 00921 LANGUAGE plpgsql 00922 AS $$ 00923 DECLARE 00924 ret MADLIB_SCHEMA.kmeans_result; 00925 BEGIN 00926 ret = MADLIB_SCHEMA.kmeans( 00927 $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3), 00928 $4, $5, $6, 0.001); 00929 RETURN ret; 00930 END 00931 $$; 00932 00933 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random( 00934 rel_source VARCHAR, 00935 expr_point VARCHAR, 00936 k INTEGER, 00937 fn_dist VARCHAR, 00938 agg_centroid VARCHAR 00939 ) RETURNS MADLIB_SCHEMA.kmeans_result 00940 VOLATILE 00941 STRICT 00942 LANGUAGE plpgsql 00943 AS $$ 00944 DECLARE 00945 ret MADLIB_SCHEMA.kmeans_result; 00946 BEGIN 00947 ret = MADLIB_SCHEMA.kmeans( 00948 $1, $2, MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3), 00949 $4, $5, 20, 0.001); 00950 RETURN ret; 00951 END 00952 $$; 00953 00954 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random( 00955 rel_source VARCHAR, 00956 expr_point VARCHAR, 00957 k INTEGER, 00958 fn_dist VARCHAR 00959 ) RETURNS MADLIB_SCHEMA.kmeans_result 00960 VOLATILE 00961 STRICT 00962 LANGUAGE plpgsql 00963 AS $$ 00964 DECLARE 00965 ret MADLIB_SCHEMA.kmeans_result; 00966 BEGIN 00967 ret = MADLIB_SCHEMA.kmeans( 00968 $1, $2, 00969 MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3), 00970 $4, 'MADLIB_SCHEMA.avg', 20, 0.001); 00971 RETURN ret; 00972 END 00973 $$; 00974 00975 CREATE FUNCTION MADLIB_SCHEMA.kmeans_random( 00976 rel_source VARCHAR, 00977 expr_point VARCHAR, 00978 k INTEGER 00979 ) RETURNS MADLIB_SCHEMA.kmeans_result 00980 VOLATILE 00981 STRICT 00982 LANGUAGE plpgsql 00983 AS $$ 00984 DECLARE 00985 ret MADLIB_SCHEMA.kmeans_result; 00986 BEGIN 00987 ret = MADLIB_SCHEMA.kmeans( 00988 $1, $2, 00989 MADLIB_SCHEMA.kmeans_random_seeding($1, $2, $3), 00990 'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001); 00991 RETURN ret; 00992 END 00993 $$; 00994 00995 /** 00996 * @internal 00997 * @brief Execute a SQL command where $1, ..., $6 are substituted with the 00998 * given arguments. 00999 */ 01000 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmeans_args( 01001 sql VARCHAR, rel_source VARCHAR, expr_point VARCHAR, 01002 fn_dist VARCHAR, agg_centroid VARCHAR, max_num_iterations INTEGER, 01003 min_frac_reassigned DOUBLE PRECISION 01004 ) RETURNS MADLIB_SCHEMA.kmeans_result 01005 VOLATILE 01006 CALLED ON NULL INPUT 01007 LANGUAGE c 01008 AS 'MODULE_PATHNAME', 'exec_sql_using'; 01009 01010 /** 01011 * @internal 01012 * @brief Filter out the invalid data points in the original input relation 01013 */ 01014 CREATE FUNCTION MADLIB_SCHEMA.__filter_input_relation( 01015 rel_source VARCHAR, expr_point VARCHAR) 01016 RETURNS VARCHAR 01017 AS $$ 01018 DECLARE 01019 oldClientMinMessages VARCHAR; 01020 rel_source_filtered VARCHAR; 01021 BEGIN 01022 IF (SELECT position('.' in rel_source)) > 0 THEN 01023 rel_source_filtered := '_madlib_' || split_part(rel_source, '.', 2) || '_filtered'; 01024 ELSE 01025 rel_source_filtered := '_madlib_' || rel_source || '_filtered'; 01026 END IF; 01027 01028 oldClientMinMessages := 01029 (SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); 01030 EXECUTE 'SET client_min_messages TO warning'; 01031 EXECUTE 'DROP VIEW IF EXISTS _madlib_'||rel_source_filtered||'_filtered'; 01032 EXECUTE 'DROP VIEW IF EXISTS '||rel_source_filtered; 01033 EXECUTE 'CREATE TEMP VIEW '||rel_source_filtered||' 01034 AS SELECT * FROM '||rel_source||' 01035 WHERE abs( 01036 coalesce( 01037 MADLIB_SCHEMA.svec_elsum('||expr_point||'), 01038 ''Infinity''::FLOAT8 01039 ) 01040 ) < ''Infinity''::FLOAT8'; 01041 EXECUTE 'SET client_min_messages TO ' || oldClientMinMessages; 01042 RETURN rel_source_filtered; 01043 EXCEPTION 01044 WHEN undefined_function THEN 01045 RAISE EXCEPTION 'Point coordinates (%) are not a valid type 01046 (SVEC, FLOAT[], or INTEGER[]).', expr_point; 01047 END 01048 $$ 01049 VOLATILE STRICT 01050 LANGUAGE PLPGSQL; 01051 01052 /** 01053 * @brief Perform Lloyd's k-means local-search heuristic, but with initial 01054 * centroids stored in a table 01055 * 01056 * This is a shortcut for running k-means with initial centroids stored in a 01057 * table (as opposed to an array of centroids). It is equivalent 01058 * to 01059 * <pre>SELECT \ref kmeans( 01060 rel_source, 01061 expr_point, 01062 (SELECT \ref matrix_agg($expr_centroid) FROM $rel_initial_centroids), 01063 fn_dist, 01064 agg_centroid, 01065 max_num_iterations, 01066 min_frac_reassigned 01067 )</pre> 01068 * where <tt>$expr_centroid</tt> and <tt>$rel_initial_centroids</tt> denote 01069 * textual substituions. 01070 */ 01071 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 01072 rel_source VARCHAR, 01073 expr_point VARCHAR, 01074 rel_initial_centroids VARCHAR, 01075 expr_centroid VARCHAR, 01076 fn_dist VARCHAR /*+ DEFAULT 'squared_dist_norm2' */, 01077 agg_centroid VARCHAR /*+ DEFAULT 'avg' */, 01078 max_num_iterations INTEGER /*+ DEFAULT 20 */, 01079 min_frac_reassigned DOUBLE PRECISION /*+ DEFAULT 0.001 */ 01080 ) RETURNS MADLIB_SCHEMA.kmeans_result 01081 VOLATILE 01082 STRICT 01083 LANGUAGE plpgsql 01084 AS $$ 01085 DECLARE 01086 class_rel_initial_centroids REGCLASS; 01087 theResult MADLIB_SCHEMA.kmeans_result; 01088 BEGIN 01089 class_rel_initial_centroids := rel_initial_centroids; 01090 SELECT * FROM MADLIB_SCHEMA.internal_execute_using_kmeans_args($sql$ 01091 SELECT MADLIB_SCHEMA.kmeans( 01092 $1, $2, 01093 ( 01094 SELECT MADLIB_SCHEMA.matrix_agg(($sql$ || expr_centroid || $sql$)::FLOAT8[]) 01095 FROM $sql$ || textin(regclassout(class_rel_initial_centroids)) 01096 || $sql$ 01097 ), 01098 $3, $4, $5, $6) 01099 $sql$, 01100 rel_source, expr_point, 01101 fn_dist, agg_centroid, max_num_iterations, min_frac_reassigned) 01102 INTO theResult; 01103 RETURN theResult; 01104 END; 01105 $$; 01106 01107 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 01108 rel_source VARCHAR, 01109 expr_point VARCHAR, 01110 rel_initial_centroids VARCHAR, 01111 expr_centroid VARCHAR, 01112 fn_dist VARCHAR, 01113 agg_centroid VARCHAR, 01114 max_num_iterations INTEGER 01115 ) RETURNS MADLIB_SCHEMA.kmeans_result 01116 VOLATILE 01117 STRICT 01118 LANGUAGE sql AS $$ 01119 SELECT MADLIB_SCHEMA.kmeans( 01120 $1, $2, 01121 $3, $4, $5, $6, $7, 0.001) 01122 $$; 01123 01124 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 01125 rel_source VARCHAR, 01126 expr_point VARCHAR, 01127 rel_initial_centroids VARCHAR, 01128 expr_centroid VARCHAR, 01129 fn_dist VARCHAR, 01130 agg_centroid VARCHAR 01131 ) RETURNS MADLIB_SCHEMA.kmeans_result 01132 VOLATILE 01133 STRICT 01134 LANGUAGE sql AS $$ 01135 SELECT MADLIB_SCHEMA.kmeans( 01136 $1, $2, 01137 $3, $4, $5, $6, 20, 0.001) 01138 $$; 01139 01140 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 01141 rel_source VARCHAR, 01142 expr_point VARCHAR, 01143 rel_initial_centroids VARCHAR, 01144 expr_centroid VARCHAR, 01145 fn_dist VARCHAR 01146 ) RETURNS MADLIB_SCHEMA.kmeans_result 01147 VOLATILE 01148 STRICT 01149 LANGUAGE sql AS $$ 01150 SELECT MADLIB_SCHEMA.kmeans( 01151 $1, $2, 01152 $3, $4, $5, 'MADLIB_SCHEMA.avg', 20, 0.001) 01153 $$; 01154 01155 CREATE FUNCTION MADLIB_SCHEMA.kmeans( 01156 rel_source VARCHAR, 01157 expr_point VARCHAR, 01158 rel_initial_centroids VARCHAR, 01159 expr_centroid VARCHAR 01160 ) RETURNS MADLIB_SCHEMA.kmeans_result 01161 VOLATILE 01162 STRICT 01163 LANGUAGE sql AS $$ 01164 SELECT MADLIB_SCHEMA.kmeans( 01165 $1, $2, 01166 $3, $4, 01167 'MADLIB_SCHEMA.squared_dist_norm2', 'MADLIB_SCHEMA.avg', 20, 0.001) 01168 $$; 01169 01170 01171 /** 01172 * @internal 01173 * @brief Execute a SQL command where $1, ..., $3 are substituted with the 01174 * given arguments. 01175 */ 01176 CREATE FUNCTION MADLIB_SCHEMA.internal_execute_using_silhouette_args( 01177 sql VARCHAR, centroids DOUBLE PRECISION[][], fn_dist REGPROC 01178 ) RETURNS DOUBLE PRECISION 01179 STABLE 01180 CALLED ON NULL INPUT 01181 LANGUAGE c 01182 AS 'MODULE_PATHNAME', 'exec_sql_using'; 01183 01184 /** 01185 * @brief Compute a simplified version of the silhouette coefficient 01186 * 01187 * @param rel_source Name of the relation containing input points 01188 * @param expr_point Expression evaluating to point coordinates \f$ x_i \f$ for 01189 * each tuple 01190 * @param centroids Matrix \f$ M = (\vec{m_0} \dots \vec{m_{k-1}}) 01191 * \in \mathbb{R}^{d \times k} \f$ with \f$ k \f$ columns, where column 01192 * \f$ i \f$ contains the position of centroid \f$ i \f$. 01193 * @param fn_dist Name of a function with signature 01194 * <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that 01195 * returns the distance between two points 01196 * @return For each point \f$ x_i \f$, let 01197 * \f$ d_1( x_i ) \f$ and \f$ d_2( x_i ) \f$ be the distance to the closest 01198 * and 2nd-closest centroid, respectively. If there is more than one 01199 * closest centroids then \f$ d_1( x_i ) = d_2( x_i )\f$. 01200 * The return value is the average, over all points \f$ x_i \f$, of 01201 * \f[ 01202 * \frac{d_2( x_i ) - d_1(x_i)}{d_2(x_i)}, 01203 * \f] 01204 * where 0/0 is interpreted as 0. 01205 * Clearly, the simplified silhouette coefficient assumes values in 01206 * \f$ [0,1] \f$. 01207 */ 01208 CREATE FUNCTION MADLIB_SCHEMA.simple_silhouette( 01209 rel_source VARCHAR, 01210 expr_point VARCHAR, 01211 centroids DOUBLE PRECISION[][], 01212 fn_dist VARCHAR /*+ DEFAULT 'dist_norm2' */ 01213 ) RETURNS DOUBLE PRECISION 01214 STABLE 01215 STRICT 01216 LANGUAGE plpgsql 01217 AS $$ 01218 DECLARE 01219 class_rel_source REGCLASS; 01220 proc_fn_dist REGPROCEDURE; 01221 rel_filtered VARCHAR; 01222 BEGIN 01223 IF (array_upper(centroids,1) IS NULL) THEN 01224 RAISE EXCEPTION 'No valid centroids given.'; 01225 END IF; 01226 01227 rel_filtered = MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point); 01228 class_rel_source := rel_filtered; 01229 proc_fn_dist := fn_dist 01230 || '(DOUBLE PRECISION[], DOUBLE PRECISION[])'; 01231 IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE 01232 FROM pg_proc WHERE oid = proc_fn_dist) THEN 01233 RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.'; 01234 END IF; 01235 01236 RETURN MADLIB_SCHEMA.internal_execute_using_silhouette_args($sql$ 01237 SELECT 01238 avg(CASE 01239 WHEN distances[2] = 0 THEN 0 01240 ELSE (distances[2] - distances[1]) / distances[2] 01241 END) 01242 FROM ( 01243 SELECT 01244 (MADLIB_SCHEMA.closest_columns( 01245 $1, 01246 ($sql$ || expr_point || $sql$)::FLOAT8[], 01247 2::INTEGER, 01248 $2 01249 )).distances 01250 FROM 01251 $sql$ || textin(regclassout(class_rel_source)) || $sql$ 01252 ) AS two_shortest_distances 01253 $sql$, 01254 centroids, proc_fn_dist); 01255 END; 01256 $$; 01257 01258 CREATE FUNCTION MADLIB_SCHEMA.simple_silhouette( 01259 rel_source VARCHAR, 01260 expr_point VARCHAR, 01261 centroids DOUBLE PRECISION[][] 01262 ) RETURNS DOUBLE PRECISION 01263 STABLE 01264 STRICT 01265 LANGUAGE sql 01266 AS $$ 01267 SELECT MADLIB_SCHEMA.simple_silhouette($1, $2, $3, 01268 'MADLIB_SCHEMA.dist_norm2') 01269 $$;