User Documentation
online_sv.sql_in
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file online_sv.sql_in
4  *
5  * @brief SQL functions for support vector machines
6  * @sa For an introduction to Support vector machines (SVMs) and related kernel
7  * methods, see the module description \ref grp_kernmach.
8  *
9  *//* ------------------------------------------------------------------------*/
10 
11 m4_include(`SQLCommon.m4')
12 
13 /**
14 @addtogroup grp_kernmach
15 
16 \warning <em> This MADlib method is still in early stage development. There may be some
17 issues that will be addressed in a future version. The interface and implementation
18 are subject to change. </em>
19 
20 @about
21 
22 Support vector machines (SVMs) and related kernel methods have been one of
23 the most popular and well-studied machine learning techniques of the
24 past 15 years, with an amazing number of innovations and applications.
25 
26 In a nutshell, an SVM model \f$ f(\boldsymbol x) \f$ takes the form
27 \f[
28  f(x) = \sum_i \alpha_i k(x_i,x),
29 \f]
30 where each \f$ \alpha_i \f$ is a real number, each \f$ \boldsymbol x_i \f$ is a
31 data point from the training set (called a support vector), and
32 \f$ k(\cdot, \cdot) \f$ is a kernel function that measures how "similar" two
33 objects are. In regression, \f$ f(\boldsymbol x) \f$ is the regression function
34 we seek. In classification, \f$ f(\boldsymbol x) \f$ serves as
35 the decision boundary; so for example in binary classification, the predictor
36 can output class 1 for object \f$ \boldsymbol x \f$ if \f$ f(\boldsymbol x) \geq 0 \f$, and class
37 2 otherwise.
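As a concrete illustration of this decision rule (a plain Python sketch, not part of the MADlib implementation), the model is just a weighted sum of kernel evaluations against the stored support vectors:

```python
# Illustrative sketch only: evaluate f(x) = sum_i alpha_i * k(x_i, x)
# and use its sign as a binary decision rule, with a dot-product kernel.

def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def svm_eval(alphas, support_vectors, x, kernel=dot):
    """Compute f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

def classify(alphas, support_vectors, x, kernel=dot):
    """Return class 1 if f(x) >= 0, class 2 otherwise."""
    return 1 if svm_eval(alphas, support_vectors, x, kernel) >= 0 else 2

# Two support vectors with opposite weights
alphas = [1.0, -1.0]
svs = [[1.0, 0.0], [0.0, 1.0]]
print(classify(alphas, svs, [2.0, 0.0]))  # f = 2 - 0 >= 0 -> class 1
print(classify(alphas, svs, [0.0, 2.0]))  # f = 0 - 2 < 0  -> class 2
```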
38 
39 In the case when the kernel function \f$ k(\cdot, \cdot) \f$ is the standard
40 inner product on vectors, \f$ f(\boldsymbol x) \f$ is just an alternative way of
41 writing a linear function
42 \f[
43  f'(\boldsymbol x) = \langle \boldsymbol w, \boldsymbol x \rangle,
44 \f]
45 where \f$ \boldsymbol w \f$ is a weight vector having the same dimension as
46 \f$ \boldsymbol x \f$. One of the key points of SVMs is that we can use
47 fancier kernel functions to efficiently learn linear models in high-dimensional
48 feature spaces, since \f$ k(\boldsymbol x_i, \boldsymbol x_j) \f$ can be
49 understood as an efficient way of computing an inner product in the feature
50 space:
51 \f[
52  k(\boldsymbol x_i, \boldsymbol x_j)
53  = \langle \phi(\boldsymbol x_i), \phi(\boldsymbol x_j) \rangle,
54 \f]
55 where \f$ \phi(\boldsymbol x) \f$ projects \f$ \boldsymbol x \f$ into a
56 (possibly infinite-dimensional) feature space.
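This identity can be checked numerically. The following Python sketch (illustrative only, not part of MADlib) verifies that the degree-2 polynomial kernel \f$ (\boldsymbol x \cdot \boldsymbol y)^2 \f$ equals an ordinary inner product after the explicit feature map \f$ \phi(\boldsymbol x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2) \f$ for 2-dimensional inputs:

```python
# Illustrative sketch only: the degree-2 polynomial kernel computes an
# inner product in a 3-d feature space without ever building phi(x).
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-d input."""
    return [x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]]

x, y = [1.0, 2.0], [3.0, 4.0]
lhs = poly2_kernel(x, y)      # (1*3 + 2*4)^2 = 121
rhs = dot(phi(x), phi(y))     # inner product in feature space
print(lhs, rhs)               # both 121 (up to floating-point rounding)
```

For an infinite-dimensional feature space (as induced by the Gaussian kernel), only the kernel form remains computable, which is the point of the trick.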
57 
58 There are many algorithms for learning kernel machines. This module
59 implements the class of online learning with kernels algorithms
60 described in Kivinen et al. [1]. It also includes the Stochastic
61 Gradient Descent (SGD) method [3] for learning linear SVMs with the Hinge
62 loss \f$l(z) = \max(0, 1-z)\f$. See also the book Scholkopf and Smola [2] for much more
63 details.
64 
65 The SGD implementation is based on L&eacute;on Bottou's SGD package
66 (http://leon.bottou.org/projects/sgd). The methods introduced in [1]
67 are implemented according to their original descriptions, except that
68 we only update the support vector model when we make a significant
69 error. The original algorithms in [1] update the support vector model at
70 every step, even when no error was made, in the name of
71 regularisation. In practice, and as verified empirically to a certain
72 degree, updating only when necessary is both faster and, at least in
73 the i.i.d. setting, better from a learning-theoretic point of
74 view.
75 
76 Methods for classification, regression and novelty detection are
77 available. Multiple instances of the algorithms can be executed
78 in parallel on different subsets of the training data. The resultant
79 support vector models can then be combined using standard techniques
80 like averaging or majority voting.
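A minimal Python sketch of such model combination (illustrative only, not part of MADlib) might look like:

```python
# Illustrative sketch only: combine the outputs of several independently
# trained models by averaging (regression) or majority voting
# (classification).

def average_prediction(predictions):
    """Combine regression outputs by averaging."""
    return sum(predictions) / len(predictions)

def majority_vote(labels):
    """Combine class labels by majority vote (ties go to the first seen)."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return max(counts, key=counts.get)

print(average_prediction([48.0, 52.0, 50.0]))  # 50.0
print(majority_vote([1, -1, 1]))               # 1
```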
81 
82 Training data points are accessed via a table or a view. The support
83 vector models can also be stored in tables for fast execution.
84 
85 @input
86 For classification and regression, the training table/view is expected to be of the following form (the array size of <em>ind</em> must not exceed 102,400):\n
87 <pre>{TABLE|VIEW} <em>input_table</em> (
88  ...
89  <em>id</em> INT,
90  <em>ind</em> FLOAT8[],
91  <em>label</em> FLOAT8,
92  ...
93 )</pre>
94 For novelty detection, the label field is not required.
95 
96 @usage
97 
98 - Regression learning is achieved through the following function:
99  <pre>SELECT \ref svm_regression(
100  '<em>input_table</em>', '<em>model_table</em>', <em>parallel</em>, '<em>kernel_func</em>',
101  <em>verbose DEFAULT false</em>, <em>eta DEFAULT 0.1</em>, <em>nu DEFAULT 0.005</em>, <em>slambda DEFAULT 0.05</em>
102  );</pre>
103 
104 - Classification learning is achieved through the following two
105  functions:
106  -# Learn linear SVM(s) using SGD [3]:
107  <pre>SELECT \ref lsvm_classification(
108  '<em>input_table</em>', '<em>model_table</em>', <em>parallel</em>,
109  <em>verbose DEFAULT false</em>, <em>eta DEFAULT 0.1</em>, <em>reg DEFAULT 0.001</em>
110  );</pre>
111  -# Learn linear or non-linear SVM(s) using the method described in [1]:
112  <pre>SELECT \ref svm_classification(
113  '<em>input_table</em>', '<em>model_table</em>', <em>parallel</em>, '<em>kernel_func</em>',
114  <em>verbose DEFAULT false</em>, <em>eta DEFAULT 0.1</em>, <em>nu DEFAULT 0.005</em>
115  );</pre>
116 
117 - Novelty detection is achieved through the following function:
118  <pre>SELECT \ref svm_novelty_detection(
119  '<em>input_table</em>', '<em>model_table</em>', <em>parallel</em>, '<em>kernel_func</em>',
120  <em>verbose DEFAULT false</em>, <em>eta DEFAULT 0.1</em>, <em>nu DEFAULT 0.005</em>
121  );</pre>
122  Assuming the model_table parameter is 'model', each learning function will produce two tables
123  as output: 'model' and 'model_param'.
124  The first contains the support vectors of the model(s) learned.
125  The second contains the parameters of the model(s) learned, which includes information like the kernel function
126  used and the value of the intercept, if there is one.
127 
128 - To make predictions on a single data point x using a single model
129  learned previously, we use the function
130  <pre>SELECT \ref
131  svm_predict('<em>model_table</em>',<em>x</em>);</pre>
132  If the model is produced by the lsvm_classification() function, use
133  the following prediction function instead:
134  <pre>SELECT \ref
135  lsvm_predict('<em>model_table</em>',<em>x</em>);</pre>
136 
137 - To make predictions on new data points using multiple models
138  learned in parallel, we use the function
139  <pre>SELECT \ref
140  svm_predict_combo('<em>model_table</em>',<em>x</em>);</pre>
141  If the models are produced by the lsvm_classification() function, use
142  the following prediction function instead:
143  <pre>SELECT \ref
144  lsvm_predict_combo('<em>model_table</em>',<em>x</em>);</pre>
145 
146 
147 - Note that, at the moment, we cannot use MADLIB_SCHEMA.svm_predict() and MADLIB_SCHEMA.svm_predict_combo()
148  on multiple data points. For example, something like the following will fail:
149  <pre>SELECT \ref svm_predict('<em>model_table</em>',<em>x</em>) FROM data_table;</pre>
150  Instead, to make predictions on new data points stored in a table using
151  previously learned models, we use the function:
152  <pre>SELECT \ref svm_predict_batch('<em>input_table</em>', '<em>data_col</em>', '<em>id_col</em>', '<em>model_table</em>', '<em>output_table</em>', <em>parallel</em>);</pre>
153  The output_table is created during the function call; an existing table with
154  the same name will be dropped.
155  If the parallel parameter is true, then each data point in the input table will have multiple
156  predicted values corresponding to the number of models learned in
157  parallel.\n\n
158  Similarly, use the following function for batch prediction if the
159  model(s) was produced by the lsvm_classification() function:
160  <pre>SELECT \ref lsvm_predict_batch('<em>input_table</em>', '<em>data_col</em>', '<em>id_col</em>', '<em>model_table</em>','<em>output_table</em>', <em>parallel</em>);</pre>
161 
162 
163 
164 @implementation
165 
166 Currently, three kernel functions have been implemented: the dot product (\ref svm_dot), polynomial (\ref svm_polynomial), and Gaussian (\ref svm_gaussian) kernels. The <tt>kernel_func</tt> argument accepts any function that takes two float[] arguments and returns a float. To use the dot product kernel,
167 simply pass '<tt><em>MADLIB_SCHEMA.svm_dot</em></tt>' as the <tt>kernel_func</tt> argument. To use the polynomial or Gaussian kernels,
168 a wrapper function is needed, since these kernels require additional input parameters (see online_sv.sql_in for their parameters).
169 
170 For example, to use the polynomial kernel with degree 2, first create a wrapper function:
171 <pre>CREATE OR REPLACE FUNCTION mykernel(FLOAT[],FLOAT[]) RETURNS FLOAT AS $$
172  SELECT \ref svm_polynomial($1,$2,2)
173 $$ language sql;</pre>
174 Then call the SVM learning functions with <tt>mykernel</tt> as the argument to <tt>kernel_func</tt>.
175 <pre>SELECT \ref svm_regression('my_schema.my_train_data', 'mymodel', false, 'mykernel');</pre>
176 
177 To drop all tables pertaining to the model, we can use
178 <pre>SELECT \ref svm_drop_model('model_table');</pre>
179 
180 @examp
181 
182 As a general first step, we need to prepare and populate an input
183 table/view with the following structure:
184 \code
185 TABLE/VIEW my_schema.my_input_table
186 (
187  id INT, -- point ID
188  ind FLOAT8[], -- data point
189  label FLOAT8 -- label of data point
190 );
191 \endcode
192  Note: The label field is not required for novelty detection.
193 
194 
195 <strong>Example usage for regression</strong>:
196  -# We can randomly generate 1000 5-dimensional data points labelled by the simple target function
197 \code
198 t(x) = if x[5] = 10 then 50 else if x[5] = -10 then 50 else 0;
199 \endcode
200 and store that in the my_schema.my_train_data table as follows:
201 \code
202 sql> select MADLIB_SCHEMA.svm_generate_reg_data('my_schema.my_train_data', 1000, 5);
203 \endcode
204  -# We can now learn a regression model and store the resultant model
205  under the name 'myexp'.
206 \code
207 sql> select MADLIB_SCHEMA.svm_regression('my_schema.my_train_data', 'myexp', false, 'MADLIB_SCHEMA.svm_dot');
208 \endcode
209  -# We can now start using it to predict the labels of new data points
210  as follows:
211 \code
212 sql> select MADLIB_SCHEMA.svm_predict('myexp', '{1,2,4,20,10}');
213 sql> select MADLIB_SCHEMA.svm_predict('myexp', '{1,2,4,20,-10}');
214 \endcode
215  -# To learn multiple support vector models, we replace the learning step above by
216 \code
217 sql> select MADLIB_SCHEMA.svm_regression('my_schema.my_train_data', 'myexp', true, 'MADLIB_SCHEMA.svm_dot');
218 \endcode
219 The resultant models can be used for prediction as follows:
220 \code
221 sql> select * from MADLIB_SCHEMA.svm_predict_combo('myexp', '{1,2,4,20,10}');
222 \endcode
223  -# We can also predict the labels of all the data points stored in a table.
224  For example, we can execute the following:
225 \code
226 sql> create table MADLIB_SCHEMA.svm_reg_test ( id int, ind float8[] );
227 sql> insert into MADLIB_SCHEMA.svm_reg_test (select id, ind from my_schema.my_train_data limit 20);
228 sql> select MADLIB_SCHEMA.svm_predict_batch('MADLIB_SCHEMA.svm_reg_test', 'ind', 'id', 'myexp', 'MADLIB_SCHEMA.svm_reg_output1', false);
229 sql> select * from MADLIB_SCHEMA.svm_reg_output1;
230 sql> select MADLIB_SCHEMA.svm_predict_batch('MADLIB_SCHEMA.svm_reg_test', 'ind', 'id', 'myexp', 'MADLIB_SCHEMA.svm_reg_output2', true);
231 sql> select * from MADLIB_SCHEMA.svm_reg_output2;
232 \endcode
233 
234 <strong>Example usage for classification:</strong>
235 -# We can randomly generate 2000 5-dimensional data points labelled by the simple
236 target function
237 \code
238 t(x) = if x[1] > 0 and x[2] < 0 then 1 else -1;
239 \endcode
240 and store that in the my_schema.my_train_data table as follows:
241 \code
242 sql> select MADLIB_SCHEMA.svm_generate_cls_data('my_schema.my_train_data', 2000, 5);
243 \endcode
244 -# We can now learn a classification model and store the resultant model
245 under the name 'myexpc'.
246 \code
247 sql> select MADLIB_SCHEMA.svm_classification('my_schema.my_train_data', 'myexpc', false, 'MADLIB_SCHEMA.svm_dot');
248 \endcode
249 -# We can now start using it to predict the labels of new data points
250 as follows:
251 \code
252 sql> select MADLIB_SCHEMA.svm_predict('myexpc', '{10,-2,4,20,10}');
253 \endcode
254 -# To learn multiple support vector models, replace the model-building and prediction steps above by
255 \code
256 sql> select MADLIB_SCHEMA.svm_classification('my_schema.my_train_data', 'myexpc', true, 'MADLIB_SCHEMA.svm_dot');
257 sql> select * from MADLIB_SCHEMA.svm_predict_combo('myexpc', '{10,-2,4,20,10}');
258 \endcode
259 -# To learn a linear support vector model using SGD, replace the model-building and prediction steps above by
260 \code
261 sql> select MADLIB_SCHEMA.lsvm_classification('my_schema.my_train_data', 'myexpc', false);
262 sql> select MADLIB_SCHEMA.lsvm_predict('myexpc', '{10,-2,4,20,10}');
263 \endcode
264 -# To learn multiple linear support vector models using SGD, replace the model-building and prediction steps above by
265 \code
266 sql> select MADLIB_SCHEMA.lsvm_classification('my_schema.my_train_data', 'myexpc', true);
267 sql> select MADLIB_SCHEMA.lsvm_predict_combo('myexpc', '{10,-2,4,20,10}');
268 \endcode
269 
270 <strong>Example usage for novelty detection:</strong>
271 -# We can randomly generate 100 2-dimensional data points (the normal cases)
272 and store that in the my_schema.my_train_data table as follows:
273 \code
274 sql> select MADLIB_SCHEMA.svm_generate_nd_data('my_schema.my_train_data', 100, 2);
275 \endcode
276 -# Learning and predicting using a single novelty detection model can be done as follows:
277 \code
278 sql> select MADLIB_SCHEMA.svm_novelty_detection('my_schema.my_train_data', 'myexpnd', false, 'MADLIB_SCHEMA.svm_dot');
279 sql> select MADLIB_SCHEMA.svm_predict('myexpnd', '{10,-10}');
280 sql> select MADLIB_SCHEMA.svm_predict('myexpnd', '{-1,-1}');
281 \endcode
282 -# Learning and predicting using multiple models can be done as follows:
283 \code
284 sql> select MADLIB_SCHEMA.svm_novelty_detection('my_schema.my_train_data', 'myexpnd', true, 'MADLIB_SCHEMA.svm_dot');
285 sql> select * from MADLIB_SCHEMA.svm_predict_combo('myexpnd', '{10,-10}');
286 sql> select * from MADLIB_SCHEMA.svm_predict_combo('myexpnd', '{-1,-1}');
287 \endcode
288 
289 
290 @literature
291 
292 [1] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson: <em>Online
293  Learning with Kernels</em>, IEEE Transactions on Signal Processing, 52(8),
294  2165-2176, 2004.
295 
296 [2] Bernhard Sch&ouml;lkopf and Alexander J. Smola: <em>Learning with Kernels:
297  Support Vector Machines, Regularization, Optimization, and Beyond</em>,
298  MIT Press, 2002.
299 
300 [3] L&eacute;on Bottou: <em>Large-Scale Machine Learning with Stochastic
301 Gradient Descent</em>, Proceedings of the 19th International
302 Conference on Computational Statistics, Springer, 2010.
303 
304 @sa File online_sv.sql_in documenting the SQL functions.
305 
306 @internal
307 @sa namespace online_sv (documenting the implementation in Python)
308 @endinternal
309 
310 */
311 
312 
313 
314 -- The following is the structure to record the results of a learning process.
315 -- We work with arrays of float8 for now; we'll extend the code to work with sparse vectors next.
316 --
317 CREATE TYPE MADLIB_SCHEMA.svm_model_rec AS (
318  inds int, -- number of individuals processed
319  cum_err float8, -- cumulative error
320  epsilon float8, -- the size of the epsilon tube around the hyperplane, adaptively adjusted by algorithm
321  rho float8, -- classification margin
322  b float8, -- classifier offset
323  nsvs int, -- number of support vectors
324  ind_dim int, -- the dimension of the individuals
325  weights float8[], -- the weight of the support vectors
326  individuals float8[], -- the array of support vectors, represented as a 1-D array
327  kernel_oid oid -- OID of kernel function
328 );
329 
330 -- The following is the structure to record the results of the linear SVM sgd algorithm
331 --
332 CREATE TYPE MADLIB_SCHEMA.lsvm_sgd_model_rec AS (
333  weights float8[], -- the weight vector
334  wdiv float8, -- scaling factor for the weights
335  wbias float8, -- offset/bias of the linear model
336  ind_dim int, -- the dimension of the individuals
337  inds int, -- number of individuals processed
338  cum_err int -- cumulative error
339 );
340 
341 
342 -- The following is the return type of a regression learning process
343 --
344 CREATE TYPE MADLIB_SCHEMA.svm_reg_result AS (
345  model_table text, -- table where the model is stored
346  model_name text, -- model name
347  inds int, -- number of individuals processed
348  cum_err float8, -- cumulative error
349  epsilon float8, -- the size of the epsilon tube around the hyperplane, adaptively adjusted by algorithm
350  b float8, -- classifier offset
351  nsvs int -- number of support vectors
352 );
353 
354 -- The following is the return type of a classification learning process
355 --
356 CREATE TYPE MADLIB_SCHEMA.svm_cls_result AS (
357  model_table text, -- table where the model is stored
358  model_name text, -- model name
359  inds int, -- number of individuals processed
360  cum_err float8, -- cumulative error
361  rho float8, -- classification margin
362  b float8, -- classifier offset
363  nsvs int -- number of support vectors
364 );
365 
366 -- The following is the return type of a linear classifier learning process
367 --
368 CREATE TYPE MADLIB_SCHEMA.lsvm_sgd_result AS (
369  model_table text, -- table where the model is stored
370  model_name text, -- model name
371  inds int, -- number of individuals processed
372  ind_dim int, -- the dimension of the individuals
373  cum_err float8, -- cumulative error
374  wdiv float8, -- scaling factor for the weights
375  wbias float8 -- classifier offset
376 );
377 
378 -- The following is the return type of a novelty detection learning process
379 --
380 CREATE TYPE MADLIB_SCHEMA.svm_nd_result AS (
381  model_table text, -- table where the model is stored
382  model_name text, -- model name
383  inds int, -- number of individuals processed
384  rho float8, -- classification margin
385  nsvs int -- number of support vectors
386 );
387 
388 -- The type for representing support vectors
389 --
390 CREATE TYPE MADLIB_SCHEMA.svm_support_vector AS ( id text, weight float8, sv float8[] );
391 
392 
393 
394 -- Kernel functions are a generalisation of inner products.
395 -- They provide the means by which we can extend linear machines to work in non-linear transformed feature spaces.
396 -- Here are a few standard kernels: dot product, polynomial kernel, Gaussian kernel.
397 --
398 /**
399  * @brief Dot product kernel function
400  *
401  * @param x The data point \f$ \boldsymbol x \f$
402  * @param y The data point \f$ \boldsymbol y \f$
403  * @return Returns dot product of the two data points.
404  *
405  */
406 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_dot(x float8[], y float8[]) RETURNS float8
407 AS 'MODULE_PATHNAME', 'svm_dot' LANGUAGE C IMMUTABLE STRICT;
408 
409 /**
410  * @brief Polynomial kernel function
411  *
412  * @param x The data point \f$ \boldsymbol x \f$
413  * @param y The data point \f$ \boldsymbol y \f$
414  * @param degree The degree \f$ d \f$
415  * @return Returns \f$ K(\boldsymbol x,\boldsymbol y)=(\boldsymbol x \cdot \boldsymbol y)^d \f$
416  *
417  */
418 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_polynomial(x float8[], y float8[], degree float8) RETURNS float8
419 AS 'MODULE_PATHNAME', 'svm_polynomial' LANGUAGE C IMMUTABLE STRICT;
420 
421 /**
422  * @brief Gaussian kernel function
423  *
424  * @param x The data point \f$ \boldsymbol x \f$
425  * @param y The data point \f$ \boldsymbol y \f$
426  * @param gamma The spread \f$ \gamma \f$
427  * @return Returns \f$ K(\boldsymbol x,\boldsymbol y)=\exp(-\gamma \| \boldsymbol x - \boldsymbol y \|^2 ) \f$
428  *
429  */
430 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_gaussian(x float8[], y float8[], gamma float8) RETURNS float8
431 AS 'MODULE_PATHNAME', 'svm_gaussian' LANGUAGE C IMMUTABLE STRICT;
432 
433 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_predict_sub(int,int,float8[],float8[],float8[],text) RETURNS float8
434 AS 'MODULE_PATHNAME', 'svm_predict_sub' LANGUAGE C IMMUTABLE STRICT;
435 
436 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_predict(svs MADLIB_SCHEMA.svm_model_rec, ind float8[], kernel text)
437 RETURNS float8 AS $$
438  SELECT MADLIB_SCHEMA.svm_predict_sub($1.nsvs, $1.ind_dim, $1.weights, $1.individuals, $2, $3);
439 $$ LANGUAGE SQL;
440 
441 -- This is the main online support vector regression learning algorithm.
442 -- The function updates the support vector model as it processes each new training example.
443 -- This function is wrapped in an aggregate function to process all the training examples stored in a table.
444 --
445 CREATE OR REPLACE FUNCTION
446 MADLIB_SCHEMA.svm_reg_update(svs MADLIB_SCHEMA.svm_model_rec, ind FLOAT8[], label FLOAT8, kernel TEXT, eta FLOAT8, nu FLOAT8, slambda FLOAT8)
447 RETURNS MADLIB_SCHEMA.svm_model_rec AS 'MODULE_PATHNAME', 'svm_reg_update' LANGUAGE C STRICT;
448 
449 CREATE AGGREGATE MADLIB_SCHEMA.svm_reg_agg(float8[], float8, text, float8, float8, float8) (
450  sfunc = MADLIB_SCHEMA.svm_reg_update,
451  stype = MADLIB_SCHEMA.svm_model_rec,
452  initcond = '(0,0,0,0,0,0,0,{},{},0)'
453 );
454 
455 -- This is the main online support vector classification learning algorithm.
456 -- The function updates the support vector model as it processes each new training example.
457 -- This function is wrapped in an aggregate function to process all the training examples stored in a table.
458 --
459 CREATE OR REPLACE FUNCTION
460 MADLIB_SCHEMA.svm_cls_update(svs MADLIB_SCHEMA.svm_model_rec, ind FLOAT8[], label FLOAT8, kernel TEXT, eta FLOAT8, nu FLOAT8)
461 RETURNS MADLIB_SCHEMA.svm_model_rec AS 'MODULE_PATHNAME', 'svm_cls_update' LANGUAGE C STRICT;
462 
463 CREATE AGGREGATE MADLIB_SCHEMA.svm_cls_agg(float8[], float8, text, float8, float8) (
464  sfunc = MADLIB_SCHEMA.svm_cls_update,
465  stype = MADLIB_SCHEMA.svm_model_rec,
466  initcond = '(0,0,0,0,0,0,0,{},{},0)'
467 );
468 
469 -- This is the main online support vector novelty detection algorithm.
470 -- The function updates the support vector model as it processes each new training example.
471 -- In contrast to classification and regression, the training data points have no labels.
472 -- This function is wrapped in an aggregate function to process all the training examples stored in a table.
473 --
474 CREATE OR REPLACE FUNCTION
475 MADLIB_SCHEMA.svm_nd_update(svs MADLIB_SCHEMA.svm_model_rec, ind FLOAT8[], kernel TEXT, eta FLOAT8, nu FLOAT8)
476 RETURNS MADLIB_SCHEMA.svm_model_rec AS 'MODULE_PATHNAME', 'svm_nd_update' LANGUAGE C STRICT;
477 
478 CREATE AGGREGATE MADLIB_SCHEMA.svm_nd_agg(float8[], text, float8, float8) (
479  sfunc = MADLIB_SCHEMA.svm_nd_update,
480  stype = MADLIB_SCHEMA.svm_model_rec,
481  initcond = '(0,0,0,0,0,0,0,{},{},0)'
482 );
483 
484 -- This is the SGD algorithm for linear SVMs.
485 -- The function updates the support vector model as it processes each new training example.
486 -- This function is wrapped in an aggregate function to process all the training examples stored in a table.
487 --
488 CREATE OR REPLACE FUNCTION
489 MADLIB_SCHEMA.lsvm_sgd_update(svs MADLIB_SCHEMA.lsvm_sgd_model_rec, ind FLOAT8[], label FLOAT8, eta FLOAT8, reg FLOAT8)
490 RETURNS MADLIB_SCHEMA.lsvm_sgd_model_rec AS 'MODULE_PATHNAME', 'lsvm_sgd_update' LANGUAGE C STRICT;
491 
492 CREATE AGGREGATE MADLIB_SCHEMA.lsvm_sgd_agg(float8[], float8, float8, float8) (
493  sfunc = MADLIB_SCHEMA.lsvm_sgd_update,
494  stype = MADLIB_SCHEMA.lsvm_sgd_model_rec,
495  initcond = '({},1,0,0,0,0)'
496 );
497 
498 
499 -- This function stores a MADLIB_SCHEMA.svm_model_rec stored in model_temp_table into the model_table.
500 --
501 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_store_model(model_temp_table TEXT, model_name TEXT, model_table TEXT) RETURNS VOID AS $$
502 
503  sql = "SELECT COUNT(*) FROM " + model_temp_table + " WHERE id = \'" + model_name + "\'";
504  temp = plpy.execute(sql);
505  if (temp[0]['count'] == 0):
506  plpy.error("No support vector model with name " + model_name + " found.");
507 
508  sql = "SELECT (model).ind_dim, (model).nsvs" \
509  + " FROM " + model_temp_table + " WHERE id = '" + model_name + "'";
510  rv = plpy.execute(sql);
511  myind_dim = rv[0]['ind_dim'];
512  mynsvs = rv[0]['nsvs'];
513 
514  if (mynsvs == 0):
515  plpy.error("The specified model has no support vectors and was therefore not processed");
516 
517  idx = 0;
518  for i in range(1,mynsvs+1):
519  idx = myind_dim * (i-1);
520  sql = "INSERT INTO " + model_table \
521  + " SELECT \'" + model_name + "\', (model).weights[" + str(i) + "], " \
522  + " (model).individuals[(" + str(idx+1) + "):(" + str(idx) + "+" + str(myind_dim) + ")] " \
523  + " FROM " + model_temp_table + " WHERE id = \'" + model_name + "\' LIMIT 1";
524  plpy.execute(sql);
525 
526 $$ LANGUAGE plpythonu;
527 
528 /**
529  * @brief Drops all tables pertaining to a model
530  *
531  * @param model_table The table to be dropped.
532  */
533 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_drop_model(model_table TEXT) RETURNS VOID AS $$
534  plpy.execute("drop table if exists " + model_table)
535  plpy.execute("drop table if exists " + model_table + "_param")
536 $$ LANGUAGE plpythonu;
537 
538 CREATE TYPE MADLIB_SCHEMA.svm_model_pr AS ( model text, prediction float8 );
539 
540 /**
541  * @brief Evaluates a support-vector model on a given data point
542  *
543  * @param model_table The table storing the learned model \f$ f \f$ to be used
544  * @param ind The data point \f$ \boldsymbol x \f$
545  * @return This function returns \f$ f(\boldsymbol x) \f$
546  */
547 CREATE OR REPLACE FUNCTION
548 MADLIB_SCHEMA.svm_predict(model_table text, ind float8[]) RETURNS FLOAT8 AS $$
549 
550  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
551 
552  # schema_madlib comes from PythonFunctionBodyOnly
553  return online_sv.svm_predict(model_table, ind);
554 
555 $$ LANGUAGE plpythonu;
556 
557 /**
558  * @brief Evaluates multiple support-vector models on a data point
559  *
560  * @param model_table The table storing the learned models to be used.
561  * @param ind The data point \f$ \boldsymbol x \f$
562  * @return This function returns a table with a row for each model;
563  * the last row contains the average value over all models.
564  *
565  * The different models are assumed to be named <tt><em>model_table</em>1</tt>,
566  * <tt><em>model_table</em>2</tt>, ....
567  */
568 CREATE OR REPLACE FUNCTION
569 MADLIB_SCHEMA.svm_predict_combo(model_table text, ind float8[]) RETURNS SETOF MADLIB_SCHEMA.svm_model_pr AS $$
570 
571  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
572 
573  # schema_madlib comes from PythonFunctionBodyOnly
574  return online_sv.svm_predict_combo( schema_madlib, model_table, ind);
575 
576 $$ LANGUAGE plpythonu;
577 
578 
579 /**
580  * @brief This is the support vector regression function
581  *
582  * @param input_table The name of the table/view with the training data
583  * @param model_table The name of the table under which we want to store the learned model
584  * @param parallel A flag indicating whether the system should learn multiple models in parallel
585  * @param kernel_func Kernel function
586  * @return A summary of the learning process
587  *
588  * @internal
589  * @sa This function is a wrapper for online_sv::svm_regression().
590  */
591 CREATE OR REPLACE FUNCTION
592 MADLIB_SCHEMA.svm_regression(input_table text, model_table text, parallel bool, kernel_func text)
593 RETURNS SETOF MADLIB_SCHEMA.svm_reg_result
594 AS $$
595 
596  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
597 
598  # schema_madlib comes from PythonFunctionBodyOnly
599  return online_sv.svm_regression( schema_madlib, input_table, model_table, parallel, kernel_func);
600 
601 $$ LANGUAGE 'plpythonu';
602 
603 /**
604  * @brief This is the support vector regression function
605  *
606  * @param input_table The name of the table/view with the training data
607  * @param model_table The name of the table under which we want to store the learned model
608  * @param parallel A flag indicating whether the system should learn multiple models in parallel
609  * @param kernel_func Kernel function
610  * @param verbose Verbosity of reporting
611  * @param eta Learning rate in (0,1]
612  * @param nu Compression parameter in (0,1] associated with the fraction of training data that will become support vectors
613  * @param slambda Regularisation parameter
614  * @return A summary of the learning process
615  *
616  * @internal
617  * @sa This function is a wrapper for online_sv::svm_regression().
618  */
619 CREATE OR REPLACE FUNCTION
620 MADLIB_SCHEMA.svm_regression(input_table text, model_table text, parallel bool, kernel_func text, verbose bool, eta float8, nu float8, slambda float8)
621 RETURNS SETOF MADLIB_SCHEMA.svm_reg_result
622 AS $$
623 
624  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
625 
626  # schema_madlib comes from PythonFunctionBodyOnly
627  return online_sv.svm_regression( schema_madlib, input_table, model_table, parallel, kernel_func, verbose, eta, nu, slambda);
628 
629 $$ LANGUAGE 'plpythonu';
630 
631 /**
632  * @brief This is the support vector classification function
633  *
634  * @param input_table The name of the table/view with the training data
635  * @param model_table The name of the table under which we want to store the learned model
636  * @param parallel A flag indicating whether the system should learn multiple models in parallel
637  * @param kernel_func Kernel function
638  * @return A summary of the learning process
639  *
640  * @internal
641  * @sa This function is a wrapper for online_sv::svm_classification().
642  */
643 CREATE OR REPLACE FUNCTION
644 MADLIB_SCHEMA.svm_classification(input_table text, model_table text, parallel bool, kernel_func text)
645 RETURNS SETOF MADLIB_SCHEMA.svm_cls_result
646 AS $$
647 
648  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
649 
650  # schema_madlib comes from PythonFunctionBodyOnly
651  return online_sv.svm_classification( schema_madlib, input_table, model_table, parallel, kernel_func);
652 
653 $$ LANGUAGE 'plpythonu';
654 
655 /**
656  * @brief This is the support vector classification function
657  *
658  * @param input_table The name of the table/view with the training data
659  * @param model_table The name of the table under which we want to store the learned model
660  * @param parallel A flag indicating whether the system should learn multiple models in parallel
661  * @param kernel_func Kernel function
662  * @param verbose Verbosity of reporting
663  * @param eta Learning rate in (0,1]
664  * @param nu Compression parameter in (0,1] associated with the fraction of training data that will become support vectors
665  * @return A summary of the learning process
666  *
667  * @internal
668  * @sa This function is a wrapper for online_sv::svm_classification().
669  */
670 CREATE OR REPLACE FUNCTION
671 MADLIB_SCHEMA.svm_classification(input_table text, model_table text, parallel bool, kernel_func text, verbose bool, eta float8, nu float8)
672 RETURNS SETOF MADLIB_SCHEMA.svm_cls_result
673 AS $$
674 
675  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
676 
677  # schema_madlib comes from PythonFunctionBodyOnly
678  return online_sv.svm_classification( schema_madlib, input_table, model_table, parallel, kernel_func, verbose, eta, nu);
679 
680 $$ LANGUAGE 'plpythonu';
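A sketch of a call to this tuned variant, with illustrative table names and parameter values:

```sql
-- As the basic variant, but with quiet reporting (verbose = false),
-- learning rate eta = 0.1, and compression parameter nu = 0.005, so that
-- roughly 0.5% of the training points are retained as support vectors.
SELECT * FROM madlib.svm_classification('my_train_data', 'my_model',
                                        false, 'madlib.svm_dot',
                                        false, 0.1, 0.005);
```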
681 
682 /**
683  * @brief This is the support vector novelty detection function.
684  *
685  * @param input_table The name of the table/view with the training data
686  * @param model_table The name of the table under which we want to store the learned model
687  * @param parallel A flag indicating whether the system should learn multiple models in parallel
688  * @param kernel_func Kernel function
689  * @return A summary of the learning process
690  *
691  * @internal
692  * @sa This function is a wrapper for online_sv::svm_novelty_detection().
693  */
694 CREATE OR REPLACE FUNCTION
695 MADLIB_SCHEMA.svm_novelty_detection(input_table text, model_table text, parallel bool, kernel_func text)
696 RETURNS SETOF MADLIB_SCHEMA.svm_nd_result
697 AS $$
698 
699  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
700 
701  # schema_madlib comes from PythonFunctionBodyOnly
702  return online_sv.svm_novelty_detection( schema_madlib, input_table, model_table, parallel, kernel_func);
703 
704 $$ LANGUAGE 'plpythonu';
705 
706 /**
707  * @brief This is the support vector novelty detection function.
708  *
709  * @param input_table The name of the table/view with the training data
710  * @param model_table The name of the table under which we want to store the learned model
711  * @param parallel A flag indicating whether the system should learn multiple models in parallel
712  * @param kernel_func Kernel function
713  * @param verbose Verbosity of reporting
714  * @param eta Learning rate in (0,1]
715  * @param nu Compression parameter in (0,1] associated with the fraction of training data that will become support vectors
716  * @return A summary of the learning process
717  *
718  * @internal
719  * @sa This function is a wrapper for online_sv::svm_novelty_detection().
720  */
721 CREATE OR REPLACE FUNCTION
722 MADLIB_SCHEMA.svm_novelty_detection(input_table text, model_table text, parallel bool, kernel_func text, verbose bool, eta float8, nu float8)
723 RETURNS SETOF MADLIB_SCHEMA.svm_nd_result
724 AS $$
725 
726  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
727 
728  # schema_madlib comes from PythonFunctionBodyOnly
729  return online_sv.svm_novelty_detection( schema_madlib, input_table, model_table, parallel, kernel_func, verbose, eta, nu);
730 
731 $$ LANGUAGE 'plpythonu';
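A sketch of a one-class (novelty detection) training call; names are illustrative. The training table only needs <tt>(id, ind)</tt> columns, as produced by <tt>svm_generate_nd_data()</tt> below:

```sql
-- Learn a single novelty-detection model from unlabelled data.
SELECT * FROM madlib.svm_novelty_detection('my_nd_data', 'my_nd_model',
                                           false, 'madlib.svm_dot');
```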
732 
733 
734 /**
735  * @brief Scores the data points stored in a table using a learned support-vector model
736  *
737  * @param input_table Name of table/view containing the data points to be scored
738  * @param data_col Name of column in input_table containing the data points
739  * @param id_col Name of column in input_table containing the integer identifier of data points
740  * @param model_table Name of table where the learned model to be used is stored
741  * @param output_table Name of table to store the results
742  * @param parallel A flag indicating whether the model to be used was learned in parallel
743  * @return Textual summary of the algorithm run
744  *
745  * @internal
746  * @sa This function is a wrapper for online_sv::svm_predict_batch().
747  */
748 CREATE OR REPLACE FUNCTION
749 MADLIB_SCHEMA.svm_predict_batch(input_table text, data_col text, id_col text, model_table text, output_table text, parallel bool)
750 RETURNS TEXT
751 AS $$
752 
753  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
754 
755  # schema_madlib comes from PythonFunctionBodyOnly
756  return online_sv.svm_predict_batch( input_table, data_col, id_col, model_table, output_table, parallel);
757 
758 $$ LANGUAGE 'plpythonu';
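A scoring sketch, assuming a test table <tt>my_test_data</tt> with a <tt>float8[]</tt> column <tt>ind</tt> and an integer column <tt>id</tt>, and a model previously stored in <tt>my_model</tt>:

```sql
-- Score every row of my_test_data against the model in my_model and
-- write the predictions, keyed by id, to my_predictions.
SELECT madlib.svm_predict_batch('my_test_data', 'ind', 'id',
                                'my_model', 'my_predictions', false);
SELECT * FROM my_predictions ORDER BY id;
```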
759 
760 -- Generate artificial training data
761 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__svm_random_ind(d INT) RETURNS float8[] AS $$
762 DECLARE
763  ret float8[];
764 BEGIN
765  FOR i IN 1..(d-1) LOOP
766  ret[i] = RANDOM() * 40 - 20;
767  END LOOP;
768  IF (RANDOM() > 0.5) THEN
769  ret[d] = 10;
770  ELSE
771  ret[d] = -10;
772  END IF;
773  RETURN ret;
774 END
775 $$ LANGUAGE plpgsql;
776 
777 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__svm_random_ind2(d INT) RETURNS float8[] AS $$
778 DECLARE
779  ret float8[];
780 BEGIN
781  FOR i IN 1..d LOOP
782  ret[i] = RANDOM() * 5 + 10;
783  IF (RANDOM() > 0.5) THEN ret[i] = -ret[i]; END IF;
784  END LOOP;
785  RETURN ret;
786 END
787 $$ LANGUAGE plpgsql;
788 
789 
790 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_generate_reg_data(output_table text, num int, dim int) RETURNS VOID AS $$
791  plpy.execute("drop table if exists " + output_table)
792  plpy.execute("create table " + output_table + " ( id int, ind float8[], label float8 ) m4_ifdef(`__GREENPLUM__', `distributed by (id)')")
793  plpy.execute("INSERT INTO " + output_table + " SELECT a.val, MADLIB_SCHEMA.__svm_random_ind(" + str(dim) + "), 0 FROM (SELECT generate_series(1," + str(num) + ") AS val) AS a")
794  plpy.execute("UPDATE " + output_table + " SET label = MADLIB_SCHEMA.__svm_target_reg_func(ind)")
795 $$ LANGUAGE 'plpythonu';
796 
797 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__svm_target_reg_func(ind float8[]) RETURNS float8 AS $$
798 DECLARE
799  dim int;
800 BEGIN
801  dim = array_upper(ind,1);
802  IF (ind[dim] = 10) THEN RETURN 50; END IF;
803  RETURN -50;
804 END
805 $$ LANGUAGE plpgsql;
806 
807 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_generate_cls_data(output_table text, num int, dim int) RETURNS VOID AS $$
808  plpy.execute("drop table if exists " + output_table);
809  plpy.execute("create table " + output_table + " ( id int, ind float8[], label float8 ) m4_ifdef(`__GREENPLUM__', `distributed by (id)')")
810  plpy.execute("INSERT INTO " + output_table + " SELECT a.val, MADLIB_SCHEMA.__svm_random_ind(" + str(dim) + "), 0 FROM (SELECT generate_series(1," + str(num) + ") AS val) AS a")
811  plpy.execute("UPDATE " + output_table + " SET label = MADLIB_SCHEMA.__svm_target_cl_func(ind)")
812 $$ LANGUAGE 'plpythonu';
813 
814 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__svm_target_cl_func(ind float8[]) RETURNS float8 AS $$
815 BEGIN
816  IF (ind[1] > 0 AND ind[2] < 0) THEN RETURN 1; END IF;
817  RETURN -1;
818 END
819 $$ LANGUAGE plpgsql;
820 
821 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_generate_nd_data(output_table text, num int, dim int) RETURNS VOID AS $$
822  plpy.execute("drop table if exists " + output_table);
823  plpy.execute("create table " + output_table + " ( id int, ind float8[] ) m4_ifdef(`__GREENPLUM__', `distributed by (id)')")
824  plpy.execute("INSERT INTO " + output_table + " SELECT a.val, MADLIB_SCHEMA.__svm_random_ind2(" + str(dim) + ") FROM (SELECT generate_series(1," + str(num) + ") AS val) AS a")
825 $$ LANGUAGE 'plpythonu';
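The three generators above can be used to produce test fixtures; for example, for classification data (table name illustrative):

```sql
-- Create table svm_cls_data holding 1000 points of 4-dimensional data,
-- labelled +1/-1 by __svm_target_cl_func. Analogous calls exist for
-- regression (svm_generate_reg_data) and novelty-detection
-- (svm_generate_nd_data) fixtures.
SELECT madlib.svm_generate_cls_data('svm_cls_data', 1000, 4);
```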
826 
827 
828 /**
829  * @brief Normalizes the data stored in a table and saves the normalized data in a new table named <tt><em>input_table</em>_scaled</tt>.
830  *
831  * @param input_table Name of the table/view containing the data points to be normalized
832  */
833 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.svm_data_normalization(input_table TEXT) RETURNS VOID AS $$
834  output_table = input_table + "_scaled";
835  plpy.execute("DROP TABLE IF EXISTS " + output_table);
836  plpy.execute("CREATE TABLE " + output_table + " ( id int, ind float8[], label int ) m4_ifdef(`__GREENPLUM__', `distributed by (id)')");
837  plpy.execute("INSERT INTO " + output_table + " SELECT id, MADLIB_SCHEMA.svm_normalization(ind), label FROM " + input_table);
838  plpy.info("output table: %s" % output_table)
839 $$ LANGUAGE plpythonu;
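A sketch of a normalization call; the input table name is illustrative, and the output table name is derived from it as the function body shows:

```sql
-- Normalize each ind vector of my_train_data and write the result to a
-- new table my_train_data_scaled with the same (id, ind, label) layout.
SELECT madlib.svm_data_normalization('my_train_data');
```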
840 
841 
842 /**
843  * @brief This is the linear support vector classification function
844  *
845  * @param input_table The name of the table/view with the training data
846  * @param model_table The name of the table under which we want to store the learned model
847  * @param parallel A flag indicating whether the system should learn multiple models in parallel
848  * @return A summary of the learning process
849  *
850  * @internal
851  * @sa This function is a wrapper for online_sv::lsvm_classification().
852  */
853 CREATE OR REPLACE FUNCTION
854 MADLIB_SCHEMA.lsvm_classification(input_table text, model_table text, parallel bool)
855 RETURNS SETOF MADLIB_SCHEMA.lsvm_sgd_result
856 AS $$
857  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
858  # schema_madlib comes from PythonFunctionBodyOnly
859  return online_sv.lsvm_classification( schema_madlib, input_table, model_table, parallel);
860 $$ LANGUAGE 'plpythonu';
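A sketch of a linear-SVM training call with illustrative table names. Unlike the kernelized variants, no kernel argument is needed, since the learned model is a plain weight vector:

```sql
-- Learn a single linear classification model by stochastic gradient
-- descent, storing the result in my_lsvm_model.
SELECT * FROM madlib.lsvm_classification('my_train_data', 'my_lsvm_model',
                                         false);
```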
861 
862 
863 
864 /**
865  * @brief This is the linear support vector classification function
866  *
867  * @param input_table The name of the table/view with the training data
868  * @param model_table The name of the table under which we want to store the learned model
869  * @param parallel A flag indicating whether the system should learn multiple models in parallel
870  * @param verbose Verbosity of reporting
871  * @param eta Initial learning rate in (0,1]
872  * @param reg Regularization parameter, often chosen by cross-validation
873  * @return A summary of the learning process
874  *
875  * @internal
876  * @sa This function is a wrapper for online_sv::lsvm_classification().
877  */
878 CREATE OR REPLACE FUNCTION
879 MADLIB_SCHEMA.lsvm_classification(input_table text, model_table text, parallel bool, verbose bool, eta float8, reg float8)
880 RETURNS SETOF MADLIB_SCHEMA.lsvm_sgd_result
881 AS $$
882 
883  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
884 
885  # schema_madlib comes from PythonFunctionBodyOnly
886  return online_sv.lsvm_classification( schema_madlib, input_table, model_table, parallel, verbose, eta, reg);
887 
888 $$ LANGUAGE 'plpythonu';
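A sketch of the tuned linear-SVM variant; table names and parameter values are illustrative:

```sql
-- As the basic variant, but with quiet reporting (verbose = false),
-- initial learning rate eta = 0.1, and regularization parameter
-- reg = 0.001 (a value one would normally pick by cross-validation).
SELECT * FROM madlib.lsvm_classification('my_train_data', 'my_lsvm_model',
                                         false, false, 0.1, 0.001);
```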
889 
890 
891 /**
892  * @brief Scores the data points stored in a table using a learned linear support-vector model
893  *
894  * @param input_table Name of table/view containing the data points to be scored
895  * @param data_col Name of column in input_table containing the data points
896  * @param id_col Name of column in input_table containing the integer identifier of data points
897  * @param model_table Name of table where the learned model to be used is stored
898  * @param output_table Name of table to store the results
899  * @param parallel A flag indicating whether the model to be used was learned in parallel
900  * @return Textual summary of the algorithm run
901  *
902  * @internal
903  * @sa This function is a wrapper for online_sv::lsvm_predict_batch().
904  */
905 CREATE OR REPLACE FUNCTION
906 MADLIB_SCHEMA.lsvm_predict_batch(input_table text, data_col text, id_col text, model_table text, output_table text, parallel bool)
907 RETURNS TEXT
908 AS $$
909 
910  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
911 
912  # schema_madlib comes from PythonFunctionBodyOnly
913  return online_sv.lsvm_predict_batch( schema_madlib, input_table, data_col, id_col, model_table, output_table, parallel);
914 
915 $$ LANGUAGE 'plpythonu';
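A scoring sketch mirroring the kernelized <tt>svm_predict_batch()</tt> call above; names are illustrative:

```sql
-- Score every row of my_test_data against the linear model in
-- my_lsvm_model and write the predictions, keyed by id, to
-- my_lsvm_predictions.
SELECT madlib.lsvm_predict_batch('my_test_data', 'ind', 'id',
                                 'my_lsvm_model', 'my_lsvm_predictions',
                                 false);
```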
916 
917 
918 /**
919  * @brief Evaluates a linear support-vector model on a given data point
920  *
921  * @param model_table The table storing the learned model \f$ f \f$ to be used
922  * @param ind The data point \f$ \boldsymbol x \f$
923  * @return This function returns \f$ f(\boldsymbol x) \f$
924  */
925 CREATE OR REPLACE FUNCTION
926 MADLIB_SCHEMA.lsvm_predict(model_table text, ind float8[]) RETURNS FLOAT8 AS $$
928  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
929 
930  # schema_madlib comes from PythonFunctionBodyOnly
931  return online_sv.lsvm_predict(schema_madlib, model_table, ind);
932 
933 $$ LANGUAGE plpythonu;
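A single-point evaluation sketch; the model table name and coordinates are illustrative:

```sql
-- Evaluate the stored linear model f on one 4-dimensional point; for
-- binary classification, the sign of the result gives the predicted class.
SELECT madlib.lsvm_predict('my_lsvm_model', ARRAY[1.0, -2.0, 0.5, 3.0]);
```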
934 
935 /**
936  * @brief Evaluates multiple linear support-vector models on a data point
937  *
938  * @param model_table The table storing the learned models to be used.
939  * @param ind The data point \f$ \boldsymbol x \f$
940  * @return A table with one row for each model; the last row contains
941  * the average prediction over all models.
942  *
943  * The different models are assumed to be named <tt><em>model_table</em>0</tt>,
944  * <tt><em>model_table</em>1</tt>, ...
945  */
946 CREATE OR REPLACE FUNCTION
947 MADLIB_SCHEMA.lsvm_predict_combo(model_table text, ind float8[]) RETURNS SETOF MADLIB_SCHEMA.svm_model_pr AS $$
948 
949  PythonFunctionBodyOnly(`kernel_machines', `online_sv')
950 
951  # schema_madlib comes from PythonFunctionBodyOnly
952  return online_sv.lsvm_predict_combo( schema_madlib, model_table, ind);
953 
954 $$ LANGUAGE plpythonu;
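A sketch of evaluating an ensemble of parallel-learned models; names and coordinates are illustrative:

```sql
-- For models learned with parallel = true and stored as my_lsvm_model0,
-- my_lsvm_model1, ..., return one prediction row per model plus a final
-- row holding the average prediction.
SELECT * FROM madlib.lsvm_predict_combo('my_lsvm_model',
                                        ARRAY[1.0, -2.0, 0.5, 3.0]);
```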