User Documentation
 All Files Functions Groups
linear.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file linear.sql_in
4  *
5  * @brief SQL functions for linear regression
6  * @date January 2011
7  *
8  * @sa For a brief introduction to linear regression, see the module
9  * description \ref grp_linreg.
10  *
11  *//* ----------------------------------------------------------------------- */
12 
13 m4_include(`SQLCommon.m4')
14 
15 /**
16 @addtogroup grp_linreg
17 
18 @about
19 
20 Ordinary least-squares (OLS) linear regression refers to a stochastic model in
21 which the conditional mean of the dependent variable (usually denoted \f$ Y \f$)
22 is an affine function of the vector of independent variables (usually denoted
23 \f$ \boldsymbol x \f$). That is,
24 \f[
25  E[Y \mid \boldsymbol x] = \boldsymbol c^T \boldsymbol x
26 \f]
27 for some unknown vector of coefficients \f$ \boldsymbol c \f$. The assumption is
28 that the residuals are i.i.d. distributed Gaussians. That is, the (conditional)
29 probability density of \f$ Y \f$ is given by
30 \f[
31  f(y \mid \boldsymbol x)
32  = \frac{1}{\sqrt{2 \pi \sigma^2}}
33  \cdot \exp\left(-\frac{1}{2 \sigma^2}
34  \cdot (y - \boldsymbol x^T \boldsymbol c)^2 \right)
35  \,.
36 \f]
37 OLS linear regression finds the vector of coefficients \f$ \boldsymbol c \f$
38 that maximizes the likelihood of the observations.
39 
40 Let
41 - \f$ \boldsymbol y \in \mathbf R^n \f$ denote the vector of observed dependent
42  variables, with \f$ n \f$ rows, containing the observed values of the
43  dependent variable,
44 - \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
45  columns and \f$ n \f$ rows, containing all observed vectors of independent
46  variables.
47  \f$ \boldsymbol x_i \f$ as rows,
48 - \f$ X^T \f$ denote the transpose of \f$ X \f$,
49 - \f$ X^+ \f$ denote the pseudo-inverse of \f$ X \f$.
50 
51 Maximizing the likelihood is equivalent to maximizing the log-likelihood
52 \f$ \sum_{i=1}^n \log f(y_i \mid \boldsymbol x_i) \f$, which simplifies to
53 minimizing the <b>residual sum of squares</b> \f$ RSS \f$ (also called sum of
54 squared residuals or sum of squared errors of prediction),
55 \f[
56  RSS = \sum_{i=1}^n ( y_i - \boldsymbol c^T \boldsymbol x_i )^2
57  = (\boldsymbol y - X \boldsymbol c)^T (\boldsymbol y - X \boldsymbol c)
58  \,.
59 \f]
60 The first-order conditions yield that the \f$ RSS \f$ is minimized at
61 \f[
62  \boldsymbol c = (X^T X)^+ X^T \boldsymbol y
63  \,.
64 \f]
65 
66 Computing the <b>total sum of squares</b> \f$ TSS \f$, the <b>explained
67 sum of squares</b> \f$ ESS \f$ (also called the regression sum of
68 squares), and the <b>coefficient of determination</b> \f$ R^2 \f$ is
69 done according to the following formulas:
70 \f{align*}{
71  ESS & = \boldsymbol y^T X \boldsymbol c
72  - \frac{ \| y \|_1^2 }{n} \\
73  TSS & = \sum_{i=1}^n y_i^2
74  - \frac{ \| y \|_1^2 }{n} \\
75  R^2 & = \frac{ESS}{TSS}
76 \f}
77 Note: The last equality follows from the definition
78 \f$ R^2 = 1 - \frac{RSS}{TSS} \f$ and the fact that for linear regression
79 \f$ TSS = RSS + ESS \f$. A proof of the latter can be found, e.g., at:
80 http://en.wikipedia.org/wiki/Sum_of_squares
81 
82 We estimate the variance
83 \f$ Var[Y - \boldsymbol c^T \boldsymbol x \mid \boldsymbol x] \f$ as
84 \f[
85  \sigma^2 = \frac{RSS}{n - k}
86 \f]
87 and compute the t-statistic for coefficient \f$ i \f$ as
88 \f[
89  t_i = \frac{c_i}{\sqrt{\sigma^2 \cdot \left( (X^T X)^{-1} \right)_{ii} }}
90  \,.
91 \f]
92 
93 The \f$ p \f$-value for coefficient \f$ i \f$ gives the probability of seeing a
94 value at least as extreme as the one observed, provided that the null hypothesis
95 (\f$ c_i = 0 \f$) is true. Letting \f$ F_\nu \f$ denote the
96 cumulative density function of student-t with \f$ \nu \f$ degrees of freedom,
97 the \f$ p \f$-value for coefficient \f$ i \f$
98 is therefore
99 \f[
100  p_i = \Pr(|T| \geq |t_i|) = 2 \cdot (1 - F_{n - k}( |t_i| ))
101 \f]
102 where \f$ T \f$ is a student-t distributed random variable with mean 0.
103 
104 The condition number [2] \f$ \kappa(X) = \|X\|_2\cdot\|X^{-1}\|_2\f$ is computed
105 as the product of two spectral norms [3]. The spectral norm of a matrix \f$X\f$
106 is the largest singular value of \f$X\f$ i.e. the square root of the largest
107 eigenvalue of the positive-semidefinite matrix \f$X^{*}X\f$:
108 
109 \f[
110  \|X\|_2 = \sqrt{\lambda_{\max}\left(X^{*}X\right)}\ ,
111 \f]
112 where \f$X^{*}\f$ is the conjugate transpose of \f$X\f$.
113 The condition number of a linear regression problem
114 is a worst-case measure of how sensitive the
115 result is to small perturbations of the input. A large condition number (say,
116 more than 1000) indicates the presence of significant multicollinearity.
117 
118 @input
119 
120 The training data is expected to be of the following form:
121 <pre>{TABLE|VIEW} <em>sourceName</em> (
122  ...
123  <em>dependentVariable</em> FLOAT8,
124  <em>independentVariables</em> FLOAT8[],
125  ...
126 )</pre>
127 
128 @usage
129 
130 <b>(1) The Simple Interface</b>
131 
132 - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic statistics:
133 <pre>SELECT (madlib.\ref linregr(<em>dependentVariable</em>,
134  <em>independentVariables</em>)).*
135 FROM <em>sourceName</em>;</pre>
136  Output:
137  <pre>
138 coef | r2 | std_err | t_stats | p_values | condition_no
139 -----+----+---------+---------+----------+-------------
140  ...
141 </pre>
142 
143 - Get vector of coefficients \f$ \boldsymbol c \f$:\n
144  <pre>SELECT (madlib.\ref linregr(<em>dependentVariable</em>,
145  <em>independentVariables</em>)).coef
146 FROM <em>sourceName</em>;</pre>
147 
148 - Get a subset of the output columns, e.g., only the array of coefficients
149  \f$ \boldsymbol c \f$, the coefficient of determination \f$ R^2 \f$, and
150  the array of p-values \f$ \boldsymbol p \f$:
151  <pre>SELECT (lr).coef, (lr).r2, (lr).p_values
152 FROM (
153  SELECT madlib.\ref linregr(<em>dependentVariable</em>,
154  <em>independentVariables</em>) AS lr
155  FROM <em>sourceName</em>
156 ) AS subq;</pre>
157 
158 <b>(2) The Full Interface</b>
159 
160 The full interface support the analysis of heteroskedasticity of the linear fit.
161 
162 <pre>
163 SELECT madlib.\ref linregr_train (
164  <em>'source_table'</em>, -- name of input table, VARCHAR
165  <em>'out_table'</em>, -- name of output table, VARCHAR
166  <em>'dependent_varname'</em>, -- dependent variable, VARCHAR
167  <em>'independent_varname'</em>, -- independent variable, VARCHAR
168  [<em>group_cols</em>, -- names of columns to group by, VARCHAR[].
169  -- Default value: Null
170  [<em>heteroskedasticity_option</em>]] -- whether to analyze
171  -- heteroskedasticity,
172  -- BOOLEAN. Default value: False
173 );
174 </pre>
175 
176 Here the <em>'independent_varname'</em> can be the name of a column, which contains
177 array of numeric values. It can also have a format of string 'array[1, x1, x2, x3]',
178 where <em>x1</em>, <em>x2</em> and <em>x3</em> are all column names.
179 
180 Output is stored in the <em>out_table</em>:
181 <pre>
182 [ group_col_1 | group_col_2 | ... |] coef | r2 | std_err | t_stats | p_values | condition_no [|
183 -----------+-------------+-----+------+----+---------+---------+----------+--------------+---
184 
185 bp_stats | bp_p_value ]
186 -------------+---------
187 </pre>
188 
189 Where the first part <pre>[ group_col_1 | group_col_2 | ... |]</pre> presents
190 only when <em>group_cols</em> is not Null. The last part <pre>[ bp_stats | ... |
191 corrected_p_values ]</pre> presents only when <em>heteroskedasticity_option</em>
192 is <em>True</em>.
193 
194 When <em>group_cols</em> is given, the data is grouped by the given columns and
195 a linear model is fit to each group of data. The output will have additional
196 columns for all combinations of the values of all the <em>group_cols</em>. For
197 each combination of <em>group_cols</em> values, linear regression result is
198 shown.
199 
200 When <em>heteroskedasticity_option</em> is <em>True</em>, the output will have
201 additional columns. The function computes the Breusch–Pagan test [4] statistics
202 and the corresponding \f$p\f$-value.
203 
204 @examp
205 
206 The following example is taken from
207 http://www.stat.columbia.edu/~martin/W2110/SAS_7.pdf.
208 
209 -# Create the sample data set:
210 \verbatim
211 sql> CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
212  size INT, lot INT);
213 sql> COPY houses FROM STDIN WITH DELIMITER '|';
214  1 | 590 | 2 | 1 | 50000 | 770 | 22100
215  2 | 1050 | 3 | 2 | 85000 | 1410 | 12000
216  3 | 20 | 3 | 1 | 22500 | 1060 | 3500
217  4 | 870 | 2 | 2 | 90000 | 1300 | 17500
218  5 | 1320 | 3 | 2 | 133000 | 1500 | 30000
219  6 | 1350 | 2 | 1 | 90500 | 820 | 25700
220  7 | 2790 | 3 | 2.5 | 260000 | 2130 | 25000
221  8 | 680 | 2 | 1 | 142500 | 1170 | 22000
222  9 | 1840 | 3 | 2 | 160000 | 1500 | 19000
223  10 | 3680 | 4 | 2 | 240000 | 2790 | 20000
224  11 | 1660 | 3 | 1 | 87000 | 1030 | 17500
225  12 | 1620 | 3 | 2 | 118600 | 1250 | 20000
226  13 | 3100 | 3 | 2 | 140000 | 1760 | 38000
227  14 | 2070 | 2 | 3 | 148000 | 1550 | 14000
228  15 | 650 | 3 | 1.5 | 65000 | 1450 | 12000
229 \.
230 \endverbatim
231 -# You can call the linregr() function for an individual metric:
232 \verbatim
233 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).coef FROM houses;
234  coef
235 ------------------------------------------------------------------------
236  {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
237 (1 row)
238 
239 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).r2 FROM houses;
240  r2
241 -------------------
242  0.745374010140315
243 (1 row)
244 
245 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).std_err FROM houses;
246  std_err
247 ----------------------------------------------------------------------
248  {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
249 (1 row)
250 
251 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).t_stats FROM houses;
252  t_stats
253 ------------------------------------------------------------------------
254  {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
255 (1 row)
256 
257 sql> SELECT (linregr(price, array[1, bedroom, bath, size])).p_values FROM houses;
258  p_values
259 -----------------------------------------------------------------------------
260  {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
261 (1 row)
262 \endverbatim
263 -# Alternatively you can call the linreg() function for the full record:
264 \verbatim
265 sql> \x on
266 Expanded display is on.
267 sql> SELECT (r).* FROM (SELECT linregr(price, array[1, bedroom, bath, size])
268  AS r FROM houses) q;
269 -[ RECORD 1 ]+-----------------------------------------------------------------
270 coef | {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
271 r2 | 0.745374010140315
272 std_err | {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
273 t_stats | {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
274 p_values | {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
275 condition_no | 9783.018
276 
277 \endverbatim
278 
279 -# You can call linregr_train() function for more functionality
280 \verbatim
281 sql> SELECT madlib.linregr_train('houses', 'result', 'price',
282  'array[1, tax, bath, size]',
283  '{bedroom}'::varchar[], True);
284 
285 sql> SELECT * from result;
286 -[ RECORD 1]---------+-------------------------------------------------------
287 bedroom | 2
288 coef | {-84242.0345, 55.4430, -78966.9754, 225.6119}
289 r2 | 0.9688
290 std_err | {35019.00, 19.57, 23036.81, 49.04}
291 t_stats | {-2.406, 2.833, -3.428, 4.600}
292 p_values | {0.251, 0.216, 0.181, 0.136}
293 condition_no | 10086.1
294 bp_stats | 2.5451
295 bp_p_value | 0.4672
296 
297 -[ RECORD 2]---------+------------------------------------------------------
298 bedroom | 3
299 coef | {-88155.8292502747,27.1966436293179,41404.0293389239,62.6375210724027}
300 r2 | 0.841699901312963
301 std_err | {57867.9999699512,17.82723091538,43643.1321521931,70.8506824870639}
302 t_stats | {-1.52339512850022,1.52556747362568,0.948695185179172,0.884077878626493}
303 p_values | {0.18816143289241,0.187636685729725,0.38634003235866,0.417132778730133}
304 condition_no | 11722.62
305 bp_stats | 6.7538
306 bp_p_value | 0.08017
307 
308 -[ RECORD 3]---------+-------------------------------------------------------
309 bedroom | 4
310 coef | {0.0112536020318378,41.4132554771633,0.0225072040636757,31.3975496688276}
311 r2 | 1
312 std_err | {0,0,0,0}
313 t_stats | {Infinity,Infinity,Infinity,Infinity}
314 p_values | Null
315 condition_no | Null
316 bp_stats | Null
317 bp_p_value | Null
318 
319 \endverbatim
320 
321 @literature
322 
323 [1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 21 October
324  2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/17/lecture-17.pdf
325 
326 [2] Wikipedia: Condition Number, http://en.wikipedia.org/wiki/Condition_number.
327 
328 [3] Wikipedia: Spectral Norm,
329  http://en.wikipedia.org/wiki/Spectral_norm#Spectral_norm
330 
331 [4] Wikipedia: Breusch–Pagan test,
332  http://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test
333 
334 [5] Wikipedia: Heteroscedasticity-consistent standard errors,
335 http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors
336 
337 @sa File linear.sql_in documenting the SQL functions.
338 
339 @internal
340 @sa Namespace \ref madlib::modules::regress
341  documenting the implementation in C++
342 @endinternal
343 */
344 ---------------------------------------------------------------------------
345 CREATE TYPE MADLIB_SCHEMA.linregr_result AS (
346  coef DOUBLE PRECISION[],
347  r2 DOUBLE PRECISION,
348  std_err DOUBLE PRECISION[],
349  t_stats DOUBLE PRECISION[],
350  p_values DOUBLE PRECISION[],
351  condition_no DOUBLE PRECISION
352 );
353 
354 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.linregr_transition(
355  state MADLIB_SCHEMA.bytea8,
356  y DOUBLE PRECISION,
357  x DOUBLE PRECISION[])
358 RETURNS MADLIB_SCHEMA.bytea8
359 AS 'MODULE_PATHNAME'
360 LANGUAGE C
361 IMMUTABLE STRICT;
362 
363 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.linregr_merge_states(
364  state1 MADLIB_SCHEMA.bytea8,
365  state2 MADLIB_SCHEMA.bytea8)
366 RETURNS MADLIB_SCHEMA.bytea8
367 AS 'MODULE_PATHNAME'
368 LANGUAGE C
369 IMMUTABLE STRICT;
370 
371 -- Final functions
372 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.linregr_final(
373  state MADLIB_SCHEMA.bytea8)
374 RETURNS MADLIB_SCHEMA.linregr_result
375 AS 'MODULE_PATHNAME'
376 LANGUAGE C IMMUTABLE STRICT;
377 
378 
379 CREATE TYPE MADLIB_SCHEMA.heteroskedasticity_test_result AS (
380  bp_stats DOUBLE PRECISION,
381  bp_p_value DOUBLE PRECISION
382 );
383 
384 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.hetero_linregr_transition(
385  state MADLIB_SCHEMA.bytea8,
386  y DOUBLE PRECISION,
387  x DOUBLE PRECISION[],
388  coef DOUBLE PRECISION[])
389 RETURNS MADLIB_SCHEMA.bytea8
390 AS 'MODULE_PATHNAME'
391 LANGUAGE C
392 IMMUTABLE STRICT;
393 
394 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.hetero_linregr_merge_states(
395  state1 MADLIB_SCHEMA.bytea8,
396  state2 MADLIB_SCHEMA.bytea8)
397 RETURNS MADLIB_SCHEMA.bytea8
398 AS 'MODULE_PATHNAME'
399 LANGUAGE C
400 IMMUTABLE STRICT;
401 
402 -- Final functions
403 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.hetero_linregr_final(
404  state MADLIB_SCHEMA.bytea8)
405 RETURNS MADLIB_SCHEMA.heteroskedasticity_test_result
406 AS 'MODULE_PATHNAME'
407 LANGUAGE C IMMUTABLE STRICT;
408 
409 /**
410  * @brief Compute studentized Breuch-Pagan heteroskedasticity test for
411  * linear regression.
412  *
413  * @param dependentVariable Column containing the dependent variable
414  * @param independentVariables Column containing the array of independent variables
415  * @param olsCoefficients Column containing the array of the OLS coefficients (as obtained by linregr)
416  *
417  * @par
418  * To include an intercept in the model, set one coordinate in the
419  * <tt>independentVariables</tt> array to 1.
420  *
421  * @return A composite value:
422  * - <tt>test_statistic FLOAT8[]</tt> - Prob > test_statistc
423  * - <tt>p_value FLOAT8[]</tt> - Prob > test_statistc
424  *
425  * @usage
426  * <pre> SELECT (heteoskedasticity_test_linregr(<em>dependentVariable</em>,
427  * <em>independentVariables</em>, coef)).*
428  * FROM (
429  * SELECT linregr(<em>dependentVariable</em>, <em>independentVariables</em>).coef
430  * ) AS ols_coef, <em>sourceName</em> as src;
431  * </pre>
432  */
433 CREATE AGGREGATE MADLIB_SCHEMA.heteroskedasticity_test_linregr(
434  /*+ "dependentVariable" */ DOUBLE PRECISION,
435  /*+ "independentVariables" */ DOUBLE PRECISION[],
436  /*+ "olsCoefficients" */ DOUBLE PRECISION[]) (
437 
438  SFUNC=MADLIB_SCHEMA.hetero_linregr_transition,
439  STYPE=MADLIB_SCHEMA.bytea8,
440  FINALFUNC=MADLIB_SCHEMA.hetero_linregr_final,
441  m4_ifdef(`GREENPLUM',`prefunc=MADLIB_SCHEMA.hetero_linregr_merge_states,')
442  INITCOND=''
443 );
444 
445 /**
446  * @brief Compute linear regression coefficients and diagnostic statistics.
447  *
448  * @param dependentVariable Column containing the dependent variable
449  * @param independentVariables Column containing the array of independent variables
450  *
451  * @par
452  * To include an intercept in the model, set one coordinate in the
453  * <tt>independentVariables</tt> array to 1.
454  *
455  * @return A composite value:
456  * - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
457  * - <tt>r2 FLOAT8</tt> - Coefficient of determination, \f$ R^2 \f$
458  * - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
459  * \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
460  * - <tt>t_stats FLOAT8[]</tt> - Array of t-statistics, \f$ \boldsymbol t \f$
461  * - <tt>p_values FLOAT8[]</tt> - Array of p-values, \f$ \boldsymbol p \f$
462  * - <tt>condition_no FLOAT8</tt> - The condition number of matrix
463  * \f$ X^T X \f$.
464  *
465  * @usage
466  * - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
467  * statistics:\n
468  * <pre>SELECT (linregr(<em>dependentVariable</em>,
469  * <em>independentVariables</em>)).*
470  *FROM <em>sourceName</em>;</pre>
471  * - Get vector of coefficients \f$ \boldsymbol c \f$:\n
472  * <pre>SELECT (linregr(<em>dependentVariable</em>,
473  * <em>independentVariables</em>)).coef
474  *FROM <em>sourceName</em>;</pre>
475  * - Get a subset of the output columns, e.g., only the array of coefficients
476  * \f$ \boldsymbol c \f$, the coefficient of determination \f$ R^2 \f$, and
477  * the array of p-values \f$ \boldsymbol p \f$:
478  * <pre>SELECT (lr).coef, (lr).r2, (lr).p_values
479  *FROM (
480  * SELECT linregr( <em>dependentVariable</em>,
481  * <em>independentVariables</em>) AS lr
482  * FROM <em>sourceName</em>
483  *) AS subq;</pre>
484  */
485 
486 CREATE AGGREGATE MADLIB_SCHEMA.linregr(
487  /*+ "dependentVariable" */ DOUBLE PRECISION,
488  /*+ "independentVariables" */ DOUBLE PRECISION[]) (
489 
490  SFUNC=MADLIB_SCHEMA.linregr_transition,
491  STYPE=MADLIB_SCHEMA.bytea8,
492  FINALFUNC=MADLIB_SCHEMA.linregr_final,
493  m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.linregr_merge_states,')
494  INITCOND=''
495 );
496 
497 --------------------------- INTERNAL ---------------------------------------
498 
499 
500 /**
501  * @brief Return heteroskedasticity values for specific linear regression
502  * coefficients
503 **/
504 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_hsk_result(
505  source_table VARCHAR -- name of input table
506  , dependent_varname VARCHAR -- name of dependent variable
507  , independent_varname VARCHAR -- name of independent variable
508  , linregr_coeffs DOUBLE PRECISION[] -- coeffs from linear regression
509 )
510 RETURNS MADLIB_SCHEMA.heteroskedasticity_test_result AS $$
511 DECLARE
512 hsk_value MADLIB_SCHEMA.heteroskedasticity_test_result;
513 BEGIN
514  EXECUTE '
515  SELECT (MADLIB_SCHEMA.heteroskedasticity_test_linregr('
516  || dependent_varname || ' , '
517  || independent_varname || ' , '
518  || 'ARRAY[' || array_to_string(linregr_coeffs, ',') || '])).*
519  FROM ' || source_table
520  INTO hsk_value;
521 
522  RETURN hsk_value;
523 END
524 $$ LANGUAGE plpgsql VOLATILE;
525 /**
526  * @brief Return linear regression output for source data
527  *
528 **/
529 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_linreg_result(
530  source_table VARCHAR -- name of input table
531  , dependent_varname VARCHAR -- name of dependent variable
532  , independent_varname VARCHAR -- name of independent variable
533 )
534 RETURNS MADLIB_SCHEMA.linregr_result AS $$
535 DECLARE
536 lin_rst MADLIB_SCHEMA.linregr_result;
537 BEGIN
538  EXECUTE '
539  SELECT (MADLIB_SCHEMA.linregr( '
540  || dependent_varname || ' , '
541  || independent_varname || ')
542  ).*
543  FROM ' || source_table
544  INTO lin_rst;
545  RETURN lin_rst;
546 END
547 $$ LANGUAGE plpgsql VOLATILE;
548 
549 
550 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_linregr_insert_string(
551  lin_rst MADLIB_SCHEMA.linregr_result,
552  out_table TEXT
553 )
554 RETURNS VARCHAR AS $$
555 DECLARE
556  insert_string VARCHAR;
557 BEGIN
558  insert_string := 'INSERT INTO ' || out_table || ' VALUES (';
559  insert_string := insert_string ||
560  CASE
561  WHEN (lin_rst).coef is NULL
562  THEN '''{}'','
563  ELSE 'ARRAY[' || array_to_string((lin_rst).coef, ',') || '], '
564  END ||
565  CASE
566  WHEN (lin_rst).r2 is NULL
567  THEN '0.0,'
568  ELSE (lin_rst).r2 || ','
569  END ||
570  CASE
571  WHEN (lin_rst).std_err is NULL
572  THEN '''{}'','
573  ELSE 'ARRAY[' || array_to_string((lin_rst).std_err, ',') || '], '
574  END ||
575  CASE
576  WHEN (lin_rst).t_stats is NULL
577  THEN '''{}'','
578  ELSE 'ARRAY[' || array_to_string((lin_rst).t_stats, ',') || '], '
579  END ||
580  CASE
581  WHEN (lin_rst).p_values is NULL
582  THEN '''{}'','
583  ELSE 'ARRAY[' || array_to_string((lin_rst).p_values, ',') || '], '
584  END ||
585  CASE
586  WHEN (lin_rst).condition_no is NULL
587  THEN '0.0'
588  ELSE (lin_rst).condition_no
589  END;
590  RETURN insert_string;
591 END;
592 $$ LANGUAGE plpgsql VOLATILE;
593 
594 /**
595  * @brief Compute linear regression coefficients and heterskedasticity values
596  *
597  **/
598  CREATE FUNCTION MADLIB_SCHEMA.__internal_linregr_train_hetero(
599  source_table VARCHAR -- name of input table
600  , out_table VARCHAR -- name of output table
601  , dependent_varname VARCHAR -- name of dependent variable
602  , independent_varname VARCHAR -- name of independent variable
603  , heteroskedasticity_option BOOLEAN -- do you want heteroskedasticity output
604 )
605 RETURNS VOID AS $$
606 DECLARE
607 insert_string VARCHAR;
608 lin_rst MADLIB_SCHEMA.linregr_result;
609 hsk_value MADLIB_SCHEMA.heteroskedasticity_test_result;
610 BEGIN
611  IF (source_table IS NULL OR source_table = '') THEN
612  RAISE EXCEPTION 'Invalid input table name given.';
613  END IF;
614  IF (out_table IS NULL OR out_table = '') THEN
615  RAISE EXCEPTION 'Invalid output table name given.';
616  END IF;
617  IF (dependent_varname IS NULL OR dependent_varname = '') THEN
618  RAISE EXCEPTION 'Invalid dependent variable name given.';
619  END IF;
620  IF (independent_varname IS NULL OR independent_varname = '') THEN
621  RAISE EXCEPTION 'Invalid independent variable name given.';
622  END IF;
623  IF (MADLIB_SCHEMA.__table_exists(out_table)) THEN
624  RAISE EXCEPTION 'Output table name already exists. Drop the table before calling the function.';
625  END IF;
626  -- create output table with appropriate column names
627  EXECUTE 'DROP TABLE IF EXISTS ' || out_table;
628  EXECUTE '
629  CREATE TABLE ' || out_table || ' (
630  coef DOUBLE PRECISION[],
631  r2 DOUBLE PRECISION,
632  std_err DOUBLE PRECISION[],
633  t_stats DOUBLE PRECISION[],
634  p_values DOUBLE PRECISION[],
635  condition_no DOUBLE PRECISION)';
636  IF heteroskedasticity_option THEN
637  -- Alter output table to add heteroskedasticity values
638  EXECUTE '
639  ALTER TABLE ' || out_table || '
640  ADD COLUMN bp_stats DOUBLE PRECISION,
641  ADD COLUMN bp_p_value DOUBLE PRECISION';
642  END IF;
643  -- compute linear regression and heteroskedasticity values (if required)
644  lin_rst := MADLIB_SCHEMA.__internal_get_linreg_result(
645  source_table, dependent_varname, independent_varname);
646  insert_string := MADLIB_SCHEMA.__internal_get_linregr_insert_string(
647  lin_rst, out_table);
648  -- Ensure Infinity and NaN are cast properly
649  insert_string := REGEXP_REPLACE(insert_string, 'Infinity',
650  '''Infinity''::double precision', 'gi');
651  insert_string := REGEXP_REPLACE(insert_string, 'NaN',
652  '''NaN''::double precision', 'gi');
653  IF heteroskedasticity_option THEN
654  -- add hsk values in the sql string and execute
655  hsk_value := MADLIB_SCHEMA.__internal_get_hsk_result(
656  source_table, dependent_varname,
657  independent_varname, (lin_rst).coef);
658  EXECUTE
659  insert_string || ','
660  || (hsk_value).bp_stats || ','
661  || (hsk_value).bp_p_value || ')';
662  ELSE
663  -- complete the sql string and execute
664  EXECUTE insert_string || ')';
665  END IF;
666 END
667 $$ LANGUAGE plpgsql VOLATILE;
668 ---------------------------------------------------------------------------
669 /**
670  * @brief Compute linear regression coefficients and insert into an output
671  * table. The function drops the table if it already exists before creating
672  * the table.
673  *
674 **/
675 CREATE FUNCTION MADLIB_SCHEMA.linregr_train(
676  source_table VARCHAR -- name of input table
677  , out_table VARCHAR -- name of output table
678  , dependent_varname VARCHAR -- name of dependent variable
679  , independent_varname VARCHAR -- name of independent variable
680  )
681 RETURNS VOID AS $$
682 BEGIN
683  PERFORM MADLIB_SCHEMA.__internal_linregr_train_hetero(
684  source_table, out_table, dependent_varname, independent_varname, False);
685  -- RAISE NOTICE '
686  -- Finished linear regression
687  -- * table : % (%, %)
688  -- Output:
689  -- * view : SELECT * FROM % ;', source_table, dependent_varname,
690  -- independent_varname, out_table;
691 END;
692 $$ LANGUAGE plpgsql VOLATILE;
693 
694 --------------------- GROUPING ---------------------------------------------
695 /**
696  * @brief Linear regression training function with grouping support and
697  * option for heteroskedasticity values.
698  **/
699 CREATE FUNCTION MADLIB_SCHEMA.linregr_train(
700  source_table VARCHAR -- name of input table
701  , out_table VARCHAR -- name of output table
702  , dependent_varname VARCHAR -- name of dependent variable
703  , independent_varname VARCHAR -- name of independent variable
704  , input_group_cols VARCHAR -- names of columns to group-by
705  , heteroskedasticity_option BOOLEAN -- heteroskedasticity
706  )
707 RETURNS VOID AS $$
708 DECLARE
709  input_table_name VARCHAR[];
710  group_cols VARCHAR[];
711  actual_table_name VARCHAR;
712  schema_name VARCHAR;
713  table_creation_string VARCHAR;
714  group_string VARCHAR;
715  group_array_length INTEGER;
716  col_data_type VARCHAR;
717  each_group INTEGER;
718  linregr_fitting_rst VARCHAR;
719  old_msg_level TEXT;
720 BEGIN
721 
722  EXECUTE 'SELECT setting FROM pg_settings WHERE name=''client_min_messages''' INTO old_msg_level;
723  EXECUTE 'SET client_min_messages TO warning';
724 
725  IF (source_table IS NULL OR source_table = '') THEN
726  RAISE EXCEPTION 'Invalid input table name given.';
727  END IF;
728  IF (out_table IS NULL OR out_table = '') THEN
729  RAISE EXCEPTION 'Invalid output table name given.';
730  END IF;
731  IF (dependent_varname IS NULL OR dependent_varname = '') THEN
732  RAISE EXCEPTION 'Invalid dependent variable name given.';
733  END IF;
734  IF (independent_varname IS NULL OR independent_varname = '') THEN
735  RAISE EXCEPTION 'Invalid independent variable name given.';
736  END IF;
737 
738  IF (NOT MADLIB_SCHEMA.__table_exists(source_table)) THEN
739  RAISE EXCEPTION 'Source table name does not exist.';
740  END IF;
741 
742  IF (MADLIB_SCHEMA.__table_exists(out_table)) THEN
743  RAISE EXCEPTION 'Output table name already exists. Drop the table before calling the function.';
744  END IF;
745 
746 
747  -- initial validation
748  IF (input_group_cols IS NULL)
749  --OR array_upper(input_group_cols, 1) IS NULL
750  --OR array_upper(input_group_cols, 1) = 0)
751  THEN
752  PERFORM MADLIB_SCHEMA.__internal_linregr_train_hetero(
753  source_table, out_table,
754  dependent_varname, independent_varname,
755  heteroskedasticity_option);
756  ELSE
757  group_cols = MADLIB_SCHEMA._string_to_array(input_group_cols);
758  -- create output table
759  EXECUTE 'DROP TABLE IF EXISTS ' || out_table;
760  table_creation_string := 'CREATE TABLE ' || out_table || '(';
761  group_array_length = array_upper(group_cols, 1);
762 
763  input_table_name = regexp_split_to_array(source_table, E'\\.');
764  IF array_upper(input_table_name, 1) = 1 THEN
765  actual_table_name = input_table_name[1];
766  schema_name := current_schema();
767  ELSIF array_upper(input_table_name, 1) = 2 THEN
768  actual_table_name = input_table_name[2];
769  schema_name = input_table_name[1];
770  ELSE
771  RAISE EXCEPTION 'Incorrect input source table name provided';
772  END IF;
773 
774  -- Check that each grouping column exists
775  FOR each_group in 1 .. group_array_length
776  LOOP
777  if NOT MADLIB_SCHEMA.check_if_col_exists(source_table,
778  group_cols[each_group]) THEN
779  RAISE EXCEPTION 'Grouping column % does not exist',
780  group_cols[each_group];
781  END IF;
782 
783 
784  END LOOP;
785 
786  FOR each_group in 1 .. group_array_length
787  LOOP
788  -- create a string that makes list of
789  EXECUTE 'SELECT data_type FROM information_schema.columns
790  WHERE
791  table_schema = ''' || schema_name || '''
792  AND table_name = ''' || actual_table_name || '''
793  AND column_name= ''' || group_cols[each_group] || ''''
794  INTO col_data_type;
795 
796  table_creation_string := table_creation_string
797  || group_cols[each_group]
798  || ' ' || col_data_type || ',';
799  END LOOP;
800 
801  -- finish creating the output table
802  EXECUTE table_creation_string || '
803  coef DOUBLE PRECISION[],
804  r2 DOUBLE PRECISION,
805  std_err DOUBLE PRECISION[],
806  t_stats DOUBLE PRECISION[],
807  p_values DOUBLE PRECISION[],
808  condition_no DOUBLE PRECISION)';
809  IF heteroskedasticity_option THEN
810  EXECUTE 'ALTER TABLE ' || out_table || '
811  ADD COLUMN bp_stats DOUBLE PRECISION,
812  ADD COLUMN bp_p_value DOUBLE PRECISION';
813  END IF;
814 
815  group_string := '';
816  FOR each_group in 1 .. (group_array_length-1)
817  LOOP
818  group_string := group_string || group_cols[each_group] || ',';
819  END LOOP;
820  group_string := group_string || group_cols[group_array_length];
821 
822  IF heteroskedasticity_option THEN
823  linregr_fitting_rst := MADLIB_SCHEMA.__unique_string();
824  EXECUTE '
825  DROP TABLE IF EXISTS '|| linregr_fitting_rst ||';
826  CREATE TEMP TABLE '|| linregr_fitting_rst ||' AS
827  SELECT
828  '|| group_string ||',
829  (MADLIB_SCHEMA.linregr('|| dependent_varname ||','|| independent_varname ||')).*
830  FROM '|| source_table ||'
831  GROUP BY '|| group_string;
832 
833  EXECUTE '
834  INSERT INTO ' || out_table || '
835  SELECT *
836  FROM
837  '|| linregr_fitting_rst ||'
838  JOIN (
839  SELECT
840  '|| group_string ||',
841  (MADLIB_SCHEMA.heteroskedasticity_test_linregr('
842  || dependent_varname || ','
843  || independent_varname || ', t.coef)).*
844  FROM
845  '|| source_table ||' AS s
846  JOIN
847  '|| linregr_fitting_rst ||' AS t
848  USING (' || group_string || ')
849  GROUP BY ' || group_string ||') z
850  USING ('|| group_string ||')';
851 
852  EXECUTE 'DROP TABLE IF EXISTS '|| linregr_fitting_rst;
853  ELSE
854  EXECUTE '
855  INSERT INTO ' || out_table || '
856  SELECT ' || group_string || ', (result).coef, (result).r2,
857  (result).std_err, (result).t_stats,
858  (result).p_values, (result).condition_no
859  FROM (
860  SELECT ' || group_string ||
861  ', MADLIB_SCHEMA.linregr( '
862  || dependent_varname || ' , '
863  || independent_varname || ' )
864  AS result
865  FROM ' || source_table || '
866  GROUP BY ' || group_string ||
867  ') subq';
868  END IF;
869  END IF;
870 
871  EXECUTE 'SET client_min_messages TO '|| old_msg_level;
872 END;
873 $$ LANGUAGE plpgsql VOLATILE;
874 ---------------------------------------------------------------------------
875 /**
876  * @brief Linear regression training function with grouping support.
877  **/
878 CREATE FUNCTION MADLIB_SCHEMA.linregr_train(
879  source_table VARCHAR -- name of input table
880  , out_table VARCHAR -- name of output table
881  , dependent_varname VARCHAR -- name of dependent variable
882  , independent_varname VARCHAR -- name of independent variable
883  , group_cols VARCHAR -- names of columns to group-by
884  )
885 RETURNS VOID AS $$
886 BEGIN
887  PERFORM MADLIB_SCHEMA.linregr_train( source_table, out_table,
888  dependent_varname,
889  independent_varname, group_cols, FALSE);
890 END;
891 $$ LANGUAGE plpgsql VOLATILE;
892 ---------------------------------------------------------------------------