User Documentation
 All Files Functions Groups
cox_prop_hazards.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file cox_prop_hazards.sql_in
4  *
5  * @brief SQL functions for cox proportional hazards
6  * @date July 2012
7  *
8  * @sa For a brief introduction to cox regression, see the
9  * module description \ref grp_cox_prop_hazards
10  *
11  *//* ----------------------------------------------------------------------- */
12 
13 m4_include(`SQLCommon.m4')
14 
15 /**
16 @addtogroup grp_cox_prop_hazards
17 
18 @about
19 Proportional-Hazard models enable the comparison of various survival models.
20 These survival models are functions describing the probability of an one-item
21 event (prototypically, this event is death) with respect to time.
22 The interval of time before death occurs is the survival time.
23 Let T be a random variable representing the survival time,
24 with a cumulative probability function P(t). Informally, P(t) is
25 the probability that death has happened before time t.
26 
27 Generally, applications start with a list of \f$ \boldsymbol n \f$ observations,
28 each with \f$ \boldsymbol m \f$ covariates and a time of death. From this
29 \f$ \boldsymbol n \times m \f$ matrix, we would like to derive the correlation
30 between the covariates and the hazard function. This amounts to finding
31 the parameters \f$ \boldsymbol \beta \f$ that best fit the model described below.
32 
33 Let us define:
34 - \f$ \boldsymbol t \in \mathbf R^{m} \f$ denote the vector of observed dependent
35  variables, with \f$ n \f$ rows.
36 - \f$ X \in \mathbf R^{m} \f$ denote the design matrix with \f$ m \f$
37  columns and \f$ n \f$ rows, containing all observed vectors of independent
38  variables \f$ \boldsymbol x_i \f$ as rows.
39 - \f$ R(t_i) \f$ denote the set of observations still alive at time \f$ t_i \f$
40 
41 Note that this model <b>does not</b> include a <b>constant term</b>, and the data
42 cannot contain a column of 1s.
43 
44 By definition,
45 \f[
46  P[T_k = t_i | \boldsymbol R(t_i)]
47  = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}.
48  \,.
49 \f]
50 
51 The <b>partial likelihood </b>function can now be generated as the product of
52 conditional probabilities:
53 \f[
54 \mathcal L = \prod_{i = 1}^n
55  \left(
56  \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}
57  \right).
58 \f]
59 
60 The log-likelihood form of this equation is
61 \f[
62 L = \sum_{i = 1}^n
63  \left[ \beta^T x_i
64  - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right)
65  \right].
66 \f]
67 
68 Using this score function and Hessian matrix, the partial likelihood can be
69 maximized using the <b> Newton-Raphson algorithm </b>.<b> Breslow's method </b>
70 is used to resolved tied times of deaths. The time of death for two records are
71 considered "equal" if they differ by less than 1.0e-6
72 
73 The inverse of the Hessian matrix, evaluated at the estimate of
74 \f$ \boldsymbol \beta \f$, can be used as an <b>approximate variance-covariance
75 matrix </b> for the estimate, and used to produce approximate
76 <b>standard errors</b> for the regression coefficients.
77 
78 \f[
79  \mathit{se}(c_i) = \left( (H)^{-1} \right)_{ii}
80  \,.
81 \f]
82 The Wald z-statistic is
83 \f[
84  z_i = \frac{c_i}{\mathit{se}(c_i)}
85  \,.
86 \f]
87 
88 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
89 the assumptions inherent in the Wald test) of seeing a value at least as extreme
90 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
91 true. Letting \f$ F \f$ denote the cumulative density function of a standard
92 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is
93 therefore
94 \f[
95  p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
96 \f]
97 where \f$ Z \f$ is a standard normally distributed random variable.
98 
99 
100 The condition number is computed as \f$ \kappa(H) \f$ during the iteration
101 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
102 the coefficients of the previous iteration). A large condition number (say, more
103 than 1000) indicates the presence of significant multicollinearity.
104 
105 
106 @input
107 
108 The training data is expected to be of the following form:\n
109 <pre>{TABLE|VIEW} <em>sourceName</em> (
110  <em>inputTable</em> VARCHAR,
111  <em>outputTable</em> VARCHAR,
112  <em>dependentVariable</em> VARCHAR,
113  <em>independentVariable</em> VARCHAR,
114  [<em>rightCensoringStatus</em> VARCHAR]
115 )</pre>
116 Note: Dependent Variables refer to the time of death. There is no need to
117 pre-sort the data.
118 
119 NOTE2:'right_censoring_status' is set to TRUE to if the observation is not censored
120 and 'FALSE' if the observation is censored. The default value for
121 'right_censoring_status' is TRUE for all observatoions
122 @usage
123 
124 
125 <b> The Full Interface</b>
126 
127 <pre>
128 SELECT madlib.\ref cox_prop_hazards(
129  <em>'source_table'</em>, -- name of input table, VARCHAR
130  <em>'out_table'</em>, -- name of output table, VARCHAR
131  <em>'dependent_varname'</em>, -- name of dependent variable, VARCHAR
132  <em>'independent_varname'</em>, -- name of independent variable, VARCHAR
133  [<em>'right_censoring_status'</em>, -- name of column with right censoring status, VARCHAR (OPTIONAL, default=True)
134 
135 );
136 </pre>
137 
138 Here the <em>'right_censoring_status'</em> can be the name of a column, which contains
139 array of boolean values. It can also have a format of string 'dependent_variable < 10',
140 where <em>x1</em>, <em>x2</em> and <em>x3</em> are all column names.
141 
142 Here the <em>'independent_varname'</em> can be the name of a column, which contains
143 array of numeric values. It can also have a format of string 'array[1, x1, x2, x3]',
144 where <em>x1</em>, <em>x2</em> and <em>x3</em> are all column names.
145 
146 
147 Output is stored in the <em>out_table</em>:
148 <pre>
149 [ coef | std_err | stats | p_values |
150 +------+---------+-------+----------+
151 </pre>
152 
153 
154 -# For function summary information. Run
155 @verbatim
156 sql> select cox_prop_hazards('help');
157 OR
158 sql> select cox_prop_hazards();
159 OR
160 sql> select cox_prop_hazards('?');
161 @endverbatim
162 
163 -# For function usage information.
164 @verbatim
165 sql> select cox_prop_hazards('usage');
166 @endverbatim
167 
168 
169 Note: The function cox_prop_hazards_regr has been deprecated but maintained
170 @examp
171 
172 -# Create the sample data set:
173 @verbatim
174 sql> SELECT * FROM data;
175  val | time | status
176 ------------|--------------
177  {0,1.95} | 35 | t
178  {0,2.20} | 28 | t
179  {1,1.45} | 32 | t
180  {1,5.25} | 31 | t
181  {1,0.38} | 21 | t
182 ...
183 @endverbatim
184 -# Run the cox regression function:
185 @verbatim
186 sql> SELECT * FROM cox_prop_hazards('data', 'result_table', 'val', 'time', 'status');
187 sql> SELECT * from result_table;
188 --------------|--------------------------------------------------------------
189 coef | {0.881089349817059,-0.0756817768938055}
190 std_err | {1.16954914708414,0.338426252282655}
191 z_stats | {0.753356711368689,-0.223628410729811}
192 p_values | {0.451235588326831,0.823046454908087}
193 
194 @endverbatim
195 
196 @literature
197 
198 A somewhat random selection of nice write-ups, with valuable pointers into
199 further literature:
200 
201 [1] John Fox: Cox Proportional-Hazards Regression for Survival Data,
202  Appendix to An R and S-PLUS companion to Applied Regression Feb 2012,
203  http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf
204 
205 [2] Stephen J Walters: What is a Cox model?
206  http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf
207 
208 
209 @note Source and column names have to be passed as strings (due to limitations
210 of the SQL syntax).
211 
212 
213 @sa File cox_prop_hazards.sql_in (documenting the SQL functions)
214 
215 @internal
216 @sa Namespace cox_prop_hazards
217  \ref madlib::modules::stats documenting the implementation in C++
218 @endinternal
219 
220 */
221 
222 
223 DROP TYPE IF EXISTS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result;
224 CREATE TYPE MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS (
225  x DOUBLE PRECISION[],
226  status BOOLEAN,
227  exp_coef_x DOUBLE PRECISION,
228  x_exp_coef_x DOUBLE PRECISION[],
229  x_xTrans_exp_coef_x DOUBLE PRECISION[]
230 );
231 
232 
233 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.intermediate_cox_prop_hazards(
234  /*+ x */ DOUBLE PRECISION[],
235  /*+ status */ BOOLEAN,
236  /*+ coef */ DOUBLE PRECISION[])
237 RETURNS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS
238 'MODULE_PATHNAME'
239 LANGUAGE c IMMUTABLE;
240 
241 
242 
243 DROP TYPE IF EXISTS MADLIB_SCHEMA.cox_prop_hazards_result;
244 CREATE TYPE MADLIB_SCHEMA.cox_prop_hazards_result AS (
245  coef DOUBLE PRECISION[],
246  logLikelihood DOUBLE PRECISION,
247  std_err DOUBLE PRECISION[],
248  z_stats DOUBLE PRECISION[],
249  p_values DOUBLE PRECISION[],
250  num_iterations INTEGER
251 );
252 
253 
254 
255 
256 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_final(
257  state DOUBLE PRECISION[])
258 RETURNS DOUBLE PRECISION[]
259 AS 'MODULE_PATHNAME'
260 LANGUAGE C IMMUTABLE STRICT;
261 
262 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_transition(
263  /*+ state */ DOUBLE PRECISION[],
264  /*+ x */ DOUBLE PRECISION[],
265  /*+ y */ DOUBLE PRECISION,
266  /*+ status */ BOOLEAN,
267  /*+ exp_coef_x */ DOUBLE PRECISION,
268  /*+ xexp_coef_x */ DOUBLE PRECISION[],
269  /*+ x_xTrans_exp_coef_x */ DOUBLE PRECISION[],
270  /*+ previous_state */ DOUBLE PRECISION[])
271 RETURNS DOUBLE PRECISION[] AS
272 'MODULE_PATHNAME'
273 LANGUAGE C IMMUTABLE;
274 
275 
276 /**
277  * @internal
278  * @brief Perform one iteration the Newton-Rhapson method.
279  */
280 CREATE
281 m4_ifdef(`__GREENPLUM__',m4_ifdef(`__HAS_ORDERED_AGGREGATES__',`ORDERED'))
282 AGGREGATE MADLIB_SCHEMA.cox_prop_hazards_step(
283 
284  /*+ x */ DOUBLE PRECISION[],
285  /*+ y */ DOUBLE PRECISION,
286  /*+ status */ BOOLEAN,
287  /*+ exp_coef_x */ DOUBLE PRECISION,
288  /*+ xexp_coef_x */ DOUBLE PRECISION[],
289  /*+ x_xTrans_exp_coef_x */ DOUBLE PRECISION[],
290  /*+ previous_state */ DOUBLE PRECISION[]) (
291  STYPE=DOUBLE PRECISION[],
292  SFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_transition,
293  FINALFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_final,
294  INITCOND='{0,0,0,0,0,0,0}'
295 );
296 
297 
298 
299 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_step_distance(
300  /*+ state1 */ DOUBLE PRECISION[],
301  /*+ state2 */ DOUBLE PRECISION[])
302 RETURNS DOUBLE PRECISION AS
303 'MODULE_PATHNAME'
304 LANGUAGE c IMMUTABLE STRICT;
305 
306 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_result(
307  /*+ state */ DOUBLE PRECISION[])
308 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
309 'MODULE_PATHNAME'
310 LANGUAGE c IMMUTABLE STRICT;
311 
312 
313 -- We only need to document the last one (unfortunately, in Greenplum we have to
314 -- use function overloading instead of default arguments).
315 CREATE FUNCTION MADLIB_SCHEMA.compute_cox_prop_hazards_regr(
316  "source" VARCHAR,
317  "indepColumn" VARCHAR,
318  "depColumn" VARCHAR,
319  "status" VARCHAR,
320  "maxNumIterations" INTEGER,
321  "optimizer" VARCHAR,
322  "precision" DOUBLE PRECISION)
323 RETURNS INTEGER
324 AS $$PythonFunction(stats, cox_prop_hazards, compute_cox_prop_hazards)$$
325 LANGUAGE plpythonu VOLATILE;
326 
327 
328 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
329 /**
330  * @brief Compute cox-regression coefficients and diagnostic statistics
331  *
332  * To include an intercept in the model, set one coordinate in the
333  * <tt>independentVariables</tt> array to 1.
334  *
335  * @param source Name of the source relation containing the training data
336  * @param indepColumn Name of the independent column
337  * @param depColumn Name of the dependant column measuring time of death
338  * @param maxNumIterations The maximum number of iterations
339  * @param optimizer The optimizer to use (either
340  * <tt>'newton'</tt>/<tt>'newton'</tt> for the newton method
341  * @param precision The difference between log-likelihood values in successive
342  * iterations that should indicate convergence. Note that a non-positive
343  * value here disables the convergence criterion, and execution will only
344  * stop after \c maxNumIterations iterations.
345  *
346  * @return A composite value:
347  * - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol \beta \f$
348  * - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$l(\boldsymbol \beta)\f$
349  * - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
350  * \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
351  * - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
352  * - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
353  * - <tt>condition_no FLOAT8</tt> - The condition number of matrix
354  * \f$ H \f$ during the iteration immediately <em>preceding</em>
355  * convergence (i.e., \f$ H \f$ is computed using the coefficients of the
356  * previous iteration)
357  * - <tt>num_iterations INTEGER</tt> - The number of iterations before the
358  * algorithm terminated
359  *
360  * - Get vector of coefficients \f$ \boldsymbol \beta \f$ and all diagnostic
361  * statistics:\n
362  * <pre>SELECT * FROM \ref cox_prop_hazards_regr(
363  * '<em>sourceName</em>', '<em>dependentVariable</em>',
364  * '<em>independentVariables</em>'
365  * [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ]
366  * );</pre>
367  * - Get vector of coefficients \f$ \boldsymbol \beta \f$:\n
368  * <pre>SELECT (\ref cox_prop_hazards_regr('<em>sourceName</em>',
369  * '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre>
370  * - Get a subset of the output columns, e.g., only the array of coefficients
371  * \f$ \boldsymbol \beta \f$, the log-likelihood of determination:
372  * <pre>SELECT coef, log_likelihood
373  * FROM \ref cox_prop_hazards_regr('<em>sourceName</em>', '<em>dependentVariable</em>',
374  * '<em>independentVariables</em>');</pre>
375  */
376 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
377  "source" VARCHAR,
378  "indepColumn" VARCHAR,
379  "depColumn" VARCHAR,
380  "status" VARCHAR,
381  "maxNumIterations" INTEGER /*+ DEFAULT 20 */,
382  "optimizer" VARCHAR /*+ DEFAULT 'newton' */,
383  "precision" DOUBLE PRECISION /*+ DEFAULT 0.0001 */)
384 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS $$
385 DECLARE
386  theIteration INTEGER;
387  fnName VARCHAR;
388  theResult MADLIB_SCHEMA.cox_prop_hazards_result;
389 BEGIN
390  theIteration := (
391  SELECT MADLIB_SCHEMA.compute_cox_prop_hazards_regr($1, $2, $3, $4, $5, $6, $7)
392  );
393  IF optimizer = 'newton' THEN
394  fnName := 'internal_cox_prop_hazards_result';
395  ELSE
396  RAISE EXCEPTION 'Unknown optimizer (''%'')', optimizer;
397  END IF;
398  EXECUTE
399  $sql$
400  SELECT (result).*
401  FROM (
402  SELECT
403  MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result
404  FROM _madlib_iterative_alg
405  WHERE _madlib_iteration = $sql$ || theIteration || $sql$
406  ) subq
407  $sql$
408  INTO theResult;
409 
410  -- The number of iterations are not updated in the C++ code. We do it here.
411  IF NOT (theResult IS NULL) THEN
412  theResult.num_iterations = theIteration;
413  END IF;
414  RETURN theResult;
415 END;
416 $$ LANGUAGE plpgsql VOLATILE;
417 
418 
419 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
420 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
421  "source" VARCHAR,
422  "indepColumn" VARCHAR,
423  "depColumn" VARCHAR,
424  "status" VARCHAR)
425 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
426 $$SELECT MADLIB_SCHEMA.cox_prop_hazards_regr($1, $2, $3, $4, 20, 'newton', 0.0001);$$
427 LANGUAGE sql VOLATILE;
428 
429 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
430 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
431  "source" VARCHAR,
432  "indepColumn" VARCHAR,
433  "depColumn" VARCHAR,
434  "status" VARCHAR,
435  "maxNumIterations" INTEGER)
436 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
437 $$SELECT MADLIB_SCHEMA.cox_prop_hazards_regr($1, $2, $3, $4, $5, 'newton', 0.0001);$$
438 LANGUAGE sql VOLATILE;
439 
440 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
441 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
442  "source" VARCHAR,
443  "indepColumn" VARCHAR,
444  "depColumn" VARCHAR,
445  "status" VARCHAR,
446  "maxNumIterations" INTEGER,
447  "optimizer" VARCHAR)
448 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
449 $$SELECT MADLIB_SCHEMA.cox_prop_hazards_regr($1, $2, $3, $4, $5, $6, 0.0001);$$
450 LANGUAGE sql VOLATILE;
451 
452 
453 /**
454  * @brief Cox regresison training function
455  **/
456 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
457  source_table VARCHAR -- name of input table
458  , out_table VARCHAR -- name of output table
459  , dependent_varname VARCHAR -- name of dependent variable
460  , independent_varname VARCHAR -- name of independent variable
461  , status VARCHAR -- censoring status
462  )
463 RETURNS VOID AS $$
464 DECLARE
465 insert_string VARCHAR;
466 cox_rst MADLIB_SCHEMA.cox_prop_hazards_result;
467 old_msg_level TEXT;
468 BEGIN
469  EXECUTE 'SELECT setting FROM pg_settings WHERE name=''client_min_messages''' INTO old_msg_level;
470  EXECUTE 'SET client_min_messages TO warning';
471  IF (source_table IS NULL OR source_table = '') THEN
472  RAISE EXCEPTION 'Invalid input table name given.';
473  END IF;
474  IF (NOT MADLIB_SCHEMA.__table_exists(source_table)) THEN
475  RAISE EXCEPTION 'Input table name does not exist.';
476  END IF;
477  IF (MADLIB_SCHEMA.__table_exists(out_table)) THEN
478  RAISE EXCEPTION 'Output table name already exists. Drop the table before calling the function.';
479  END IF;
480  IF (out_table IS NULL OR out_table = '') THEN
481  RAISE EXCEPTION 'Invalid output table name given.';
482  END IF;
483  IF (dependent_varname IS NULL OR dependent_varname = '') THEN
484  RAISE EXCEPTION 'Invalid dependent variable name given.';
485  END IF;
486  IF (independent_varname IS NULL OR independent_varname = '') THEN
487  RAISE EXCEPTION 'Invalid independent variable name given.';
488  END IF;
489  IF (status IS NULL OR independent_varname = '') THEN
490  RAISE EXCEPTION 'Invalid status name given.';
491  END IF;
492  -- create output table with appropriate column names
493  EXECUTE 'DROP TABLE IF EXISTS ' || out_table;
494  EXECUTE '
495  CREATE TABLE ' || out_table || ' (
496  coef DOUBLE PRECISION[],
497  std_err DOUBLE PRECISION[],
498  z_stats DOUBLE PRECISION[],
499  p_values DOUBLE PRECISION[])';
500  -- compute cox regression results
501  cox_rst := MADLIB_SCHEMA.__internal_get_cox_prop_hazards_result(
502  source_table, dependent_varname, independent_varname, status);
503  insert_string := MADLIB_SCHEMA.__internal_get_cox_prop_hazards_insert_string(
504  cox_rst, out_table);
505  -- Ensure Infinity and NaN are cast properly
506  insert_string := REGEXP_REPLACE(insert_string, 'Infinity',
507  '''Infinity''::double precision', 'gi');
508  insert_string := REGEXP_REPLACE(insert_string, 'NaN',
509  '''NaN''::double precision', 'gi');
510  EXECUTE insert_string;
511  EXECUTE 'SET client_min_messages TO '|| old_msg_level;
512 END;
513 $$ LANGUAGE plpgsql VOLATILE;
514 
515 
516 
517 /**
518  * @brief Return cox regression output for source data
519  *
520 **/
521 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_cox_prop_hazards_result(
522  source_table VARCHAR -- name of input table
523  , dependent_varname VARCHAR -- name of dependent variable
524  , independent_varname VARCHAR -- name of independent variable
525  , status VARCHAR -- name of status variable
526 )
527 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS $$
528 DECLARE
529 cox_rst MADLIB_SCHEMA.cox_prop_hazards_result;
530 BEGIN
531  EXECUTE '
532  SELECT (MADLIB_SCHEMA.cox_prop_hazards_regr('
533  || '''' || source_table || ''', '
534  || '''' || independent_varname || ''', '
535  || '''' || dependent_varname || ''', '
536  || '''' || status || ''')).* '
537  INTO cox_rst;
538  RETURN cox_rst;
539 END;
540 $$ LANGUAGE plpgsql VOLATILE;
541 
542 
543 /**
544  * @brief Return cox regression output for insert string
545  *
546 **/
547 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_cox_prop_hazards_insert_string(
548  cox_rst MADLIB_SCHEMA.cox_prop_hazards_result,
549  out_table TEXT
550 )
551 RETURNS VARCHAR AS $$
552 DECLARE
553  insert_string VARCHAR;
554 BEGIN
555  insert_string := 'INSERT INTO ' || out_table || ' VALUES (';
556  insert_string := insert_string ||
557  CASE
558  WHEN (cox_rst).coef is NULL
559  THEN '''{}'','
560  ELSE 'ARRAY[' || array_to_string((cox_rst).coef, ',') || '], '
561  END ||
562  CASE
563  WHEN (cox_rst).std_err is NULL
564  THEN '''{}'','
565  ELSE 'ARRAY[' || array_to_string((cox_rst).std_err, ',') || '], '
566  END ||
567  CASE
568  WHEN (cox_rst).z_stats is NULL
569  THEN '''{}'','
570  ELSE 'ARRAY[' || array_to_string((cox_rst).z_stats, ',') || '], '
571  END ||
572  CASE
573  WHEN (cox_rst).p_values is NULL
574  THEN '''{}'','
575  ELSE 'ARRAY[' || array_to_string((cox_rst).p_values, ',') || ']) '
576  END;
577  RETURN insert_string;
578 END;
579 $$ LANGUAGE plpgsql VOLATILE;
580 
581 
582 /**
583  * @brief Cox regresison training function
584  **/
585 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
586  source_table VARCHAR -- name of input table
587  , out_table VARCHAR -- name of output table
588  , dependent_variable VARCHAR -- name of dependent variable
589  , independent_variable VARCHAR -- name of independent variable
590  )
591 RETURNS VOID AS $$
592 BEGIN
593  EXECUTE 'SELECT MADLIB_SCHEMA.cox_prop_hazards(' ||
594  ' ''' || source_table || ''' ' ||
595  ' ,''' || out_table || ''' ' ||
596  ' ,''' || dependent_variable || ''' ' ||
597  ' ,''' || independent_variable || ''' ' ||
598  ' , ''TRUE'')';
599 END;
600 $$ LANGUAGE plpgsql VOLATILE;
601 
602 
603 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards()
604 RETURNS VARCHAR AS $$
605 BEGIN
606  RETURN MADLIB_SCHEMA.cox_prop_hazards('');
607 END;
608 $$ LANGUAGE plpgsql VOLATILE;
609 
610 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
611  usage_string VARCHAR -- usage string
612 )
613 RETURNS VARCHAR AS $$
614 DECLARE
615 insert_string VARCHAR;
616 BEGIN
617  IF (usage_string = '' OR usage_string = 'help' OR usage_string = '?') THEN
618  insert_string := '' ||
619  E'Summary \n' ||
620  E'-----------------------------------------------------------------------------------------\n' ||
621  E' Functionality: Cox proprtional hazards regression (Breslow method)\n' ||
622  E' SELECT {schema_madlib}.cox_prop_hazards(''source_table'' \n' ||
623  E' ,''output_table'' \n' ||
624  E' ,''dependent_variable'' \n' ||
625  E' ,''independent_variable'' \n' ||
626  E' ,''right_censoring_status'' \n' ||
627  E' );\n' ||
628  E'For more details on function usage: \n' ||
629  E'SELECT {schema_madlib}.cox_prop_hazards(''usage'') \n' ||
630  E'';
631  ElSIF (usage_string = 'usage') THEN
632  insert_string := '' ||
633  E'Usage\n' ||
634  E'-----------------------------------------------------------------------------------------\n' ||
635  E' SELECT {schema_madlib}.cox_prop_hazards( \n' ||
636  E' ''source_table'', -- Name of data table \n' ||
637  E' ''output_table'', -- Name of result table (overwrites if exists) \n' ||
638  E' ''dependent_variable'', -- Name of column for dependent variables\n' ||
639  E' ''independent_variable'', -- Name of column for independent variables\n' ||
640  E' (can be any SQL expression Eg: ''*'') \n' ||
641  E' [''right_censoring_status'', -- Name of the column containing censoring status \n' ||
642  E' 0/false : If the observation is censored\n' ||
643  E' 1/true : otherwise\n' ||
644  E' Default is 1/true for all observations\n' ||
645  E' Can also be an SQL expression: ''dependent_variable < 10'') \n' ||
646  E' );\n' ||
647  E'\n' ||
648  E'Output:\n' ||
649  E'-----------------------------------------------------------------------------------------\n' ||
650  E' The output table (''output_table'' above) has the following columns\n' ||
651  E' ''coef'' DOUBLE PRECISION[], -- Coefficients of regression \n' ||
652  E' ''std_err'' DOUBLE PRECISION[], -- Standard errors\n' ||
653  E' ''z_stats'' DOUBLE PRECISION[], -- z-stats of the standard errors\n' ||
654  E' ''p_values'' DOUBLE PRECISION[], -- p-values of the standard errors\n' ||
655  E'\n' ||
656  E'';
657  ELSE
658  insert_string := 'No such option. Run SELECT {schema_madlib}.cox_prop_hazards()';
659  END IF;
660  RETURN insert_string;
661 END;
662 $$ LANGUAGE plpgsql VOLATILE;