User Documentation
 All Files Functions Groups
cox_prop_hazards.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file cox_prop_hazards.sql_in
4  *
5  * @brief SQL functions for cox proportional hazards
6  * @date July 2012
7  *
8  * @sa For a brief introduction to cox regression, see the
9  * module description \ref grp_cox_prop_hazards
10  *
11  *//* ----------------------------------------------------------------------- */
12 
13 m4_include(`SQLCommon.m4')
14 
15 /**
16 @addtogroup grp_cox_prop_hazards
17 
18 @about
19 Proportional-Hazard models enable the comparison of various survival models.
20 These survival models are functions describing the probability of an one-item
21 event (prototypically, this event is death) with respect to time.
22 The interval of time before death occurs is the survival time.
23 Let T be a random variable representing the survival time,
24 with a cumulative probability function P(t). Informally, P(t) is
25 the probability that death has happened before time t.
26 
27 Generally, applications start with a list of \f$ \boldsymbol n \f$ observations,
28 each with \f$ \boldsymbol m \f$ covariates and a time of death. From this
29 \f$ \boldsymbol n \times m \f$ matrix, we would like to derive the correlation
30 between the covariates and the hazard function. This amounts to finding
31 the parameters \f$ \boldsymbol \beta \f$ that best fit the model described below.
32 
33 Let us define:
34 - \f$ \boldsymbol t \in \mathbf R^{m} \f$ denote the vector of observed dependent
35  variables, with \f$ n \f$ rows.
36 - \f$ X \in \mathbf R^{m} \f$ denote the design matrix with \f$ m \f$
37  columns and \f$ n \f$ rows, containing all observed vectors of independent
38  variables \f$ \boldsymbol x_i \f$ as rows.
39 - \f$ R(t_i) \f$ denote the set of observations still alive at time \f$ t_i \f$
40 
41 Note that this model <b>does not</b> include a <b>constant term</b>, and the data
42 cannot contain a column of 1s.
43 
44 By definition,
45 \f[
46  P[T_k = t_i | \boldsymbol R(t_i)]
47  = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}.
48  \,.
49 \f]
50 
51 The <b>partial likelihood </b>function can now be generated as the product of
52 conditional probabilities:
53 \f[
54 \mathcal L = \prod_{i = 1}^n
55  \left(
56  \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}
57  \right).
58 \f]
59 
60 The log-likelihood form of this equation is
61 \f[
62 L = \sum_{i = 1}^n
63  \left[ \beta^T x_i
64  - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right)
65  \right].
66 \f]
67 
68 Using this score function and Hessian matrix, the partial likelihood can be
69 maximized using the <b> Newton-Raphson algorithm </b>.<b> Breslow's method </b>
70 is used to resolved tied times of deaths. The time of death for two records are
71 considered "equal" if they differ by less than 1.0e-6
72 
73 The inverse of the Hessian matrix, evaluated at the estimate of
74 \f$ \boldsymbol \beta \f$, can be used as an <b>approximate variance-covariance
75 matrix </b> for the estimate, and used to produce approximate
76 <b>standard errors</b> for the regression coefficients.
77 
78 \f[
79  \mathit{se}(c_i) = \left( (H)^{-1} \right)_{ii}
80  \,.
81 \f]
82 The Wald z-statistic is
83 \f[
84  z_i = \frac{c_i}{\mathit{se}(c_i)}
85  \,.
86 \f]
87 
88 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
89 the assumptions inherent in the Wald test) of seeing a value at least as extreme
90 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
91 true. Letting \f$ F \f$ denote the cumulative density function of a standard
92 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is
93 therefore
94 \f[
95  p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
96 \f]
97 where \f$ Z \f$ is a standard normally distributed random variable.
98 
99 
100 The condition number is computed as \f$ \kappa(H) \f$ during the iteration
101 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
102 the coefficients of the previous iteration). A large condition number (say, more
103 than 1000) indicates the presence of significant multicollinearity.
104 
105 
106 @input
107 
108 The training data is expected to be of the following form:\n
109 <pre>{TABLE|VIEW} <em>sourceName</em> (
110  <em>inputTable</em> VARCHAR,
111  <em>outputTable</em> VARCHAR,
112  <em>dependentVariable</em> VARCHAR,
113  <em>independentVariable</em> VARCHAR,
114  [<em>rightCensoringStatus</em> VARCHAR]
115 )</pre>
116 Note: Dependent Variables refer to the time of death. There is no need to
117 pre-sort the data.
118 
119 NOTE2:'right_censoring_status' is set to TRUE to if the observation is not censored
120 and 'FALSE' if the observation is censored. The default value for
121 'right_censoring_status' is TRUE for all observatoions
122 @usage
123 
124 
125 <b> The Full Interface</b>
126 
127 <pre>
128 SELECT madlib.\ref cox_prop_hazards(
129  <em>'source_table'</em>, -- name of input table, VARCHAR
130  <em>'out_table'</em>, -- name of output table, VARCHAR
131  <em>'dependent_varname'</em>, -- name of dependent variable, VARCHAR
132  <em>'independent_varname'</em>, -- name of independent variable, VARCHAR
133  [<em>'right_censoring_status'</em>, -- name of column with right censoring status, VARCHAR (OPTIONAL, default=True)
134 
135 );
136 </pre>
137 
138 Here the <em>'right_censoring_status'</em> can be the name of a column, which contains
139 array of boolean values. It can also have a format of string 'dependent_variable < 10',
140 where <em>x1</em>, <em>x2</em> and <em>x3</em> are all column names.
141 
142 Here the <em>'independent_varname'</em> can be the name of a column, which contains
143 array of numeric values. It can also have a format of string 'array[1, x1, x2, x3]',
144 where <em>x1</em>, <em>x2</em> and <em>x3</em> are all column names.
145 
146 
147 Output is stored in the <em>out_table</em>:
148 <pre>
149 [ coef | std_err | stats | p_values |
150 +------+---------+-------+----------+
151 </pre>
152 
153 
154 -# For function summary information. Run
155 @verbatim
156 sql> select cox_prop_hazards('help');
157 OR
158 sql> select cox_prop_hazards();
159 OR
160 sql> select cox_prop_hazards('?');
161 @endverbatim
162 
163 -# For function usage information.
164 @verbatim
165 sql> select cox_prop_hazards('usage');
166 @endverbatim
167 
168 
169 Note: The function cox_prop_hazards_regr has been deprecated but maintained
170 @examp
171 
172 -# Create the sample data set:
173 @verbatim
174 sql> SELECT * FROM data;
175  val | time | status
176 ------------|--------------
177  {0,1.95} | 35 | t
178  {0,2.20} | 28 | t
179  {1,1.45} | 32 | t
180  {1,5.25} | 31 | t
181  {1,0.38} | 21 | t
182 ...
183 @endverbatim
184 -# Run the cox regression function:
185 @verbatim
186 sql> SELECT * FROM cox_prop_hazards('data', 'result_table', 'val', 'time', 'status');
187 sql> SELECT * from result_table;
188 --------------|--------------------------------------------------------------
189 coef | {0.881089349817059,-0.0756817768938055}
190 std_err | {1.16954914708414,0.338426252282655}
191 z_stats | {0.753356711368689,-0.223628410729811}
192 p_values | {0.451235588326831,0.823046454908087}
193 
194 @endverbatim
195 
196 @literature
197 
198 A somewhat random selection of nice write-ups, with valuable pointers into
199 further literature:
200 
201 [1] John Fox: Cox Proportional-Hazards Regression for Survival Data,
202  Appendix to An R and S-PLUS companion to Applied Regression Feb 2012,
203  http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf
204 
205 [2] Stephen J Walters: What is a Cox model?
206  http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf
207 
208 
209 @note Source and column names have to be passed as strings (due to limitations
210 of the SQL syntax).
211 
212 
213 @sa File cox_prop_hazards.sql_in (documenting the SQL functions)
214 
215 @internal
216 @sa Namespace cox_prop_hazards
217  \ref madlib::modules::stats documenting the implementation in C++
218 @endinternal
219 
220 */
221 
222 
223 DROP TYPE IF EXISTS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result;
224 CREATE TYPE MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS (
225  x DOUBLE PRECISION[],
226  status BOOLEAN,
227  exp_coef_x DOUBLE PRECISION,
228  x_exp_coef_x DOUBLE PRECISION[],
229  x_xTrans_exp_coef_x DOUBLE PRECISION[]
230 );
231 
232 
233 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.intermediate_cox_prop_hazards(
234  /*+ x */ DOUBLE PRECISION[],
235  /*+ status */ BOOLEAN,
236  /*+ coef */ DOUBLE PRECISION[])
237 RETURNS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS
238 'MODULE_PATHNAME'
239 LANGUAGE c IMMUTABLE;
240 
241 
242 
243 DROP TYPE IF EXISTS MADLIB_SCHEMA.cox_prop_hazards_result;
244 CREATE TYPE MADLIB_SCHEMA.cox_prop_hazards_result AS (
245  coef DOUBLE PRECISION[],
246  logLikelihood DOUBLE PRECISION,
247  std_err DOUBLE PRECISION[],
248  z_stats DOUBLE PRECISION[],
249  p_values DOUBLE PRECISION[],
250  num_iterations INTEGER
251 );
252 
253 
254 
255 
256 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_final(
257  state DOUBLE PRECISION[])
258 RETURNS DOUBLE PRECISION[]
259 AS 'MODULE_PATHNAME'
260 LANGUAGE C IMMUTABLE STRICT;
261 
262 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_transition(
263  /*+ state */ DOUBLE PRECISION[],
264  /*+ x */ DOUBLE PRECISION[],
265  /*+ y */ DOUBLE PRECISION,
266  /*+ status */ BOOLEAN,
267  /*+ exp_coef_x */ DOUBLE PRECISION,
268  /*+ xexp_coef_x */ DOUBLE PRECISION[],
269  /*+ x_xTrans_exp_coef_x */ DOUBLE PRECISION[],
270  /*+ previous_state */ DOUBLE PRECISION[])
271 RETURNS DOUBLE PRECISION[] AS
272 'MODULE_PATHNAME'
273 LANGUAGE C IMMUTABLE;
274 
275 
276 /**
277  * @internal
278  * @brief Perform one iteration the Newton-Rhapson method.
279  */
280 CREATE
281 m4_ifdef(`__GREENPLUM__',m4_ifdef(`__HAS_ORDERED_AGGREGATES__',`ORDERED'))
282 AGGREGATE MADLIB_SCHEMA.cox_prop_hazards_step(
283 
284  /*+ x */ DOUBLE PRECISION[],
285  /*+ y */ DOUBLE PRECISION,
286  /*+ status */ BOOLEAN,
287  /*+ exp_coef_x */ DOUBLE PRECISION,
288  /*+ xexp_coef_x */ DOUBLE PRECISION[],
289  /*+ x_xTrans_exp_coef_x */ DOUBLE PRECISION[],
290  /*+ previous_state */ DOUBLE PRECISION[]) (
291  STYPE=DOUBLE PRECISION[],
292  SFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_transition,
293  FINALFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_final,
294  INITCOND='{0,0,0,0,0,0,0}'
295 );
296 
297 
298 
299 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_step_distance(
300  /*+ state1 */ DOUBLE PRECISION[],
301  /*+ state2 */ DOUBLE PRECISION[])
302 RETURNS DOUBLE PRECISION AS
303 'MODULE_PATHNAME'
304 LANGUAGE c IMMUTABLE STRICT;
305 
306 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_result(
307  /*+ state */ DOUBLE PRECISION[])
308 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
309 'MODULE_PATHNAME'
310 LANGUAGE c IMMUTABLE STRICT;
311 
312 
313 -- We only need to document the last one (unfortunately, in Greenplum we have to
314 -- use function overloading instead of default arguments).
315 CREATE FUNCTION MADLIB_SCHEMA.compute_cox_prop_hazards_regr(
316  "source" VARCHAR,
317  "indepColumn" VARCHAR,
318  "depColumn" VARCHAR,
319  "status" VARCHAR,
320  "maxNumIterations" INTEGER,
321  "optimizer" VARCHAR,
322  "precision" DOUBLE PRECISION)
323 RETURNS INTEGER
324 AS $$PythonFunction(stats, cox_prop_hazards, compute_cox_prop_hazards)$$
325 LANGUAGE plpythonu VOLATILE;
326 
327 
328 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
329 /**
330  * @brief Compute cox-regression coefficients and diagnostic statistics
331  *
332  * To include an intercept in the model, set one coordinate in the
333  * <tt>independentVariables</tt> array to 1.
334  *
335  * @param source Name of the source relation containing the training data
336  * @param indepColumn Name of the independent column
337  * @param depColumn Name of the dependent column measuring time of death
338  * @param status Name of the column that determines right censoring support
339  * @param maxNumIterations The maximum number of iterations
340  * @param optimizer The optimizer to use (either
341  * <tt>'newton'</tt>/<tt>'newton'</tt> for the newton method
342  * @param precision The difference between log-likelihood values in successive
343  * iterations that should indicate convergence. Note that a non-positive
344  * value here disables the convergence criterion, and execution will only
345  * stop after \c maxNumIterations iterations.
346  *
347  * @return A composite value:
348  * - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol \beta \f$
349  * - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$l(\boldsymbol \beta)\f$
350  * - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
351  * \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
352  * - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
353  * - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
354  * - <tt>condition_no FLOAT8</tt> - The condition number of matrix
355  * \f$ H \f$ during the iteration immediately <em>preceding</em>
356  * convergence (i.e., \f$ H \f$ is computed using the coefficients of the
357  * previous iteration)
358  * - <tt>num_iterations INTEGER</tt> - The number of iterations before the
359  * algorithm terminated
360  *
361  * - Get vector of coefficients \f$ \boldsymbol \beta \f$ and all diagnostic
362  * statistics:\n
363  * <pre>SELECT * FROM \ref cox_prop_hazards_regr(
364  * '<em>sourceName</em>', '<em>dependentVariable</em>',
365  * '<em>independentVariables</em>'
366  * [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ]
367  * );</pre>
368  * - Get vector of coefficients \f$ \boldsymbol \beta \f$:\n
369  * <pre>SELECT (\ref cox_prop_hazards_regr('<em>sourceName</em>',
370  * '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre>
371  * - Get a subset of the output columns, e.g., only the array of coefficients
372  * \f$ \boldsymbol \beta \f$, the log-likelihood of determination:
373  * <pre>SELECT coef, log_likelihood
374  * FROM \ref cox_prop_hazards_regr('<em>sourceName</em>', '<em>dependentVariable</em>',
375  * '<em>independentVariables</em>');</pre>
376  */
377 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
378  "source" VARCHAR,
379  "indepColumn" VARCHAR,
380  "depColumn" VARCHAR,
381  "status" VARCHAR,
382  "maxNumIterations" INTEGER /*+ DEFAULT 20 */,
383  "optimizer" VARCHAR /*+ DEFAULT 'newton' */,
384  "precision" DOUBLE PRECISION /*+ DEFAULT 0.0001 */)
385 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS $$
386 DECLARE
387  theIteration INTEGER;
388  fnName VARCHAR;
389  theResult MADLIB_SCHEMA.cox_prop_hazards_result;
390 BEGIN
391  theIteration := (
392  SELECT MADLIB_SCHEMA.compute_cox_prop_hazards_regr($1, $2, $3, $4, $5, $6, $7)
393  );
394  IF optimizer = 'newton' THEN
395  fnName := 'internal_cox_prop_hazards_result';
396  ELSE
397  RAISE EXCEPTION 'Unknown optimizer (''%'')', optimizer;
398  END IF;
399  EXECUTE
400  $sql$
401  SELECT (result).*
402  FROM (
403  SELECT
404  MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result
405  FROM _madlib_iterative_alg
406  WHERE _madlib_iteration = $sql$ || theIteration || $sql$
407  ) subq
408  $sql$
409  INTO theResult;
410 
411  -- The number of iterations are not updated in the C++ code. We do it here.
412  IF NOT (theResult IS NULL) THEN
413  theResult.num_iterations = theIteration;
414  END IF;
415  RETURN theResult;
416 END;
417 $$ LANGUAGE plpgsql VOLATILE;
418 
419 
420 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
421 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
422  "source" VARCHAR,
423  "indepColumn" VARCHAR,
424  "depColumn" VARCHAR,
425  "status" VARCHAR)
426 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
427 $$SELECT MADLIB_SCHEMA.cox_prop_hazards_regr($1, $2, $3, $4, 20, 'newton', 0.0001);$$
428 LANGUAGE sql VOLATILE;
429 
430 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
431 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
432  "source" VARCHAR,
433  "indepColumn" VARCHAR,
434  "depColumn" VARCHAR,
435  "status" VARCHAR,
436  "maxNumIterations" INTEGER)
437 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
438 $$SELECT MADLIB_SCHEMA.cox_prop_hazards_regr($1, $2, $3, $4, $5, 'newton', 0.0001);$$
439 LANGUAGE sql VOLATILE;
440 
441 -- Note: The function cox_prop_hazards_regr has been deprecated but maintained
442 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_regr(
443  "source" VARCHAR,
444  "indepColumn" VARCHAR,
445  "depColumn" VARCHAR,
446  "status" VARCHAR,
447  "maxNumIterations" INTEGER,
448  "optimizer" VARCHAR)
449 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
450 $$SELECT MADLIB_SCHEMA.cox_prop_hazards_regr($1, $2, $3, $4, $5, $6, 0.0001);$$
451 LANGUAGE sql VOLATILE;
452 
453 
454 /**
455  * @brief Cox regresison training function
456  **/
457 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
458  source_table VARCHAR -- name of input table
459  , out_table VARCHAR -- name of output table
460  , dependent_varname VARCHAR -- name of dependent variable
461  , independent_varname VARCHAR -- name of independent variable
462  , status VARCHAR -- censoring status
463  )
464 RETURNS VOID AS $$
465 DECLARE
466 insert_string VARCHAR;
467 cox_rst MADLIB_SCHEMA.cox_prop_hazards_result;
468 old_msg_level TEXT;
469 BEGIN
470  EXECUTE 'SELECT setting FROM pg_settings WHERE name=''client_min_messages''' INTO old_msg_level;
471  EXECUTE 'SET client_min_messages TO warning';
472  IF (source_table IS NULL OR source_table = '') THEN
473  RAISE EXCEPTION 'Invalid input table name given.';
474  END IF;
475  IF (NOT MADLIB_SCHEMA.__table_exists(source_table)) THEN
476  RAISE EXCEPTION 'Input table name does not exist.';
477  END IF;
478  IF (MADLIB_SCHEMA.__table_exists(out_table)) THEN
479  RAISE EXCEPTION 'Output table name already exists. Drop the table before calling the function.';
480  END IF;
481  IF (out_table IS NULL OR out_table = '') THEN
482  RAISE EXCEPTION 'Invalid output table name given.';
483  END IF;
484  IF (dependent_varname IS NULL OR dependent_varname = '') THEN
485  RAISE EXCEPTION 'Invalid dependent variable name given.';
486  END IF;
487  IF (independent_varname IS NULL OR independent_varname = '') THEN
488  RAISE EXCEPTION 'Invalid independent variable name given.';
489  END IF;
490  IF (status IS NULL OR independent_varname = '') THEN
491  RAISE EXCEPTION 'Invalid status name given.';
492  END IF;
493  -- create output table with appropriate column names
494  EXECUTE 'DROP TABLE IF EXISTS ' || out_table;
495  EXECUTE '
496  CREATE TABLE ' || out_table || ' (
497  coef DOUBLE PRECISION[],
498  std_err DOUBLE PRECISION[],
499  z_stats DOUBLE PRECISION[],
500  p_values DOUBLE PRECISION[])';
501  -- compute cox regression results
502  cox_rst := MADLIB_SCHEMA.__internal_get_cox_prop_hazards_result(
503  source_table, dependent_varname, independent_varname, status);
504  insert_string := MADLIB_SCHEMA.__internal_get_cox_prop_hazards_insert_string(
505  cox_rst, out_table);
506  -- Ensure Infinity and NaN are cast properly
507  insert_string := REGEXP_REPLACE(insert_string, 'Infinity',
508  '''Infinity''::double precision', 'gi');
509  insert_string := REGEXP_REPLACE(insert_string, 'NaN',
510  '''NaN''::double precision', 'gi');
511  EXECUTE insert_string;
512  EXECUTE 'SET client_min_messages TO '|| old_msg_level;
513 END;
514 $$ LANGUAGE plpgsql VOLATILE;
515 
516 
517 
518 /**
519  * @brief Return cox regression output for source data
520  *
521 **/
522 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_cox_prop_hazards_result(
523  source_table VARCHAR -- name of input table
524  , dependent_varname VARCHAR -- name of dependent variable
525  , independent_varname VARCHAR -- name of independent variable
526  , status VARCHAR -- name of status variable
527 )
528 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS $$
529 DECLARE
530 cox_rst MADLIB_SCHEMA.cox_prop_hazards_result;
531 BEGIN
532  EXECUTE '
533  SELECT (MADLIB_SCHEMA.cox_prop_hazards_regr('
534  || '''' || source_table || ''', '
535  || '''' || independent_varname || ''', '
536  || '''' || dependent_varname || ''', '
537  || '''' || status || ''')).* '
538  INTO cox_rst;
539  RETURN cox_rst;
540 END;
541 $$ LANGUAGE plpgsql VOLATILE;
542 
543 
544 /**
545  * @brief Return cox regression output for insert string
546  *
547 **/
548 CREATE FUNCTION MADLIB_SCHEMA.__internal_get_cox_prop_hazards_insert_string(
549  cox_rst MADLIB_SCHEMA.cox_prop_hazards_result,
550  out_table TEXT
551 )
552 RETURNS VARCHAR AS $$
553 DECLARE
554  insert_string VARCHAR;
555 BEGIN
556  insert_string := 'INSERT INTO ' || out_table || ' VALUES (';
557  insert_string := insert_string ||
558  CASE
559  WHEN (cox_rst).coef is NULL
560  THEN '''{}'','
561  ELSE 'ARRAY[' || array_to_string((cox_rst).coef, ',') || '], '
562  END ||
563  CASE
564  WHEN (cox_rst).std_err is NULL
565  THEN '''{}'','
566  ELSE 'ARRAY[' || array_to_string((cox_rst).std_err, ',') || '], '
567  END ||
568  CASE
569  WHEN (cox_rst).z_stats is NULL
570  THEN '''{}'','
571  ELSE 'ARRAY[' || array_to_string((cox_rst).z_stats, ',') || '], '
572  END ||
573  CASE
574  WHEN (cox_rst).p_values is NULL
575  THEN '''{}'','
576  ELSE 'ARRAY[' || array_to_string((cox_rst).p_values, ',') || ']) '
577  END;
578  RETURN insert_string;
579 END;
580 $$ LANGUAGE plpgsql VOLATILE;
581 
582 
583 /**
584  * @brief Cox regresison training function
585  **/
586 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
587  source_table VARCHAR -- name of input table
588  , out_table VARCHAR -- name of output table
589  , dependent_variable VARCHAR -- name of dependent variable
590  , independent_variable VARCHAR -- name of independent variable
591  )
592 RETURNS VOID AS $$
593 BEGIN
594  EXECUTE 'SELECT MADLIB_SCHEMA.cox_prop_hazards(' ||
595  ' ''' || source_table || ''' ' ||
596  ' ,''' || out_table || ''' ' ||
597  ' ,''' || dependent_variable || ''' ' ||
598  ' ,''' || independent_variable || ''' ' ||
599  ' , ''TRUE'')';
600 END;
601 $$ LANGUAGE plpgsql VOLATILE;
602 
603 
604 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards()
605 RETURNS VARCHAR AS $$
606 BEGIN
607  RETURN MADLIB_SCHEMA.cox_prop_hazards('');
608 END;
609 $$ LANGUAGE plpgsql VOLATILE;
610 
611 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
612  usage_string VARCHAR -- usage string
613 )
614 RETURNS VARCHAR AS $$
615 DECLARE
616 insert_string VARCHAR;
617 BEGIN
618  IF (usage_string = '' OR usage_string = 'help' OR usage_string = '?') THEN
619  insert_string := '' ||
620  E'Summary \n' ||
621  E'-----------------------------------------------------------------------------------------\n' ||
622  E' Functionality: Cox proprtional hazards regression (Breslow method)\n' ||
623  E' SELECT {schema_madlib}.cox_prop_hazards(''source_table'' \n' ||
624  E' ,''output_table'' \n' ||
625  E' ,''dependent_variable'' \n' ||
626  E' ,''independent_variable'' \n' ||
627  E' ,''right_censoring_status'' \n' ||
628  E' );\n' ||
629  E'For more details on function usage: \n' ||
630  E'SELECT {schema_madlib}.cox_prop_hazards(''usage'') \n' ||
631  E'';
632  ElSIF (usage_string = 'usage') THEN
633  insert_string := '' ||
634  E'Usage\n' ||
635  E'-----------------------------------------------------------------------------------------\n' ||
636  E' SELECT {schema_madlib}.cox_prop_hazards( \n' ||
637  E' ''source_table'', -- Name of data table \n' ||
638  E' ''output_table'', -- Name of result table (overwrites if exists) \n' ||
639  E' ''dependent_variable'', -- Name of column for dependent variables\n' ||
640  E' ''independent_variable'', -- Name of column for independent variables\n' ||
641  E' (can be any SQL expression Eg: ''*'') \n' ||
642  E' [''right_censoring_status'', -- Name of the column containing censoring status \n' ||
643  E' 0/false : If the observation is censored\n' ||
644  E' 1/true : otherwise\n' ||
645  E' Default is 1/true for all observations\n' ||
646  E' Can also be an SQL expression: ''dependent_variable < 10'') \n' ||
647  E' );\n' ||
648  E'\n' ||
649  E'Output:\n' ||
650  E'-----------------------------------------------------------------------------------------\n' ||
651  E' The output table (''output_table'' above) has the following columns\n' ||
652  E' ''coef'' DOUBLE PRECISION[], -- Coefficients of regression \n' ||
653  E' ''std_err'' DOUBLE PRECISION[], -- Standard errors\n' ||
654  E' ''z_stats'' DOUBLE PRECISION[], -- z-stats of the standard errors\n' ||
655  E' ''p_values'' DOUBLE PRECISION[], -- p-values of the standard errors\n' ||
656  E'\n' ||
657  E'';
658  ELSE
659  insert_string := 'No such option. Run SELECT {schema_madlib}.cox_prop_hazards()';
660  END IF;
661  RETURN insert_string;
662 END;
663 $$ LANGUAGE plpgsql VOLATILE;