User Documentation
cox_prop_hazards.sql_in
Go to the documentation of this file.
00001 /* ----------------------------------------------------------------------- *//** 
00002  *
00003  * @file cox_prop_hazards.sql_in
00004  *
00005  * @brief SQL functions for cox proportional hazards
00006  * @date July 2012
00007  *
00008  * @sa For a brief introduction to cox regression, see the
00009  *     module description \ref grp_cox_prop_hazards
00010  *
00011  *//* ----------------------------------------------------------------------- */
00012 
00013 m4_include(`SQLCommon.m4')
00014 
00015 /**
00016 @addtogroup grp_cox_prop_hazards
00017 
00018 @about
00019 Proportional-Hazard models enable the comparison of various survival models. 
00020 These survival models are functions describing the probability of an one-item 
00021 event (prototypically, this event is death) with respect to time. 
00022 The interval of time before death occurs is the survival time. 
00023 Let T be a random variable representing the survival time, 
00024 with a cumulative probability function P(t). Informally, P(t) is 
00025 the probability that death has happened before time t.
00026 
00027 Generally, applications start with a list of \f$ \boldsymbol n \f$ observations, 
00028 each with \f$ \boldsymbol m \f$ covariates and a time of death. From this 
00029 \f$ \boldsymbol n \times m \f$ matrix, we would like to derive the correlation 
00030 between the covariates and the hazard function. This amounts to finding 
00031 the parameters \f$ \boldsymbol \beta \f$ that best fit the model described below.
00032 
00033 Let us define:
00034 - \f$ \boldsymbol t \in  \mathbf R^{m} \f$ denote the vector of observed dependent
00035   variables, with \f$ n \f$ rows.
00036 - \f$ X \in \mathbf R^{m} \f$ denote the design matrix with \f$ m \f$
00037   columns and \f$ n \f$ rows, containing all observed vectors of independent
00038   variables \f$ \boldsymbol x_i \f$ as rows.
00039 - \f$ R(t_i) \f$ denote the set of observations still alive at time \f$ t_i \f$
00040 
00041 Note that this model <b>does not</b> include a <b>constant term</b>, and the data
00042 cannot contain a column of 1s.
00043 
00044 By definition,
00045 \f[
00046     P[T_k = t_i | \boldsymbol R(t_i)]
00047     =    \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}.
00048     \,.
00049 \f]
00050 
00051 The <b>partial likelihood </b>function can now be generated as the product of 
00052 conditional probabilities:
00053 \f[
00054 \mathcal L = \prod_{i = 1}^n 
00055                 \left(
00056                 \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}
00057                 \right).
00058 \f]
00059 
00060 The log-likelihood form of this equation is
00061 \f[
00062 L = \sum_{i = 1}^n 
00063                     \left[  \beta^T x_i
00064                                     - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right)
00065                     \right].
00066 \f]
00067 
00068 Using this score function and Hessian matrix, the partial likelihood can be 
00069 maximized using the <b> Newton-Raphson algorithm </b>.<b> Breslow's method </b> 
00070 is used to resolved tied times of deaths. The time of death for two records are 
00071 considered "equal" if they differ by less than 1.0e-6
00072 
00073 The inverse of the Hessian matrix, evaluated at the estimate of 
00074 \f$ \boldsymbol \beta \f$, can be used as an <b>approximate variance-covariance 
00075 matrix </b> for the estimate, and used to produce approximate 
00076 <b>standard errors</b> for the regression coefficients.
00077 
00078 \f[
00079     \mathit{se}(c_i) = \left( (H)^{-1} \right)_{ii}
00080     \,.
00081 \f]
00082 The Wald z-statistic is
00083 \f[
00084     z_i = \frac{c_i}{\mathit{se}(c_i)}
00085     \,.
00086 \f]
00087 
00088 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
00089 the assumptions inherent in the Wald test) of seeing a value at least as extreme
00090 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
00091 true. Letting \f$ F \f$ denote the cumulative density function of a standard
00092 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is
00093 therefore
00094 \f[
00095     p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
00096 \f]
00097 where \f$ Z \f$ is a standard normally distributed random variable.
00098 
00099 
00100 The condition number is computed as \f$ \kappa(H) \f$ during the iteration
00101 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
00102 the coefficients of the previous iteration). A large condition number (say, more
00103 than 1000) indicates the presence of significant multicollinearity.
00104 
00105 
00106 @input
00107 
00108 The training data is expected to be of the following form:\n
00109 <pre>{TABLE|VIEW} <em>sourceName</em> (
00110     ...
00111     <em>dependentVariable</em> FLOAT8,
00112     <em>independentVariables</em> FLOAT8[],
00113     ...
00114 )</pre>
00115 Note: Dependent Variables refer to the time of death. There is no need to
00116 pre-sort the data. Additionally, all the data is assumed
00117 
00118 
00119 @usage
00120 - Get vector of coefficients \f$ \boldsymbol \beta \f$ and all diagnostic
00121   statistics:\n
00122   <pre>SELECT * FROM \ref cox_prop_hazards(
00123     '<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>'
00124     [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ]
00125 );</pre>
00126   Output:
00127   Output:
00128   <pre>coef | log_likelihood | std_err | z_stats | p_values  | condition_no | num_iterations
00129                                                ...
00130 </pre>
00131 - Get vector of coefficients \f$ \boldsymbol \beta \f$:\n
00132   <pre>SELECT (\ref cox_prop_hazards('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre>
00133 - Get a subset of the output columns, e.g., only the array of coefficients
00134   \f$ \boldsymbol \beta \f$, the log-likelihood of determination:
00135   <pre>SELECT coef, log_likelihood
00136 FROM \ref cox_prop_hazards('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>');</pre>
00137 
00138 @examp
00139 
00140 -# Create the sample data set:
00141 @verbatim 
00142 sql> SELECT * FROM data;
00143       val   | time
00144 ------------|--------------
00145  {0,1.95}   | 35
00146  {0,2.20}   | 28
00147  {1,1.45}   | 32
00148  {1,5.25}   | 31
00149  {1,0.38}   | 21
00150 ...
00151 @endverbatim
00152 -# Run the cox regression function:
00153 @verbatim
00154 sql> SELECT * FROM cox_prop_hazards('data', 'val', 'time', 100, 'newto', 0.001);
00155 ---------------|--------------------------------------------------------------
00156 coef           | {0.881089349817059,-0.0756817768938055}
00157 log_likelihood | -4.46535157957394
00158 std_err        | {1.16954914708414,0.338426252282655}
00159 z_stats        | {0.753356711368689,-0.223628410729811}
00160 p_values       | {0.451235588326831,0.823046454908087}
00161 condition_no   | 12.1135391339082
00162 num_iterations | 4
00163 
00164 @endverbatim
00165 
00166 @literature
00167 
00168 A somewhat random selection of nice write-ups, with valuable pointers into
00169 further literature:
00170 
00171 [1] John Fox: Cox Proportional-Hazards Regression for Survival Data,
00172         Appendix to An R and S-PLUS companion to Applied Regression Feb 2012,
00173     http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf
00174 
00175 [2] Stephen J Walters: What is a Cox model?
00176     http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf
00177 
00178 
00179 @note Source and column names have to be passed as strings (due to limitations 
00180 of the SQL syntax).
00181 
00182 
00183 @sa File cox_prop_hazards.sql_in (documenting the SQL functions)
00184 
00185 @internal
00186 @sa Namespace cox_prop_hazards 
00187     \ref madlib::modules::stats documenting the implementation in C++
00188 @endinternal
00189 
00190 */
00191 
00192 
00193 DROP TYPE IF EXISTS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result;
00194 CREATE TYPE MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS (
00195     x DOUBLE PRECISION[],
00196     exp_coef_x DOUBLE PRECISION,
00197     x_exp_coef_x DOUBLE PRECISION[],
00198     x_xTrans_exp_coef_x DOUBLE PRECISION[]
00199 );
00200 
00201 
00202 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.intermediate_cox_prop_hazards(
00203     /*+  x */ DOUBLE PRECISION[],
00204     /*+  coef */ DOUBLE PRECISION[])
00205 RETURNS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS
00206 'MODULE_PATHNAME'
00207 LANGUAGE c IMMUTABLE;
00208 
00209 
00210 
00211 DROP TYPE IF EXISTS MADLIB_SCHEMA.cox_prop_hazards_result;
00212 CREATE TYPE MADLIB_SCHEMA.cox_prop_hazards_result AS (
00213     coef DOUBLE PRECISION[],
00214     logLikelihood DOUBLE PRECISION,
00215     std_err DOUBLE PRECISION[],
00216     z_stats DOUBLE PRECISION[],
00217     p_values DOUBLE PRECISION[],
00218     condition_no DOUBLE PRECISION,
00219     num_iterations INTEGER
00220 );
00221 
00222 
00223 
00224 
00225 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_final(
00226     state DOUBLE PRECISION[])
00227 RETURNS DOUBLE PRECISION[]
00228 AS 'MODULE_PATHNAME'
00229 LANGUAGE C IMMUTABLE STRICT;
00230 
00231 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_transition(
00232     /*+  state */ DOUBLE PRECISION[],
00233     /*+  x */ DOUBLE PRECISION[],
00234     /*+  y */ DOUBLE PRECISION,     
00235     /*+  exp_coef_x */ DOUBLE PRECISION,
00236     /*+  xexp_coef_x */ DOUBLE PRECISION[],
00237     /*+  x_xTrans_exp_coef_x */ DOUBLE PRECISION[],
00238     /*+  previous_state */ DOUBLE PRECISION[])
00239 RETURNS DOUBLE PRECISION[] AS 
00240 'MODULE_PATHNAME'
00241 LANGUAGE C IMMUTABLE;
00242 
00243 
00244 /**
00245  * @internal
00246  * @brief Perform one iteration the Newton-Rhapson method.
00247  */
00248 CREATE
00249 m4_ifdef(`__GREENPLUM__',m4_ifdef(`__HAS_ORDERED_AGGREGATES__',`ORDERED'))
00250 AGGREGATE MADLIB_SCHEMA.cox_prop_hazards_step(
00251 
00252     /*+  x */ DOUBLE PRECISION[],
00253     /*+  y */ DOUBLE PRECISION,
00254     /*+  exp_coef_x */ DOUBLE PRECISION,
00255     /*+  xexp_coef_x */ DOUBLE PRECISION[],
00256     /*+  x_xTrans_exp_coef_x */ DOUBLE PRECISION[],
00257     /*+ previous_state */ DOUBLE PRECISION[]) (    
00258     STYPE=DOUBLE PRECISION[],
00259     SFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_transition,
00260     FINALFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_final,
00261     INITCOND='{0,0,0,0,0,0,0}'
00262 );
00263 
00264 
00265 
00266 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_step_distance(
00267     /*+ state1 */ DOUBLE PRECISION[],
00268     /*+ state2 */ DOUBLE PRECISION[])
00269 RETURNS DOUBLE PRECISION AS
00270 'MODULE_PATHNAME'
00271 LANGUAGE c IMMUTABLE STRICT;
00272 
00273 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_result(
00274     /*+ state */ DOUBLE PRECISION[])
00275 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
00276 'MODULE_PATHNAME'
00277 LANGUAGE c IMMUTABLE STRICT;
00278 
00279 
00280 -- We only need to document the last one (unfortunately, in Greenplum we have to
00281 -- use function overloading instead of default arguments).
00282 CREATE FUNCTION MADLIB_SCHEMA.compute_cox_prop_hazards(
00283     "source" VARCHAR,
00284     "indepColumn" VARCHAR,
00285     "depColumn" VARCHAR,
00286     "maxNumIterations" INTEGER,
00287     "optimizer" VARCHAR,
00288     "precision" DOUBLE PRECISION)
00289 RETURNS INTEGER
00290 AS $$PythonFunction(stats, cox_prop_hazards, compute_cox_prop_hazards)$$
00291 LANGUAGE plpythonu VOLATILE;
00292 
00293 /**
00294  * @brief Compute cox-regression coefficients and diagnostic statistics
00295  *
00296  * To include an intercept in the model, set one coordinate in the
00297  * <tt>independentVariables</tt> array to 1.
00298  * 
00299  * @param source Name of the source relation containing the training data
00300  * @param indepColumn Name of the independent column
00301  * @param depColumn Name of the dependant column measuring time of death
00302  * @param maxNumIterations The maximum number of iterations
00303  * @param optimizer The optimizer to use (either
00304  *        <tt>'newton'</tt>/<tt>'newton'</tt> for the newton method
00305  * @param precision The difference between log-likelihood values in successive
00306  *        iterations that should indicate convergence. Note that a non-positive
00307  *        value here disables the convergence criterion, and execution will only
00308  *        stop after \c maxNumIterations iterations.
00309  *
00310  * @return A composite value:
00311  *  - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol \beta \f$
00312  *  - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$l(\boldsymbol \beta)\f$
00313  *  - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
00314  *    \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
00315  *  - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
00316  *  - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
00317  *  - <tt>condition_no FLOAT8</tt> - The condition number of matrix
00318  *    \f$ H \f$ during the iteration immediately <em>preceding</em>
00319  *    convergence (i.e., \f$ H \f$ is computed using the coefficients of the
00320  *    previous iteration)
00321  *  - <tt>num_iterations INTEGER</tt> - The number of iterations before the
00322  *    algorithm terminated
00323  *
00324  * - Get vector of coefficients \f$ \boldsymbol \beta \f$ and all diagnostic
00325  *  statistics:\n
00326  *  <pre>SELECT * FROM \ref cox_prop_hazards(
00327  *    '<em>sourceName</em>', '<em>dependentVariable</em>', 
00328  *      '<em>independentVariables</em>'
00329  *    [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ]
00330  * );</pre>
00331  * - Get vector of coefficients \f$ \boldsymbol \beta \f$:\n
00332  *  <pre>SELECT (\ref cox_prop_hazards('<em>sourceName</em>', 
00333  * '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre>
00334  * - Get a subset of the output columns, e.g., only the array of coefficients
00335  *  \f$ \boldsymbol \beta \f$, the log-likelihood of determination:
00336  *  <pre>SELECT coef, log_likelihood
00337  * FROM \ref cox_prop_hazards('<em>sourceName</em>', '<em>dependentVariable</em>',
00338  * '<em>independentVariables</em>');</pre>
00339  */
00340 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
00341     "source" VARCHAR,
00342     "indepColumn" VARCHAR,
00343     "depColumn" VARCHAR,
00344     "maxNumIterations" INTEGER /*+ DEFAULT 20 */,
00345     "optimizer" VARCHAR /*+ DEFAULT 'newton' */,
00346     "precision" DOUBLE PRECISION /*+ DEFAULT 0.0001 */)
00347 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS $$
00348 DECLARE
00349     theIteration INTEGER;
00350     fnName VARCHAR;
00351     theResult MADLIB_SCHEMA.cox_prop_hazards_result;
00352 BEGIN
00353     theIteration := (
00354         SELECT MADLIB_SCHEMA.compute_cox_prop_hazards($1, $2, $3, $4, $5, $6)
00355     );
00356     IF optimizer = 'newton' THEN
00357         fnName := 'internal_cox_prop_hazards_result';
00358     ELSE
00359         RAISE EXCEPTION 'Unknown optimizer (''%'')', optimizer;
00360     END IF;
00361     EXECUTE
00362         $sql$
00363         SELECT (result).*
00364         FROM (
00365             SELECT
00366                 MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result
00367                 FROM _madlib_iterative_alg
00368                 WHERE _madlib_iteration = $sql$ || theIteration || $sql$
00369             ) subq
00370         $sql$
00371         INTO theResult;
00372                 
00373     -- The number of iterations are not updated in the C++ code. We do it here.
00374     IF NOT (theResult IS NULL) THEN
00375         theResult.num_iterations = theIteration;
00376     END IF;
00377     RETURN theResult;
00378 END;
00379 $$ LANGUAGE plpgsql VOLATILE;
00380 
00381 
00382 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
00383     "source" VARCHAR,
00384     "indepColumn" VARCHAR,
00385     "depColumn" VARCHAR)
00386 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
00387 $$SELECT MADLIB_SCHEMA.cox_prop_hazards($1, $2, $3, 20, 'newton', 0.0001);$$
00388 LANGUAGE sql VOLATILE;
00389 
00390 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
00391     "source" VARCHAR,
00392     "indepColumn" VARCHAR,
00393     "depColumn" VARCHAR,
00394     "maxNumIterations" INTEGER)
00395 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
00396 $$SELECT MADLIB_SCHEMA.cox_prop_hazards($1, $2, $3, $4, 'newton', 0.0001);$$
00397 LANGUAGE sql VOLATILE;
00398 
00399 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards(
00400     "source" VARCHAR,
00401     "indepColumn" VARCHAR,
00402     "depColumn" VARCHAR,
00403     "maxNumIterations" INTEGER,
00404     "optimizer" VARCHAR)
00405 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS
00406 $$SELECT MADLIB_SCHEMA.cox_prop_hazards($1, $2, $3, $4, $5, 0.0001);$$
00407 LANGUAGE sql VOLATILE;