MADlib
0.7 A newer version is available
User Documentation
|
00001 /* ----------------------------------------------------------------------- *//** 00002 * 00003 * @file cox_prop_hazards.sql_in 00004 * 00005 * @brief SQL functions for cox proportional hazards 00006 * @date July 2012 00007 * 00008 * @sa For a brief introduction to cox regression, see the 00009 * module description \ref grp_cox_prop_hazards 00010 * 00011 *//* ----------------------------------------------------------------------- */ 00012 00013 m4_include(`SQLCommon.m4') 00014 00015 /** 00016 @addtogroup grp_cox_prop_hazards 00017 00018 @about 00019 Proportional-Hazard models enable the comparison of various survival models. 00020 These survival models are functions describing the probability of an one-item 00021 event (prototypically, this event is death) with respect to time. 00022 The interval of time before death occurs is the survival time. 00023 Let T be a random variable representing the survival time, 00024 with a cumulative probability function P(t). Informally, P(t) is 00025 the probability that death has happened before time t. 00026 00027 Generally, applications start with a list of \f$ \boldsymbol n \f$ observations, 00028 each with \f$ \boldsymbol m \f$ covariates and a time of death. From this 00029 \f$ \boldsymbol n \times m \f$ matrix, we would like to derive the correlation 00030 between the covariates and the hazard function. This amounts to finding 00031 the parameters \f$ \boldsymbol \beta \f$ that best fit the model described below. 00032 00033 Let us define: 00034 - \f$ \boldsymbol t \in \mathbf R^{m} \f$ denote the vector of observed dependent 00035 variables, with \f$ n \f$ rows. 00036 - \f$ X \in \mathbf R^{m} \f$ denote the design matrix with \f$ m \f$ 00037 columns and \f$ n \f$ rows, containing all observed vectors of independent 00038 variables \f$ \boldsymbol x_i \f$ as rows. 00039 - \f$ R(t_i) \f$ denote the set of observations still alive at time \f$ t_i \f$ 00040 00041 Note that this model <b>does not</b> include a <b>constant term</b>, and the data 00042 cannot contain a column of 1s. 00043 00044 By definition, 00045 \f[ 00046 P[T_k = t_i | \boldsymbol R(t_i)] 00047 = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}. 00048 \,. 00049 \f] 00050 00051 The <b>partial likelihood </b>function can now be generated as the product of 00052 conditional probabilities: 00053 \f[ 00054 \mathcal L = \prod_{i = 1}^n 00055 \left( 00056 \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} 00057 \right). 00058 \f] 00059 00060 The log-likelihood form of this equation is 00061 \f[ 00062 L = \sum_{i = 1}^n 00063 \left[ \beta^T x_i 00064 - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right) 00065 \right]. 00066 \f] 00067 00068 Using this score function and Hessian matrix, the partial likelihood can be 00069 maximized using the <b> Newton-Raphson algorithm </b>.<b> Breslow's method </b> 00070 is used to resolved tied times of deaths. The time of death for two records are 00071 considered "equal" if they differ by less than 1.0e-6 00072 00073 The inverse of the Hessian matrix, evaluated at the estimate of 00074 \f$ \boldsymbol \beta \f$, can be used as an <b>approximate variance-covariance 00075 matrix </b> for the estimate, and used to produce approximate 00076 <b>standard errors</b> for the regression coefficients. 00077 00078 \f[ 00079 \mathit{se}(c_i) = \left( (H)^{-1} \right)_{ii} 00080 \,. 00081 \f] 00082 The Wald z-statistic is 00083 \f[ 00084 z_i = \frac{c_i}{\mathit{se}(c_i)} 00085 \,. 00086 \f] 00087 00088 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under 00089 the assumptions inherent in the Wald test) of seeing a value at least as extreme 00090 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is 00091 true. Letting \f$ F \f$ denote the cumulative density function of a standard 00092 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is 00093 therefore 00094 \f[ 00095 p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) 00096 \f] 00097 where \f$ Z \f$ is a standard normally distributed random variable. 00098 00099 00100 The condition number is computed as \f$ \kappa(H) \f$ during the iteration 00101 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using 00102 the coefficients of the previous iteration). A large condition number (say, more 00103 than 1000) indicates the presence of significant multicollinearity. 00104 00105 00106 @input 00107 00108 The training data is expected to be of the following form:\n 00109 <pre>{TABLE|VIEW} <em>sourceName</em> ( 00110 ... 00111 <em>dependentVariable</em> FLOAT8, 00112 <em>independentVariables</em> FLOAT8[], 00113 ... 00114 )</pre> 00115 Note: Dependent Variables refer to the time of death. There is no need to 00116 pre-sort the data. Additionally, all the data is assumed 00117 00118 00119 @usage 00120 - Get vector of coefficients \f$ \boldsymbol \beta \f$ and all diagnostic 00121 statistics:\n 00122 <pre>SELECT * FROM \ref cox_prop_hazards( 00123 '<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>' 00124 [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ] 00125 );</pre> 00126 Output: 00127 Output: 00128 <pre>coef | log_likelihood | std_err | z_stats | p_values | condition_no | num_iterations 00129 ... 00130 </pre> 00131 - Get vector of coefficients \f$ \boldsymbol \beta \f$:\n 00132 <pre>SELECT (\ref cox_prop_hazards('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre> 00133 - Get a subset of the output columns, e.g., only the array of coefficients 00134 \f$ \boldsymbol \beta \f$, the log-likelihood of determination: 00135 <pre>SELECT coef, log_likelihood 00136 FROM \ref cox_prop_hazards('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>independentVariables</em>');</pre> 00137 00138 @examp 00139 00140 -# Create the sample data set: 00141 @verbatim 00142 sql> SELECT * FROM data; 00143 val | time 00144 ------------|-------------- 00145 {0,1.95} | 35 00146 {0,2.20} | 28 00147 {1,1.45} | 32 00148 {1,5.25} | 31 00149 {1,0.38} | 21 00150 ... 00151 @endverbatim 00152 -# Run the cox regression function: 00153 @verbatim 00154 sql> SELECT * FROM cox_prop_hazards('data', 'val', 'time', 100, 'newto', 0.001); 00155 ---------------|-------------------------------------------------------------- 00156 coef | {0.881089349817059,-0.0756817768938055} 00157 log_likelihood | -4.46535157957394 00158 std_err | {1.16954914708414,0.338426252282655} 00159 z_stats | {0.753356711368689,-0.223628410729811} 00160 p_values | {0.451235588326831,0.823046454908087} 00161 condition_no | 12.1135391339082 00162 num_iterations | 4 00163 00164 @endverbatim 00165 00166 @literature 00167 00168 A somewhat random selection of nice write-ups, with valuable pointers into 00169 further literature: 00170 00171 [1] John Fox: Cox Proportional-Hazards Regression for Survival Data, 00172 Appendix to An R and S-PLUS companion to Applied Regression Feb 2012, 00173 http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf 00174 00175 [2] Stephen J Walters: What is a Cox model? 00176 http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf 00177 00178 00179 @note Source and column names have to be passed as strings (due to limitations 00180 of the SQL syntax). 00181 00182 00183 @sa File cox_prop_hazards.sql_in (documenting the SQL functions) 00184 00185 @internal 00186 @sa Namespace cox_prop_hazards 00187 \ref madlib::modules::stats documenting the implementation in C++ 00188 @endinternal 00189 00190 */ 00191 00192 00193 DROP TYPE IF EXISTS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result; 00194 CREATE TYPE MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS ( 00195 x DOUBLE PRECISION[], 00196 exp_coef_x DOUBLE PRECISION, 00197 x_exp_coef_x DOUBLE PRECISION[], 00198 x_xTrans_exp_coef_x DOUBLE PRECISION[] 00199 ); 00200 00201 00202 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.intermediate_cox_prop_hazards( 00203 /*+ x */ DOUBLE PRECISION[], 00204 /*+ coef */ DOUBLE PRECISION[]) 00205 RETURNS MADLIB_SCHEMA.intermediate_cox_prop_hazards_result AS 00206 'MODULE_PATHNAME' 00207 LANGUAGE c IMMUTABLE; 00208 00209 00210 00211 DROP TYPE IF EXISTS MADLIB_SCHEMA.cox_prop_hazards_result; 00212 CREATE TYPE MADLIB_SCHEMA.cox_prop_hazards_result AS ( 00213 coef DOUBLE PRECISION[], 00214 logLikelihood DOUBLE PRECISION, 00215 std_err DOUBLE PRECISION[], 00216 z_stats DOUBLE PRECISION[], 00217 p_values DOUBLE PRECISION[], 00218 condition_no DOUBLE PRECISION, 00219 num_iterations INTEGER 00220 ); 00221 00222 00223 00224 00225 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_final( 00226 state DOUBLE PRECISION[]) 00227 RETURNS DOUBLE PRECISION[] 00228 AS 'MODULE_PATHNAME' 00229 LANGUAGE C IMMUTABLE STRICT; 00230 00231 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.cox_prop_hazards_step_transition( 00232 /*+ state */ DOUBLE PRECISION[], 00233 /*+ x */ DOUBLE PRECISION[], 00234 /*+ y */ DOUBLE PRECISION, 00235 /*+ exp_coef_x */ DOUBLE PRECISION, 00236 /*+ xexp_coef_x */ DOUBLE PRECISION[], 00237 /*+ x_xTrans_exp_coef_x */ DOUBLE PRECISION[], 00238 /*+ previous_state */ DOUBLE PRECISION[]) 00239 RETURNS DOUBLE PRECISION[] AS 00240 'MODULE_PATHNAME' 00241 LANGUAGE C IMMUTABLE; 00242 00243 00244 /** 00245 * @internal 00246 * @brief Perform one iteration the Newton-Rhapson method. 00247 */ 00248 CREATE 00249 m4_ifdef(`__GREENPLUM__',m4_ifdef(`__HAS_ORDERED_AGGREGATES__',`ORDERED')) 00250 AGGREGATE MADLIB_SCHEMA.cox_prop_hazards_step( 00251 00252 /*+ x */ DOUBLE PRECISION[], 00253 /*+ y */ DOUBLE PRECISION, 00254 /*+ exp_coef_x */ DOUBLE PRECISION, 00255 /*+ xexp_coef_x */ DOUBLE PRECISION[], 00256 /*+ x_xTrans_exp_coef_x */ DOUBLE PRECISION[], 00257 /*+ previous_state */ DOUBLE PRECISION[]) ( 00258 STYPE=DOUBLE PRECISION[], 00259 SFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_transition, 00260 FINALFUNC=MADLIB_SCHEMA.cox_prop_hazards_step_final, 00261 INITCOND='{0,0,0,0,0,0,0}' 00262 ); 00263 00264 00265 00266 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_step_distance( 00267 /*+ state1 */ DOUBLE PRECISION[], 00268 /*+ state2 */ DOUBLE PRECISION[]) 00269 RETURNS DOUBLE PRECISION AS 00270 'MODULE_PATHNAME' 00271 LANGUAGE c IMMUTABLE STRICT; 00272 00273 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_cox_prop_hazards_result( 00274 /*+ state */ DOUBLE PRECISION[]) 00275 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS 00276 'MODULE_PATHNAME' 00277 LANGUAGE c IMMUTABLE STRICT; 00278 00279 00280 -- We only need to document the last one (unfortunately, in Greenplum we have to 00281 -- use function overloading instead of default arguments). 00282 CREATE FUNCTION MADLIB_SCHEMA.compute_cox_prop_hazards( 00283 "source" VARCHAR, 00284 "indepColumn" VARCHAR, 00285 "depColumn" VARCHAR, 00286 "maxNumIterations" INTEGER, 00287 "optimizer" VARCHAR, 00288 "precision" DOUBLE PRECISION) 00289 RETURNS INTEGER 00290 AS $$PythonFunction(stats, cox_prop_hazards, compute_cox_prop_hazards)$$ 00291 LANGUAGE plpythonu VOLATILE; 00292 00293 /** 00294 * @brief Compute cox-regression coefficients and diagnostic statistics 00295 * 00296 * To include an intercept in the model, set one coordinate in the 00297 * <tt>independentVariables</tt> array to 1. 00298 * 00299 * @param source Name of the source relation containing the training data 00300 * @param indepColumn Name of the independent column 00301 * @param depColumn Name of the dependant column measuring time of death 00302 * @param maxNumIterations The maximum number of iterations 00303 * @param optimizer The optimizer to use (either 00304 * <tt>'newton'</tt>/<tt>'newton'</tt> for the newton method 00305 * @param precision The difference between log-likelihood values in successive 00306 * iterations that should indicate convergence. Note that a non-positive 00307 * value here disables the convergence criterion, and execution will only 00308 * stop after \c maxNumIterations iterations. 00309 * 00310 * @return A composite value: 00311 * - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol \beta \f$ 00312 * - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$l(\boldsymbol \beta)\f$ 00313 * - <tt>std_err FLOAT8[]</tt> - Array of standard errors, 00314 * \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$ 00315 * - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$ 00316 * - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$ 00317 * - <tt>condition_no FLOAT8</tt> - The condition number of matrix 00318 * \f$ H \f$ during the iteration immediately <em>preceding</em> 00319 * convergence (i.e., \f$ H \f$ is computed using the coefficients of the 00320 * previous iteration) 00321 * - <tt>num_iterations INTEGER</tt> - The number of iterations before the 00322 * algorithm terminated 00323 * 00324 * - Get vector of coefficients \f$ \boldsymbol \beta \f$ and all diagnostic 00325 * statistics:\n 00326 * <pre>SELECT * FROM \ref cox_prop_hazards( 00327 * '<em>sourceName</em>', '<em>dependentVariable</em>', 00328 * '<em>independentVariables</em>' 00329 * [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ] 00330 * );</pre> 00331 * - Get vector of coefficients \f$ \boldsymbol \beta \f$:\n 00332 * <pre>SELECT (\ref cox_prop_hazards('<em>sourceName</em>', 00333 * '<em>dependentVariable</em>', '<em>independentVariables</em>')).coef;</pre> 00334 * - Get a subset of the output columns, e.g., only the array of coefficients 00335 * \f$ \boldsymbol \beta \f$, the log-likelihood of determination: 00336 * <pre>SELECT coef, log_likelihood 00337 * FROM \ref cox_prop_hazards('<em>sourceName</em>', '<em>dependentVariable</em>', 00338 * '<em>independentVariables</em>');</pre> 00339 */ 00340 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards( 00341 "source" VARCHAR, 00342 "indepColumn" VARCHAR, 00343 "depColumn" VARCHAR, 00344 "maxNumIterations" INTEGER /*+ DEFAULT 20 */, 00345 "optimizer" VARCHAR /*+ DEFAULT 'newton' */, 00346 "precision" DOUBLE PRECISION /*+ DEFAULT 0.0001 */) 00347 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS $$ 00348 DECLARE 00349 theIteration INTEGER; 00350 fnName VARCHAR; 00351 theResult MADLIB_SCHEMA.cox_prop_hazards_result; 00352 BEGIN 00353 theIteration := ( 00354 SELECT MADLIB_SCHEMA.compute_cox_prop_hazards($1, $2, $3, $4, $5, $6) 00355 ); 00356 IF optimizer = 'newton' THEN 00357 fnName := 'internal_cox_prop_hazards_result'; 00358 ELSE 00359 RAISE EXCEPTION 'Unknown optimizer (''%'')', optimizer; 00360 END IF; 00361 EXECUTE 00362 $sql$ 00363 SELECT (result).* 00364 FROM ( 00365 SELECT 00366 MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result 00367 FROM _madlib_iterative_alg 00368 WHERE _madlib_iteration = $sql$ || theIteration || $sql$ 00369 ) subq 00370 $sql$ 00371 INTO theResult; 00372 00373 -- The number of iterations are not updated in the C++ code. We do it here. 00374 IF NOT (theResult IS NULL) THEN 00375 theResult.num_iterations = theIteration; 00376 END IF; 00377 RETURN theResult; 00378 END; 00379 $$ LANGUAGE plpgsql VOLATILE; 00380 00381 00382 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards( 00383 "source" VARCHAR, 00384 "indepColumn" VARCHAR, 00385 "depColumn" VARCHAR) 00386 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS 00387 $$SELECT MADLIB_SCHEMA.cox_prop_hazards($1, $2, $3, 20, 'newton', 0.0001);$$ 00388 LANGUAGE sql VOLATILE; 00389 00390 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards( 00391 "source" VARCHAR, 00392 "indepColumn" VARCHAR, 00393 "depColumn" VARCHAR, 00394 "maxNumIterations" INTEGER) 00395 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS 00396 $$SELECT MADLIB_SCHEMA.cox_prop_hazards($1, $2, $3, $4, 'newton', 0.0001);$$ 00397 LANGUAGE sql VOLATILE; 00398 00399 CREATE FUNCTION MADLIB_SCHEMA.cox_prop_hazards( 00400 "source" VARCHAR, 00401 "indepColumn" VARCHAR, 00402 "depColumn" VARCHAR, 00403 "maxNumIterations" INTEGER, 00404 "optimizer" VARCHAR) 00405 RETURNS MADLIB_SCHEMA.cox_prop_hazards_result AS 00406 $$SELECT MADLIB_SCHEMA.cox_prop_hazards($1, $2, $3, $4, $5, 0.0001);$$ 00407 LANGUAGE sql VOLATILE;