User Documentation
Cox-Proportional Hazards Regression
About:
Proportional-hazards models enable the comparison of various survival models. These survival models are functions describing the probability of a one-time event (prototypically, death) with respect to time. The interval of time before death occurs is the survival time. Let T be a random variable representing the survival time, with cumulative distribution function P(t). Informally, P(t) is the probability that death has occurred before time t.

Generally, applications start with a list of \( \boldsymbol n \) observations, each with \( \boldsymbol m \) covariates and a time of death. From this \( \boldsymbol n \times m \) matrix, we would like to derive the correlation between the covariates and the hazard function. This amounts to finding the parameters \( \boldsymbol \beta \) that best fit the model described below.

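Let us define \( x_k \) as the vector of \( m \) covariates of observation \( k \), \( T_k \) as its survival time, \( t_i \) as the observed time of death of observation \( i \), and \( R(t_i) \) as the risk set at time \( t_i \), that is, the set of observations still alive just before \( t_i \).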

Note that this model does not include a constant term, and the data cannot contain a column of 1s.

By definition,

\[ P[T_k = t_i | \boldsymbol R(t_i)] = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} \,. \]

The partial likelihood function can now be generated as the product of conditional probabilities:

\[ \mathcal L = \prod_{i = 1}^n \left( \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} \right). \]

The log-likelihood form of this equation is

\[ L = \sum_{i = 1}^n \left[ \beta^T x_i - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right) \right]. \]
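
The score function and Hessian matrix follow from differentiating this log-likelihood with respect to \( \boldsymbol \beta \). As a sketch of that step, the shorthand \( W_i \), \( A_i \) and \( B_i \) below is introduced here for illustration only:

\[ W_i = \sum_{j \in R(t_i)} e^{\beta^T x_j}, \qquad A_i = \sum_{j \in R(t_i)} x_j e^{\beta^T x_j}, \qquad B_i = \sum_{j \in R(t_i)} x_j x_j^T e^{\beta^T x_j} \,. \]

\[ \nabla_\beta L = \sum_{i = 1}^n \left[ x_i - \frac{A_i}{W_i} \right], \qquad H = \nabla^2_\beta L = \sum_{i = 1}^n \left[ \frac{A_i A_i^T}{W_i^2} - \frac{B_i}{W_i} \right]. \]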

Using the score function (the gradient of \( L \) with respect to \( \boldsymbol \beta \)) and the Hessian matrix \( H \) of \( L \), the partial likelihood can be maximized using the Newton-Raphson algorithm. Breslow's method is used to resolve tied times of death. The times of death of two records are considered "equal" if they differ by less than 1.0e-6.
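
To make the maximization step concrete, here is a minimal Python sketch of Newton-Raphson for a model with a single covariate. It is not the module's implementation: the function names, the toy data and the convergence tolerance are assumptions made for illustration, and censored rows are assumed to contribute to risk sets only.

```python
import math

def score_and_hessian(beta, times, x, died, tie_tol=1.0e-6):
    """First and second derivative of the log partial likelihood (one covariate)."""
    grad, hess = 0.0, 0.0
    for i in range(len(times)):
        if not died[i]:
            continue  # assumption: censored rows appear only in risk sets
        w = a = b = 0.0
        for j in range(len(times)):
            # Risk set R(t_i): rows whose time is >= t_i. Times closer than
            # tie_tol count as equal, so tied deaths share one risk set
            # (Breslow's approximation).
            if times[j] > times[i] - tie_tol:
                e = math.exp(beta * x[j])
                w += e
                a += x[j] * e
                b += x[j] * x[j] * e
        grad += x[i] - a / w
        hess += (a / w) ** 2 - b / w
    return grad, hess

def fit_cox_1d(times, x, died, tol=1.0e-10, max_iter=50):
    """Newton-Raphson iteration starting from beta = 0."""
    beta = 0.0
    for _ in range(max_iter):
        grad, hess = score_and_hessian(beta, times, x, died)
        step = grad / hess
        beta -= step  # Newton update: beta - H^{-1} * gradient
        if abs(step) < tol:
            break
    return beta

# Hypothetical toy data: six deaths, alternating values of one binary covariate.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
died = [True] * 6
beta_hat = fit_cox_1d(times, x, died)
```

At the returned estimate the score is numerically zero and the second derivative is negative, which is what identifies a maximum of the concave log partial likelihood.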

The inverse of the negative Hessian matrix (the observed information matrix), evaluated at the estimate of \( \boldsymbol \beta \), can be used as an approximate variance-covariance matrix for the estimate. The square roots of its diagonal entries give approximate standard errors for the regression coefficients \( c_i \):

\[ \mathit{se}(c_i) = \sqrt{ \left( (-H)^{-1} \right)_{ii} } \,. \]

The Wald z-statistic is

\[ z_i = \frac{c_i}{\mathit{se}(c_i)} \,. \]

The Wald \( p \)-value for coefficient \( i \) gives the probability (under the assumptions inherent in the Wald test) of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \)) is true. Letting \( F \) denote the cumulative distribution function of a standard normal distribution, the Wald \( p \)-value for coefficient \( i \) is therefore

\[ p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) \]

where \( Z \) is a standard normally distributed random variable.
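
The two formulas above can be checked with a few lines of Python. This is a sketch only: the helper name wald_test is hypothetical, and the inputs are the coef and std_err values listed in the Examples section of this page.

```python
import math

def wald_test(coef, std_err):
    """Return the Wald z-statistic and two-sided p-value for one coefficient."""
    z = coef / std_err
    # For a standard normal Z, 1 - F(a) = 0.5 * erfc(a / sqrt(2)), hence
    # 2 * (1 - F(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Coefficients and standard errors copied from the example output.
z1, p1 = wald_test(0.881089349817059, 1.16954914708414)
z2, p2 = wald_test(-0.0756817768938055, 0.338426252282655)
```

The results agree with the z_stats and p_values columns of the example output to about four decimal places.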

The condition number is computed as \( \kappa(H) \) during the iteration immediately preceding convergence (i.e., \( H \) is computed using the coefficients of the previous iteration). A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
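
As an illustration of how the condition number reacts to collinearity, the following Python sketch computes the 2-norm condition number of a symmetric 2 x 2 matrix in closed form. The helper name and both matrices are hypothetical, and the module's own computation of \( \kappa(H) \) may differ.

```python
import math

def condition_number_2x2(a, b, d):
    """2-norm condition number of the symmetric matrix [[a, b], [b, d]]."""
    mid = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    # Eigenvalues are mid +/- radius. For a symmetric matrix the singular
    # values are the absolute eigenvalues.
    lam_big = max(abs(mid + radius), abs(mid - radius))
    lam_small = min(abs(mid + radius), abs(mid - radius))
    return math.inf if lam_small == 0.0 else lam_big / lam_small

# Two unrelated covariates: the Hessian is close to diagonal.
well_conditioned = condition_number_2x2(-2.0, 0.0, -1.0)

# Two nearly collinear covariates: the off-diagonal entry almost equals the diagonal.
nearly_collinear = condition_number_2x2(-1.0, -0.999, -1.0)
```

The first matrix stays far below the threshold of 1000 mentioned above, while the second exceeds it.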

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} sourceName (
    inputTable VARCHAR,
    outputTable VARCHAR,
    dependentVariable VARCHAR,
    independentVariable VARCHAR,
    [rightCensoringStatus VARCHAR]
)

Note: The dependent variable is the time of death. There is no need to pre-sort the data.

Note 2: 'right_censoring_status' is TRUE if the observation is not censored and FALSE if the observation is censored. The default value of 'right_censoring_status' is TRUE for all observations.

Usage:

The Full Interface

SELECT madlib.cox_prop_hazards(
    'source_table',            -- name of input table, VARCHAR
    'out_table',               -- name of output table, VARCHAR
    'dependent_varname',       -- name of dependent variable, VARCHAR
    'independent_varname'        -- name of independent variable, VARCHAR
    [, 'right_censoring_status'] -- name of column with right censoring status, VARCHAR (OPTIONAL, default=TRUE)
);

Here 'right_censoring_status' can be the name of a column containing boolean values. It can also be a string expression such as 'dependent_variable < 10', where dependent_variable is a column name.

Here 'independent_varname' can be the name of a column containing an array of numeric values. It can also be a string expression such as 'array[x1, x2, x3]', where x1, x2 and x3 are all column names. Because the model has no constant term, the array must not include a constant entry of 1.

Output is stored in the out_table:

 coef | std_err | z_stats | p_values
------+---------+---------+----------
  1. For function summary information, run:
    sql> select cox_prop_hazards('help');
    OR
    sql> select cox_prop_hazards();
    OR
    sql> select cox_prop_hazards('?');
    
  2. For function usage information, run:
    sql> select cox_prop_hazards('usage');
    

Note: The function cox_prop_hazards_regr is deprecated but still maintained.

Examples:
  1. Create the sample data set:
    sql> SELECT * FROM data;
          val   | time | status
    ------------+------+--------
     {0,1.95}   |  35  |  t
     {0,2.20}   |  28  |  t
     {1,1.45}   |  32  |  t
     {1,5.25}   |  31  |  t
     {1,0.38}   |  21  |  t
    ...
    
  2. Run the cox regression function:
    sql> SELECT * FROM cox_prop_hazards('data', 'result_table', 'time', 'val', 'status');
    sql> SELECT * FROM result_table;
    --------------|--------------------------------------------------------------
    coef           | {0.881089349817059,-0.0756817768938055}
    std_err        | {1.16954914708414,0.338426252282655}
    z_stats        | {0.753356711368689,-0.223628410729811}
    p_values       | {0.451235588326831,0.823046454908087}
Literature:

A somewhat random selection of nice write-ups, with valuable pointers into further literature:

[1] John Fox: Cox Proportional-Hazards Regression for Survival Data, Appendix to An R and S-PLUS companion to Applied Regression Feb 2012, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf

[2] Stephen J Walters: What is a Cox model? http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf

Note
Source and column names have to be passed as strings (due to limitations of the SQL syntax).
See Also
File cox_prop_hazards.sql_in (documenting the SQL functions)