Contents

Proportional-Hazard models enable the comparison of various survival models. These survival models are functions describing the probability of a one-item event (prototypically, this event is death) with respect to time. The interval of time before the occurrence of death can be called the survival time. Let T be a random variable representing the survival time, with a cumulative probability function P(t). Informally, P(t) is the probability that death has happened before time t.

Training Function

Following is the syntax for the coxph_train() training function:

coxph_train( source_table,
             output_table,
             dependent_variable,
             independent_variable,
             right_censoring_status,
             strata,
             optimizer_params
           )

Arguments

source_table

TEXT. The name of the table containing input data.

output_table

TEXT. The name of the table where the output model is saved. The output is saved in the table named by the output_table argument. It has the following columns:

coef	FLOAT8[]. Vector of the coefficients.
loglikelihood	FLOAT8. Log-likelihood value of the MLE estimate.
std_err	FLOAT8[]. Vector of the standard error of the coefficients.
stats	FLOAT8[]. Vector of the statistics of the coefficients.
p_values	FLOAT8[]. Vector of the p-values of the coefficients.
hessian	FLOAT8[]. The Hessian matrix computed using the final solution.
num_iterations	INTEGER. The number of iterations performed by the optimizer.

Additionally, a summary output table is generated that contains a summary of the parameters used for building the Cox model. It is stored in a table named <output_table>_summary. It has the following columns:

source_table	The source table name.
dependent_variable	The dependent variable name.
independent_variable	The independent variable name.
right_censoring_status	The right censoring status
strata	The stratification columns
num_processed	The number of rows that were actually used in the computation.
num_missing_rows_skipped	The number of rows that were skipped in the computation due to NULL values in them.

dependent_variable

TEXT. A string containing the name of a column that contains an array of numeric values, or a string expression in the format 'ARRAY[1, x1, x2, x3]', where x1, x2 and x3 are column names. Dependent variables refer to the time of death. There is no need to pre-sort the data.

independent_variable

TEXT. The name of the independent variable.

right_censoring_status (optional)

TEXT, default: TRUE for all observations. A string containing an expression that evaluates to the right-censoring status for the observation—TRUE if the observation is not censored and FALSE if the observation is censored. The string could contain the name of the column containing the right-censoring status, a fixed Boolean expression (i.e., 'true', 'false', '0', '1') that applies to all observations, or a Boolean expression such as 'column_name < 10' that can be evaluated for each observation.

strata (optional)

VARCHAR, default: NULL, which does not do any stratifications. A string of comma-separated column names that are the strata ID variables used to do stratification.

optimizer_params (optional)

VARCHAR, default: NULL, which uses the default values of optimizer parameters: max_iter=100, optimizer=newton, tolerance=1e-8, array_agg_size=10000000, sample_size=1000000. It should be a string that contains 'key=value' pairs separated by commas. The meanings of these parameters are:

max_iter — The maximum number of iterations. The computation stops if the number of iterations exceeds this, which usually means that there is no convergence.
optimizer — The optimization method. Right now, "newton" is the only one supported.
tolerance — The stopping criteria. When the difference between the log-likelihoods of two consecutive iterations is smaller than this number, the computation has already converged and stops.
array_agg_size — To speed up the computation, the original data table is cut into multiple pieces, and each pieces of the data is aggregated into one big row. In the process of computation, the whole big row is loaded into memory and thus speed up the computation. This parameter controls approximately how many numbers we want to put into one big row. Larger value of array_agg_size may speed up more, but the size of the big row cannot exceed 1GB due to the restriction of PostgreSQL databases.
sample_size — To cut the data into approximate equal pieces, we first sample the data, and then find out the break points using this sampled data. A larger sample_size produces more accurate break points.

Proportional Hazards Assumption Test Function

The cox_zph() function tests the proportional hazards assumption (PHA) of a Cox regression.

Proportional-hazard models enable the comparison of various survival models. These PH models, however, assume that the hazard for a given individual is a fixed proportion of the hazard for any other individual, and the ratio of the hazards is constant across time. MADlib does not currently have support for performing any transformation of the time to compute the correlation.

The cox_zph() function is used to test this assumption by computing the correlation of the residual of the coxph_train() model with time.

Following is the syntax for the cox_zph() function:

cox_zph(cox_model_table, output_table)

Arguments

cox_model_table

TEXT. The name of the table containing the Cox Proportional-Hazards model.

output_table

TEXT. The name of the table where the test statistics are saved. The output table is named by the output_table argument and has the following columns:

covariate	TEXT. The independent variables.
rho	FLOAT8[]. Vector of the correlation coefficients between survival time and the scaled Schoenfeld residuals.
chi_square	FLOAT8[]. Chi-square test statistic for the correlation analysis.
p_value	FLOAT8[]. Two-side p-value for the chi-square statistic.

Additionally, the residual values are outputted to the table named output_table_residual. The table contains the following columns:

<dep_column_name>	FLOAT8. Time values (dependent variable) present in the original source table.
residual	FLOAT8[]. Difference between the original covariate values and the expectation of the covariates obtained from the coxph_train model.
scaled_residual	Residual values scaled by the variance of the coefficients.

Notes

Table names can be optionally schema qualified (current_schemas() is used if a schema name is not provided) and table and column names should follow case-sensitivity and quoting rules per the database. For instance, 'mytable' and 'MyTable' both resolve to the same entity—'mytable'. If mixed-case or multi-byte characters are desired for entity names then the string should be double-quoted; in this case the input would be '"MyTable"'.
The cox_prop_hazards_regr() and cox_prop_hazards() functions have been deprecated; coxph_train() should be used instead.

Prediction Function

The prediction function is provided to calculate the linear predictionors, risk or the linear terms for the given prediction data. It has the following syntax:

coxph_predict(model_table,
              source_table,
              id_col_name,
              output_table,
              pred_type,
              reference)

Arguments

model_table

TEXT. Name of the table containing the cox model.

source_table

TEXT. Name of the table containing the prediction data.

id_col_name

TEXT. Name of the id column in the source table.

output_table

TEXT. Name of the table to store the prediction results in. The output table is named by the output_table argument and has the following columns:

id	TEXT. The id column name from the source table.
predicted_result	DOUBLE PRECISION. Result of prediction based of the value of the prediction type parameter.

pred_type

TEXT, OPTIONAL. Type of prediction. This can be one of 'linear_predictors', 'risk', or 'terms'. DEFAULT='linear_predictors'.

'linear_predictors' calculates the dot product of the independent variables and the coefficients.
'risk' is the exponentiated value of the linear prediction.
'terms' correspond to the linear terms obtained by multiplying the independent variables with their corresponding coefficients values (without further calculating the sum of these terms)

reference

TEXT, OPTIONAL. Reference level to use for centering predictions. Can be one of 'strata', 'overall'. DEFAULT='strata'. Note that R uses 'sample' instead of 'overall' when referring to the overall mean value of the covariates as being the reference level.

Examples

Create an input data set.

DROP TABLE IF EXISTS sample_data;
CREATE TABLE sample_data (
    id INTEGER NOT NULL,
    grp DOUBLE PRECISION,
    wbc DOUBLE PRECISION,
    timedeath INTEGER,
    status BOOLEAN
);
COPY sample_data FROM STDIN WITH DELIMITER '|';
  0 |   0 | 1.45 |        35 | t
  1 |   0 | 1.47 |        34 | t
  3 |   0 |  2.2 |        32 | t
  4 |   0 | 1.78 |        25 | t
  5 |   0 | 2.57 |        23 | t
  6 |   0 | 2.32 |        22 | t
  7 |   0 | 2.01 |        20 | t
  8 |   0 | 2.05 |        19 | t
  9 |   0 | 2.16 |        17 | t
 10 |   0 |  3.6 |        16 | t
 11 |   1 |  2.3 |        15 | t
 12 |   0 | 2.88 |        13 | t
 13 |   1 |  1.5 |        12 | t
 14 |   0 |  2.6 |        11 | t
 15 |   0 |  2.7 |        10 | t
 16 |   0 |  2.8 |         9 | t
 17 |   1 | 2.32 |         8 | t
 18 |   0 | 4.43 |         7 | t
 19 |   0 | 2.31 |         6 | t
 20 |   1 | 3.49 |         5 | t
 21 |   1 | 2.42 |         4 | t
 22 |   1 | 4.01 |         3 | t
 23 |   1 | 4.91 |         2 | t
 24 |   1 |    5 |         1 | t
\.

Run the Cox regression function.

DROP TABLE IF EXISTS sample_cox, sample_cox_summary;
SELECT madlib.coxph_train( 'sample_data',
                           'sample_cox',
                           'timedeath',
                           'ARRAY[grp,wbc]',
                           'status'
                         );

View the results of the regression.

\x on
SELECT * FROM sample_cox;

Results:

-[ RECORD 1 ]--+----------------------------------------------------------------------------
coef           | {2.54407073265254,1.67172094779487}
loglikelihood  | -37.8532498733
std_err        | {0.677180599294897,0.387195514577534}
z_stats        | {3.7568570855419,4.31751114064138}
p_values       | {0.000172060691513886,1.5779844638453e-05}
hessian        | {{2.78043065745617,-2.25848560642414},{-2.25848560642414,8.50472838284472}}
num_iterations | 5

Computing predictions using cox model. (This example uses the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used.)

\x off
-- Display the linear predictors for the original dataset
DROP TABLE IF EXISTS sample_pred;
SELECT madlib.coxph_predict('sample_cox',
                            'sample_data',
                            'id',
                            'sample_pred');
SELECT * FROM sample_pred;

 id |  predicted_value
----+--------------------
  0 |  -2.97110918125034
  4 |  -2.41944126847803
  6 |   -1.5167119566688
  8 |  -1.96807661257341
 10 |  0.623090856508638
 12 |  -0.58054822590367
 14 |  -1.04863009128623
 16 | -0.714285901727259
 18 |   2.01061924317838
 20 |   2.98327228490375
 22 |   3.85256717775708
 24 |     5.507570916074
  1 |  -2.93767476229444
  3 |  -1.71731847040418
  5 |  -1.09878171972008
  7 |  -2.03494545048521
  9 |  -1.78418730831598
 15 | -0.881457996506747
 19 |  -1.53342916614675
 11 |  0.993924357027849
 13 | -0.343452401208048
 17 |   1.02735877598375
 21 |   1.19453087076323
 23 |   5.35711603077246
(24 rows)

-- Display the relative risk for the original dataset
DROP TABLE IF EXISTS sample_pred;
SELECT madlib.coxph_predict('sample_cox',
                            'sample_data',
                            'id',
                            'sample_pred',
                            'risk');
SELECT * FROM sample_pred;

 id |  predicted_value
 ----+--------------------
  1 | 0.0529887971503509
  3 |  0.179546963459175
  5 |   0.33327686110022
  7 |  0.130687611255372
  9 |  0.167933483703554
 15 |  0.414178600294289
 19 |  0.215794402223054
 11 |   2.70181658768287
 13 |  0.709317242984782
 17 |   2.79367735395696
 21 |   3.30200833843654
 23 |   212.112338046551
  0 | 0.0512464372091503
  4 | 0.0889713146524469
  6 |  0.219432204682557
  8 |  0.139725343898993
 10 |   1.86468261037506
 12 |  0.559591499901241
 14 |  0.350417460388247
 16 |  0.489541567796517
 18 |   7.46794038691975
 20 |   19.7523463121038
 22 |   47.1138577624204
 24 |   246.551504798816
(24 rows)

Run the test for Proportional Hazards assumption to obtain correlation between residuals and time.

SELECT madlib.cox_zph( 'sample_cox',
                       'sample_zph_output'
                     );

View results of the PHA test.

\x on
SELECT * FROM sample_zph_output;

Results:

-[ RECORD 1 ]-----------------------------------------
covariate  | ARRAY[grp,wbc]
rho        | {0.00237308357328641,0.0375600568840431}
chi_square | {0.000100675718191977,0.0232317400546175}
p_value    | {0.991994376850758,0.878855984657948}

Technical Background

Generally, proportional-hazard models start with a list of \( \boldsymbol n \) observations, each with \( \boldsymbol m \) covariates and a time of death. From this \( \boldsymbol n \times m \) matrix, we would like to derive the correlation between the covariates and the hazard function. This amounts to finding the parameters \( \boldsymbol \beta \) that best fit the model described below.

Let us define:

\( \boldsymbol t \in \mathbf R^{m} \) denote the vector of observed dependent variables, with \( n \) rows.
\( X \in \mathbf R^{m} \) denote the design matrix with \( m \) columns and \( n \) rows, containing all observed vectors of independent variables \( \boldsymbol x_i \) as rows.
\( R(t_i) \) denote the set of observations still alive at time \( t_i \)

Note that this model does not include a constant term, and the data cannot contain a column of 1s.

By definition,

\[ P[T_k = t_i | \boldsymbol R(t_i)] = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}. \,. \]

The partial likelihood function can now be generated as the product of conditional probabilities:

\[ \mathcal L = \prod_{i = 1}^n \left( \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} \right). \]

The log-likelihood form of this equation is

\[ L = \sum_{i = 1}^n \left[ \beta^T x_i - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right) \right]. \]

Using this score function and Hessian matrix, the partial likelihood can be maximized using the Newton-Raphson algorithm. Breslow's method is used to resolved tied times of deaths. The time of death for two records are considered "equal" if they differ by less than 1.0e-6

The inverse of the Hessian matrix, evaluated at the estimate of \( \boldsymbol \beta \), can be used as an approximate variance-covariance matrix for the estimate, and used to produce approximate standard errors for the regression coefficients.

\[ \mathit{se}(c_i) = \left( (H)^{-1} \right)_{ii} \,. \]

The Wald z-statistic is

\[ z_i = \frac{c_i}{\mathit{se}(c_i)} \,. \]

The Wald \( p \)-value for coefficient \( i \) gives the probability (under the assumptions inherent in the Wald test) of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \)) is true. Letting \( F \) denote the cumulative density function of a standard normal distribution, the Wald \( p \)-value for coefficient \( i \) is therefore

\[ p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) \]

where \( Z \) is a standard normally distributed random variable.

The condition number is computed as \( \kappa(H) \) during the iteration immediately preceding convergence (i.e., \( A \) is computed using the coefficients of the previous iteration). A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.

Literature

A somewhat random selection of nice write-ups, with valuable pointers into further literature:

[1] John Fox: Cox Proportional-Hazards Regression for Survival Data, Appendix to An R and S-PLUS companion to Applied Regression Feb 2012, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf

[2] Stephen J Walters: What is a Cox model? http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf

Notes

If number of ties in the source table is very large, a memory allocation error may be raised. The limitation is about \((10^8 / m)\), where \(m\) is number of featrues. For instance, if there are 100 featrues, the number of ties should be fewer than 1 million.

Related Topics

File cox_prop_hazards.sql_in documenting the functions