User Documentation
Logistic Regression

(Binomial) Logistic regression refers to a stochastic model in which the conditional mean of the dependent dichotomous variable (usually denoted \( Y \in \{ 0,1 \} \)) is the logistic function of an affine function of the vector of independent variables (usually denoted \( \boldsymbol x \)). That is,

\[ E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x) \]

for some unknown vector of coefficients \( \boldsymbol c \) and where \( \sigma(x) = \frac{1}{1 + \exp(-x)} \) is the logistic function. Logistic regression finds the vector of coefficients \( \boldsymbol c \) that maximizes the likelihood of the observations.
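As a small illustration of the model equation, the following sketch evaluates \( \sigma(\boldsymbol c^T \boldsymbol x) \) in plain Python. The function names, the coefficient values, and the input vector are assumptions made for this example only; they are not part of MADlib.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), evaluated so that exp never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def conditional_mean(c, x):
    """E[Y | x] = sigma(c^T x) for a coefficient vector c and a feature vector x."""
    return sigmoid(sum(ci * xi for ci, xi in zip(c, x)))

# Hypothetical coefficients; the leading 1.0 in x acts as an intercept term,
# which is how a linear form c^T x expresses an affine function.
c = [0.5, -1.0]
x = [1.0, 0.5]
prob = conditional_mean(c, x)  # c^T x = 0 here, so the conditional mean is 0.5
```

Because \( Y \) only takes the values 0 and 1, this conditional mean is also the probability that \( Y = 1 \) given \( \boldsymbol x \).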


By definition,

\[ \Pr(Y = y_i \mid \boldsymbol x_i) = \sigma((-1)^{(1 - y_i)} \cdot \boldsymbol c^T \boldsymbol x_i) \,. \]

Maximizing the likelihood \( \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \) is equivalent to maximizing the log-likelihood \( \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \), which simplifies to

\[ l(\boldsymbol c) = -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i} \cdot \boldsymbol c^T \boldsymbol x_i)) \,. \]
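The simplification can be checked numerically. The sketch below computes the log-likelihood once from the closed form \( -\sum_i \log(1 + \exp((-1)^{y_i} \cdot \boldsymbol c^T \boldsymbol x_i)) \) and once as a direct sum of log-probabilities, using \( \Pr(Y = 1 \mid \boldsymbol x) = \sigma(\boldsymbol c^T \boldsymbol x) \). The data, the coefficients, and all function names are made up for illustration.

```python
import math

def sigmoid(t):
    """Logistic function, evaluated so that exp never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def softplus(t):
    """log(1 + exp(t)), evaluated so that exp never overflows."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))

def log_likelihood(c, xs, ys):
    """l(c) = -sum_i log(1 + exp((-1)^{y_i} * c^T x_i)) with y_i in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = sum(cj * xj for cj, xj in zip(c, x))
        sign = -1.0 if y == 1 else 1.0  # this is (-1)^{y_i}
        total -= softplus(sign * t)
    return total

def log_likelihood_from_probabilities(c, xs, ys):
    """sum_i log Pr(Y = y_i | x_i), with Pr(Y = 1 | x) = sigma(c^T x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = sigmoid(sum(cj * xj for cj, xj in zip(c, x)))
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total

# Made-up data: the first component of each x is a constant 1 (intercept).
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
ys = [0, 0, 1, 1]
c = [0.2, 0.7]

simplified = log_likelihood(c, xs, ys)
direct = log_likelihood_from_probabilities(c, xs, ys)
```

Both values agree up to floating-point rounding. The closed form is the better choice numerically, since it never takes the logarithm of a probability that has rounded to zero.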

The Hessian of this objective is \( H = -X^T A X \), where \( X \) is the matrix whose \( i \)-th row is \( \boldsymbol x_i^T \) and \( A = \text{diag}(a_1, \dots, a_n) \) is the diagonal matrix with \( a_i = \sigma(\boldsymbol c^T \boldsymbol x_i) \cdot \sigma(-\boldsymbol c^T \boldsymbol x_i) \). Since \( H \) is negative semidefinite, \( l(\boldsymbol c) \) is concave, so maximizing it is a convex optimization problem. There are many techniques for solving convex optimization problems. Currently, logistic regression in MADlib can use one of three algorithms, selected through the optimizer argument of the training function.
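To make the curvature statement concrete, the sketch below evaluates the quadratic form \( \boldsymbol v^T H \boldsymbol v = -\sum_i a_i (\boldsymbol x_i^T \boldsymbol v)^2 \), which can never be positive, and runs a few steps of plain gradient ascent on the log-likelihood. Gradient ascent with a fixed step is used here only because it is short; it is an assumption of this example and is not claimed to be one of MADlib's algorithms. The data, the step size, and all names are likewise hypothetical.

```python
import math

def sigmoid(t):
    """Logistic function, evaluated so that exp never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def log_likelihood(c, xs, ys):
    """Sum of log Pr(Y = y_i | x_i), with Pr(Y = 1 | x) = sigma(c^T x)."""
    return sum(
        math.log(sigmoid(dot(c, x)) if y == 1 else sigmoid(-dot(c, x)))
        for x, y in zip(xs, ys)
    )

def gradient(c, xs, ys):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(c^T x_i)) * x_i."""
    g = [0.0] * len(c)
    for x, y in zip(xs, ys):
        residual = y - sigmoid(dot(c, x))
        for j, xj in enumerate(x):
            g[j] += residual * xj
    return g

def hessian_quadratic_form(c, xs, v):
    """v^T H v = -sum_i a_i (x_i^T v)^2, a_i = sigma(c^T x_i) * sigma(-c^T x_i)."""
    return -sum(
        sigmoid(dot(c, x)) * sigmoid(-dot(c, x)) * dot(x, v) ** 2 for x in xs
    )

# Made-up, non-separable data; the first component of each x is the intercept.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
ys = [0, 0, 1, 1]

c = [0.0, 0.0]
step = 0.1  # hypothetical fixed step size, small enough for this data set
history = [log_likelihood(c, xs, ys)]
for _ in range(50):
    g = gradient(c, xs, ys)
    c = [cj + step * gj for cj, gj in zip(c, g)]
    history.append(log_likelihood(c, xs, ys))

# Curvature along an arbitrary direction at the final iterate.
curvature = hessian_quadratic_form(c, xs, [1.0, -2.0])
```

With this data and step size, the recorded log-likelihood values do not decrease from one iteration to the next, and the curvature along the chosen direction is negative.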

We estimate the standard error for coefficient \( i \) as

\[ \mathit{se}(c_i) = \sqrt{ \left( (X^T A X)^{-1} \right)_{ii} } \,. \]

The Wald z-statistic is

\[ z_i = \frac{c_i}{\mathit{se}(c_i)} \,. \]

The Wald \( p \)-value for coefficient \( i \) gives the probability (under the assumptions inherent in the Wald test) of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \)) is true. Letting \( F \) denote the cumulative distribution function of a standard normal distribution, the Wald \( p \)-value for coefficient \( i \) is therefore

\[ p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) \]

where \( Z \) is a standard normally distributed random variable.

The odds ratio for coefficient \( i \) is estimated as \( \exp(c_i) \).
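The diagnostic formulas above can be reproduced from a coefficient vector and its standard errors. The sketch below does so in plain Python, using the coef and std_err values printed in the example output further down this page. The function name is hypothetical. The \( p \)-value is computed as erfc(|z| / sqrt(2)), which is algebraically equal to \( 2 \cdot (1 - F(|z_i|)) \) but avoids the cancellation that makes the direct formula round to zero for large \( |z_i| \).

```python
import math

def wald_statistics(coef, std_err):
    """Return (z-statistics, two-sided Wald p-values, odds ratios)."""
    z = [c / s for c, s in zip(coef, std_err)]
    # 2 * (1 - F(|z|)) equals erfc(|z| / sqrt(2)) for the standard normal CDF F.
    p = [math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z]
    odds = [math.exp(c) for c in coef]
    return z, p, odds

# Values copied from the example output on this page.
coef = [5.59049410898112, 2.11077546770772, -0.237276684606453]
std_err = [0.318943457652178, 0.101518723785383, 0.294509929481773]

z_stats, p_values, odds_ratios = wald_statistics(coef, std_err)
```

The results should match the z_stats, p_values, and odds_ratios columns of that example up to rounding.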

The condition number is computed as \( \kappa(X^T A X) \) during the iteration immediately preceding convergence (i.e., \( A \) is computed using the coefficients of the previous iteration). A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
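To give a feel for how near-collinearity inflates the condition number, the sketch below computes it for a symmetric positive definite 2×2 matrix standing in for \( X^T A X \). Two assumptions are made here that the text above does not state: the 2-norm condition number (largest eigenvalue divided by smallest) is used, and the matrices are made up rather than derived from real data. The function name is hypothetical.

```python
import math

def condition_number_sym2x2(m):
    """2-norm condition number of a symmetric positive definite 2x2 matrix."""
    (a, b), (_, d) = m
    half_trace = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    lam_max = half_trace + radius  # larger eigenvalue
    lam_min = half_trace - radius  # smaller eigenvalue, assumed positive
    return lam_max / lam_min

# Well-conditioned case: eigenvalues 4 and 1.
well_conditioned = condition_number_sym2x2([[4.0, 0.0], [0.0, 1.0]])

# Nearly collinear case: the off-diagonal entry is almost as large as the
# diagonal, so one eigenvalue is close to zero.
nearly_collinear = condition_number_sym2x2([[1.0, 0.999], [0.999, 1.0]])
```

The second matrix has a condition number of roughly 2000, above the threshold of 1000 mentioned above.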


The training data is expected to be of the following form:

{TABLE|VIEW} sourceName (
    ...
    dependentVariable BOOLEAN,
    independentVariables FLOAT8[],
    ...
)
  • Get vector of coefficients \( \boldsymbol c \) and all diagnostic statistics:
    SELECT logregr_train(
        'sourceName', 'outName', 'dependentVariable',
        'independentVariables'[, 'grouping_columns'
        [, numberOfIterations [, 'optimizer' [, precision
        [, verbose ]] ] ] ] );
    Output table:
    coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
  • Get vector of coefficients \( \boldsymbol c \):
    SELECT coef FROM outName;
  • Get a subset of the output columns, e.g., only the array of coefficients \( \boldsymbol c \), the log-likelihood \( l(\boldsymbol c) \), and the array of p-values \( \boldsymbol p \):
    SELECT coef, log_likelihood, p_values FROM outName; 
  • The option verbose defaults to FALSE. When it is set to TRUE, warning messages are output to the SQL client for groups that failed.
  1. Create the sample data set:
    sql> SELECT * FROM data;
                      r1                      | val
     {1,3.01789340097457,0.454183579888195}   | t
     {1,-2.59380532894284,0.602678326424211}  | f
     {1,-1.30643094424158,0.151587064377964}  | t
     {1,3.60722299199551,0.963550757616758}   | t
     {1,-1.52197745628655,0.0782248834148049} | t
     {1,-4.8746574902907,0.345104880165309}   | f
  2. Run the logistic regression function:
    sql> \x on
    Expanded display is on.
    sql> SELECT logregr_train('data', 'out_tbl', 'val', 'r1', NULL, 100, 'irls', 0.001);
    sql> SELECT * FROM out_tbl;
    coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
    log_likelihood | -467.214718489873
    std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
    z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
    p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
    odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
    condition_no   | 179.186118573205
    num_iterations | 9
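The fitted coefficients can be turned into predicted probabilities with the model equation \( \sigma(\boldsymbol c^T \boldsymbol x) \). The sketch below does this in plain Python for the six rows shown in step 1, using the coef array from the output above. It is an illustration only: the 0.5 classification threshold and all variable names are assumptions, and the code does not call any MADlib function.

```python
import math

def sigmoid(t):
    """Logistic function, evaluated so that exp never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

# Coefficients copied from the coef column of out_tbl.
coef = [5.59049410898112, 2.11077546770772, -0.237276684606453]

# Rows copied from the sample data: (r1, val).
rows = [
    ([1, 3.01789340097457, 0.454183579888195], True),
    ([1, -2.59380532894284, 0.602678326424211], False),
    ([1, -1.30643094424158, 0.151587064377964], True),
    ([1, 3.60722299199551, 0.963550757616758], True),
    ([1, -1.52197745628655, 0.0782248834148049], True),
    ([1, -4.8746574902907, 0.345104880165309], False),
]

predictions = []
for x, observed in rows:
    prob = sigmoid(sum(cj * xj for cj, xj in zip(coef, x)))
    # Assumed decision rule: predict true when the probability is at least 0.5.
    predictions.append((prob, prob >= 0.5, observed))
```

For these six rows the thresholded prediction agrees with the observed value in every case. The second row is a borderline case, with a predicted probability just below 0.5.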

A somewhat random selection of nice write-ups, with valuable pointers into further literature:

[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November 2009.

[2] Thomas P. Minka: A comparison of numerical optimizers for logistic regression, 2003 (revised Mar 26, 2007).

[3] Paul Komarek, Andrew W. Moore: Making Logistic Regression A Core Data Mining Tool With TR-IRLS, IEEE International Conference on Data Mining 2005, pp. 685-688.

[4] D. P. Bertsekas: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey, Technical report, Laboratory for Information and Decision Systems, 2010.

[5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro: Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19(4), 2009.

See also:
File logistic.sql_in (documenting the SQL functions)