MADlib 1.0 User Documentation
(Binomial) Logistic regression refers to a stochastic model in which the conditional mean of the dependent dichotomous variable (usually denoted \( Y \in \{ 0,1 \} \)) is the logistic function of an affine function of the vector of independent variables (usually denoted \( \boldsymbol x \)). That is,
\[ E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x) \]
for some unknown vector of coefficients \( \boldsymbol c \) and where \( \sigma(x) = \frac{1}{1 + \exp(-x)} \) is the logistic function. Logistic regression finds the vector of coefficients \( \boldsymbol c \) that maximizes the likelihood of the observations.
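As a quick numerical illustration (a sketch, not MADlib code; the coefficient and feature values below are made up), the logistic function and the conditional mean can be evaluated directly:

```python
import math

def sigma(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficient vector c and feature vector x, chosen
# purely for illustration; the leading 1 in x gives an intercept term.
c = [0.5, -1.0, 2.0]
x = [1.0, 0.3, 0.7]
dot = sum(ci * xi for ci, xi in zip(c, x))   # c^T x
p = sigma(dot)                               # E[Y | x] = P[Y = 1 | x]
```

Note that \( \sigma(0) = 1/2 \) and \( \sigma(-z) = 1 - \sigma(z) \), which is what makes \( \sigma \) a valid conditional probability for either class label.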
Let \( \boldsymbol y \in \{0,1\}^n \) denote the vector of observed values of the dependent variable, and let \( X \in \mathbf{R}^{n \times k} \) denote the design matrix whose \( i \)-th row is the vector of independent variables \( \boldsymbol x_i^T \). By definition,
\[ P[Y = y_i \mid \boldsymbol x_i] = \sigma((-1)^{y_i+1} \cdot \boldsymbol c^T \boldsymbol x_i) \,. \]
Maximizing the likelihood \( \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \) is equivalent to maximizing the log-likelihood \( \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \), which simplifies to
\[ l(\boldsymbol c) = -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i} \cdot \boldsymbol c^T \boldsymbol x_i)) \,. \]
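The simplification follows from the identity \( \log \sigma(z) = -\log(1 + \exp(-z)) \). A quick numerical check (illustrative Python, not part of MADlib):

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_prob(y, z):
    """log P[Y = y | x] with z = c^T x and y in {0, 1},
    using the convention P[Y = 1 | x] = sigma(z)."""
    return math.log(sigma((-1) ** (y + 1) * z))

def log_lik_term(y, z):
    """The simplified per-observation term -log(1 + exp((-1)^y * z))."""
    return -math.log(1.0 + math.exp((-1) ** y * z))

# The two expressions agree for any label and any value of c^T x:
for y in (0, 1):
    for z in (-2.0, -0.1, 0.0, 1.5):
        assert abs(log_prob(y, z) - log_lik_term(y, z)) < 1e-12
```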
The Hessian of this objective is \( H = -X^T A X \) where \( A = \text{diag}(a_1, \dots, a_n) \) is the diagonal matrix with \( a_i = \sigma(\boldsymbol c^T \boldsymbol x_i) \cdot \sigma(-\boldsymbol c^T \boldsymbol x_i) \,. \) Since \( H \) is negative semi-definite, \( l(\boldsymbol c) \) is concave, so maximizing it is a convex optimization problem. There are many techniques for solving convex optimization problems. Currently, logistic regression in MADlib can use one of three algorithms: iteratively reweighted least squares (IRLS), conjugate gradient, or incremental gradient descent.
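To sketch how a Newton-type solver (of which IRLS is one formulation) uses this gradient and Hessian, here is a deliberately minimal one-dimensional illustration in Python. This is not MADlib's implementation, and the toy data are made up; for a single coefficient the Hessian reduces to the scalar \( -\sum_i a_i x_i^2 \).

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one feature x_i per row, labels y_i in {0, 1}.
# The classes overlap, so the maximum-likelihood coefficient is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

c = 0.0
for _ in range(25):                       # Newton iterations
    # Gradient: sum_i (y_i - sigma(c x_i)) x_i
    grad = sum((y - sigma(c * x)) * x for x, y in zip(xs, ys))
    # Hessian: -sum_i a_i x_i^2 with a_i = sigma(c x_i) sigma(-c x_i)
    hess = -sum(sigma(c * x) * sigma(-c * x) * x * x for x in xs)
    c -= grad / hess                      # Newton update c <- c - H^{-1} grad

# c now maximizes the log-likelihood: large positive x predicts y = 1.
assert sigma(c * 2.0) > 0.9 > 0.1 > sigma(c * -2.0)
```

Because the objective is concave and one-dimensional here, Newton's method converges in a handful of iterations; MADlib's solvers apply the same idea to the full matrix form \( H = -X^T A X \).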
We estimate the standard error for coefficient \( i \) as the square root of the \( i \)-th diagonal entry of the inverse of the observed information matrix,
\[ \mathit{se}(c_i) = \sqrt{\left( (X^T A X)^{-1} \right)_{ii}} \,. \]
The Wald z-statistic is
\[ z_i = \frac{c_i}{\mathit{se}(c_i)} \,. \]
The Wald \( p \)-value for coefficient \( i \) gives the probability (under the assumptions inherent in the Wald test) of seeing a value at least as extreme as the one observed, provided that the null hypothesis (\( c_i = 0 \)) is true. Letting \( F \) denote the cumulative distribution function of a standard normal distribution, the Wald \( p \)-value for coefficient \( i \) is therefore
\[ p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) \]
where \( Z \) is a standard normally distributed random variable.
The odds ratio for coefficient \( i \) is estimated as \( \exp(c_i) \).
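These three statistics can be reproduced from a coefficient and its standard error; the sketch below uses the first coefficient from the worked example later on this page. Note that computing the normal CDF via `math.erf` is a simplification: for \( |z| \) this large the tail probability underflows to 0 in double precision, whereas MADlib reports a tiny but nonzero \( p \)-value (about \( 8.7 \times 10^{-69} \)).

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# First coefficient and standard error from the example output below:
c_i, se_i = 5.59049410898112, 0.318943457652178

z_i = c_i / se_i                               # Wald z-statistic
p_i = 2.0 * (1.0 - std_normal_cdf(abs(z_i)))   # two-sided Wald p-value
odds = math.exp(c_i)                           # odds ratio

# z_i is about 17.53 and odds about 267.87, matching the z_stats and
# odds_ratios columns in the example; p_i underflows to 0.0 here.
```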
The condition number is computed as \( \kappa(X^T A X) \) during the iteration immediately preceding convergence (i.e., \( A \) is computed using the coefficients of the previous iteration). A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
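For intuition: the condition number of a symmetric positive definite matrix is the ratio of its largest to its smallest eigenvalue, and nearly collinear columns of \( X \) drive that ratio up. A minimal \( 2 \times 2 \) illustration with made-up numbers (MADlib's actual computation operates on the full \( k \times k \) matrix):

```python
import math

def cond_2x2(a, b, c):
    """Condition number of the symmetric 2x2 matrix [[a, b], [b, c]],
    assumed positive definite, via its closed-form eigenvalues."""
    mean = (a + c) / 2.0
    delta = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam_max, lam_min = mean + delta, mean - delta
    return lam_max / lam_min

print(cond_2x2(1.0, 0.0, 1.0))    # identity matrix: condition number 1
print(cond_2x2(1.0, 0.999, 1.0))  # nearly collinear: ~2000, well above
                                  # the rule-of-thumb threshold of 1000
```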
The training data for logistic regression is expected to be of the following form:
{TABLE|VIEW} sourceName (
    ...
    dependentVariable BOOLEAN,
    independentVariables FLOAT8[],
    ...
)
SELECT logregr_train(
    'sourceName', 'outName', 'dependentVariable', 'independentVariables'
    [, 'grouping_columns' [, numberOfIterations [, 'optimizer' [, precision [, verbose ]]]]]
);

Output table:
 coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
------+----------------+---------+---------+----------+-------------+--------------+----------------
 ...
SELECT coef FROM outName;
SELECT coef, log_likelihood, p_values FROM outName;
- Create the sample data set:
sql> SELECT * FROM data;
                     r1                      | val
---------------------------------------------+-----
 {1,3.01789340097457,0.454183579888195}      | t
 {1,-2.59380532894284,0.602678326424211}     | f
 {1,-1.30643094424158,0.151587064377964}     | t
 {1,3.60722299199551,0.963550757616758}      | t
 {1,-1.52197745628655,0.0782248834148049}    | t
 {1,-4.8746574902907,0.345104880165309}      | f
 ...
- Run the logistic regression function:
sql> \x on
Expanded display is on.
sql> SELECT logregr_train('data', 'out_tbl', 'val', 'r1', Null, 100, 'irls', 0.001);
sql> SELECT * from out_tbl;
coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
log_likelihood | -467.214718489873
std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
condition_no   | 179.186118573205
num_iterations | 9
A somewhat random selection of nice write-ups, with valuable pointers into further literature.
[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November 2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf
[2] Thomas P. Minka: A comparison of numerical optimizers for logistic regression, 2003 (revised Mar 26, 2007), http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf
[3] Paul Komarek, Andrew W. Moore: Making Logistic Regression A Core Data Mining Tool With TR-IRLS, IEEE International Conference on Data Mining 2005, pp. 685-688, http://komarix.org/ac/papers/tr-irls.short.pdf
[4] D. P. Bertsekas: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey, Technical report, Laboratory for Information and Decision Systems, 2010, http://web.mit.edu/dimitrib/www/Incremental_Survey_LIDS.pdf
[5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro: Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19(4), 2009, http://www2.isye.gatech.edu/~nemirovs/SIOPT_RSA_2009.pdf