User Documentation
multilogistic.sql_in
Go to the documentation of this file.
00001 /* ----------------------------------------------------------------------- *//**
00002  *
00003  * @file multilogistic.sql_in
00004  *
00005  * @brief SQL functions for multinomial logistic regression
00006  * @date July 2012
00007  *
00008  * @sa For a brief introduction to multinomial logistic regression, see the
00009  *     module description \ref grp_mlogreg.
00010  *
00011  *//* ----------------------------------------------------------------------- */
00012 
00013 m4_include(`SQLCommon.m4')
00014 
00015 /**
00016 @addtogroup grp_mlogreg
00017 
00018 @about
00019 
00020 Multinomial logistic regression is a widely used regression analysis tool that models the outcomes of categorical dependent
00021 random variables (denoted \f$ Y \in \{ 0,1,2 \ldots k \} \f$). The models assumes that the conditional mean of the
00022 dependant categorical variables is the logistic function of an affine combination of independent
00023 variables (usually denoted \f$ \boldsymbol x \f$). That is,
00024 \f[
00025     E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x)
00026 \f]
00027 for some unknown vector of coefficients \f$ \boldsymbol c \f$ and where
00028 \f$ \sigma(x) = \frac{1}{1 + \exp(-x)} \f$ is the logistic function. Multinomial Logistic
00029 regression finds the vector of coefficients \f$ \boldsymbol c \f$ that maximizes
00030 the likelihood of the observations.
00031 
00032 Let
00033 - \f$ \boldsymbol y \in \{ 0,1 \}^{n \times k} \f$ denote the vector of observed dependent
00034   variables, with \f$ n \f$ rows and \f$ k \f$ columns, containing the observed values of the
00035   dependent variable,
00036 - \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
00037   columns and \f$ n \f$ rows, containing all observed vectors of independent
00038   variables \f$ \boldsymbol x_i \f$ as rows.
00039 
00040 By definition,
00041 \f[
00042     P[Y = y_i | \boldsymbol x_i]
00043     =   \sigma((-1)^{y_i} \cdot \boldsymbol c^T \boldsymbol x_i)
00044     \,.
00045 \f]
00046 Maximizing the likelihood
00047 \f$ \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \f$
00048 is equivalent to maximizing the log-likelihood
00049 \f$ \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \f$, which simplifies to
00050 \f[
00051     l(\boldsymbol c) =
00052         -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i}
00053             \cdot \boldsymbol c^T \boldsymbol x_i))
00054     \,.
00055 \f]
00056 The Hessian of this objective is \f$ H = -X^T A X \f$ where
00057 \f$ A = \text{diag}(a_1, \dots, a_n) \f$ is the diagonal matrix with
00058 \f$
00059     a_i = \sigma(\boldsymbol c^T \boldsymbol x)
00060           \cdot
00061           \sigma(-\boldsymbol c^T \boldsymbol x)
00062     \,.
00063 \f$
00064 Since \f$ H \f$ is non-positive definite, \f$ l(\boldsymbol c) \f$ is convex.
00065 There are many techniques for solving convex optimization problems. Currently,
00066 logistic regression in MADlib can use:
00067 - Iteratively Reweighted Least Squares
00068 
00069 We estimate the standard error for coefficient \f$ i \f$ as
00070 \f[
00071     \mathit{se}(c_i) = \left( (X^T A X)^{-1} \right)_{ii}
00072     \,.
00073 \f]
00074 The Wald z-statistic is
00075 \f[
00076     z_i = \frac{c_i}{\mathit{se}(c_i)}
00077     \,.
00078 \f]
00079 
00080 The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
00081 the assumptions inherent in the Wald test) of seeing a value at least as extreme
00082 as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
00083 true. Letting \f$ F \f$ denote the cumulative density function of a standard
00084 normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$ is
00085 therefore
00086 \f[
00087     p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
00088 \f]
00089 where \f$ Z \f$ is a standard normally distributed random variable.
00090 
00091 The odds ratio for coefficient \f$ i \f$ is estimated as \f$ \exp(c_i) \f$.
00092 
00093 The condition number is computed as \f$ \kappa(X^T A X) \f$ during the iteration
00094 immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
00095 the coefficients of the previous iteration). A large condition number (say, more
00096 than 1000) indicates the presence of significant multicollinearity.
00097 
00098 
00099 @input
00100 
00101 The training data is expected to be of the following form:\n
00102 <pre>{TABLE|VIEW} <em>sourceName</em> (
00103     ...
00104     <em>dependentVariable</em> INTEGER,
00105     <em>numCategories</em> INTEGER,
00106     <em>independentVariables</em> FLOAT8[],
00107     ...
00108 )</pre>
00109 
00110 @usage
00111 - The number of independent variables cannot exceed 65535.
00112 - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
00113   statistics:\n
00114   <pre>SELECT * FROM \ref mlogregr(
00115     '<em>sourceName</em>', '<em>dependentVariable</em>', '<em>numCategories</em>' , '<em>independentVariables</em>'
00116     [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ]
00117 );</pre>
00118   Output:
00119   <pre>coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
00120 -----+----------------+---------+---------+----------+-------------+--------------+---------------
00121                                                ...
00122 </pre>
00123 - Get vector of coefficients \f$ \boldsymbol c \f$:\n
00124   <pre>SELECT (\ref mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>numCategories</em>', '<em>independentVariables</em>')).coef;</pre>
00125 - Get a subset of the output columns, e.g., only the array of coefficients
00126   \f$ \boldsymbol c \f$, the log-likelihood of determination
00127   \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
00128   <pre>SELECT coef, log_likelihood, p_values
00129 FROM \ref mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>numCategories</em>',  '<em>independentVariables</em>');</pre>
00130 
00131 Note that the categories are encoded as integers with values from {0, 1, 2,...numCategories}
00132 @examp
00133 
00134 -# Create the sample data set:
00135 @verbatim
00136 sql> SELECT * FROM data;
00137                   r1                      | val
00138 ---------------------------------------------+-----
00139  {1,3.01789340097457,0.454183579888195}   | 1
00140  {1,-2.59380532894284,0.602678326424211}  | 0
00141  {1,-1.30643094424158,0.151587064377964}  | 1
00142  {1,3.60722299199551,0.963550757616758}   | 1
00143  {1,-1.52197745628655,0.0782248834148049} | 1
00144  {1,-4.8746574902907,0.345104880165309}   | 0
00145 ...
00146 @endverbatim
00147 -# Run the multi-logistic regression function:
00148 @verbatim
00149 sql> \x on
00150 Expanded display is off.
00151 sql> SELECT * FROM mlogregr('data', 'val', '2', 'r1', 100, 'irls', 0.001);
00152 -[ RECORD 1 ]--+--------------------------------------------------------------
00153 coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
00154 log_likelihood | -467.214718489873
00155 std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
00156 z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
00157 p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
00158 odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
00159 condition_no   | 179.186118573205
00160 num_iterations | 9
00161 
00162 @endverbatim
00163 
00164 @literature
00165 
00166 A collection of nice write-ups, with valuable pointers into
00167 further literature:
00168 
00169 [1] Annette J . Dobson: An Introduction to Generalized Linear Models, Second Edition.  Nov 2001
00170 
00171 [2] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November
00172     2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf
00173 
00174 [3] Srikrishna Sridhar, Mark Wellons, Caleb Welton: Multilogistic Regression:
00175         Notes and References, Jul 12 2012, http://www.cs.wisc.edu/~srikris/mlogit.pdf
00176 
00177 [4] Scott A. Czepiel: Maximum Likelihood Estimation
00178         of Logistic Regression Models: Theory and Implementation,
00179         Retrieved Jul 12 2012, http://czep.net/stat/mlelr.pdf
00180 
00181 
00182 
00183 @sa File multilogistic.sql_in (documenting the SQL functions)
00184 
00185 @internal
00186 @sa Namespace multilogistic (documenting the driver/outer loop implemented in
00187     Python), Namespace
00188     \ref madlib::modules::regress documenting the implementation in C++
00189 @endinternal
00190 
00191 */
00192 
00193 
00194 DROP TYPE IF EXISTS MADLIB_SCHEMA.mlogregr_result;
00195 CREATE TYPE MADLIB_SCHEMA.mlogregr_result AS (
00196     coef DOUBLE PRECISION[],
00197     log_likelihood DOUBLE PRECISION,
00198     std_err DOUBLE PRECISION[],
00199     z_stats DOUBLE PRECISION[],
00200     p_values DOUBLE PRECISION[],
00201     odds_ratios DOUBLE PRECISION[],
00202     condition_no DOUBLE PRECISION,
00203     num_iterations INTEGER
00204 );
00205 
00206 
00207 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_irls_step_transition(
00208     DOUBLE PRECISION[],
00209     INTEGER,
00210     INTEGER,
00211     DOUBLE PRECISION[],
00212     DOUBLE PRECISION[])
00213 RETURNS DOUBLE PRECISION[]
00214 AS 'MODULE_PATHNAME'
00215 LANGUAGE C IMMUTABLE;
00216 
00217 
00218 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_irls_step_merge_states(
00219     state1 DOUBLE PRECISION[],
00220     state2 DOUBLE PRECISION[])
00221 RETURNS DOUBLE PRECISION[]
00222 AS 'MODULE_PATHNAME'
00223 LANGUAGE C IMMUTABLE STRICT;
00224 
00225 
00226 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_irls_step_final(
00227     state DOUBLE PRECISION[])
00228 RETURNS DOUBLE PRECISION[]
00229 AS 'MODULE_PATHNAME'
00230 LANGUAGE C IMMUTABLE STRICT;
00231 
00232 
00233 /**
00234  * @internal
00235  * @brief Perform one iteration of the iteratively-reweighted-least-squares
00236  *        method for computing linear regression
00237  */
00238 CREATE AGGREGATE MADLIB_SCHEMA.mlogregr_irls_step(
00239     /*+ y */ INTEGER,
00240     /*+ numCategories */ INTEGER,
00241     /*+ x */ DOUBLE PRECISION[],
00242     /*+ previous_state */ DOUBLE PRECISION[]) (
00243 
00244     STYPE=DOUBLE PRECISION[],
00245     SFUNC=MADLIB_SCHEMA.mlogregr_irls_step_transition,
00246     m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.mlogregr_irls_step_merge_states,')
00247     FINALFUNC=MADLIB_SCHEMA.mlogregr_irls_step_final,
00248     INITCOND='{0,0,0,0}'
00249 );
00250 
00251 
00252 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_mlogregr_irls_step_distance(
00253     /*+ state1 */ DOUBLE PRECISION[],
00254     /*+ state2 */ DOUBLE PRECISION[])
00255 RETURNS DOUBLE PRECISION AS
00256 'MODULE_PATHNAME'
00257 LANGUAGE c IMMUTABLE STRICT;
00258 
00259 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_mlogregr_irls_result(
00260     /*+ state */ DOUBLE PRECISION[])
00261 RETURNS MADLIB_SCHEMA.mlogregr_result AS
00262 'MODULE_PATHNAME'
00263 LANGUAGE c IMMUTABLE STRICT;
00264 
00265 
00266 -- We only need to document the last one (unfortunately, in Greenplum we have to
00267 -- use function overloading instead of default arguments).
00268 CREATE FUNCTION MADLIB_SCHEMA.compute_mlogregr(
00269     "source" VARCHAR,
00270     "depvar" VARCHAR,
00271     "numcategories" INTEGER,
00272     "indepvar" VARCHAR,
00273     "maxnumiterations" INTEGER,
00274     "optimizer" VARCHAR,
00275     "precision" DOUBLE PRECISION)
00276 RETURNS INTEGER
00277 AS $$PythonFunction(regress, multilogistic, compute_mlogregr)$$
00278 LANGUAGE plpythonu VOLATILE;
00279 
00280 /**
00281  * @brief Compute logistic-regression coefficients and diagnostic statistics
00282  *
00283  * To include an intercept in the model, set one coordinate in the
00284  * <tt>independentVariables</tt> array to 1.
00285  *
00286  * @param source Name of the source relation containing the training data
00287  * @param depvar Name of the dependent column (of type INTEGER < numcategories - 1)
00288  * @param numcategories Number of categories for the dependant variables (
00289                     of type INTEGER)
00290  * @param indepvar Name of the independent column (of type DOUBLE
00291  *        PRECISION[])
00292  * @param maxnumiterations The maximum number of iterations
00293  * @param optimizer The optimizer to use (
00294  *        <tt>'irls'</tt>/<tt>'newton'</tt> for iteratively reweighted least
00295  *        squares)
00296  * @param precision The difference between log-likelihood values in successive
00297  *        iterations that should indicate convergence. Note that a non-positive
00298  *        value here disables the convergence criterion, and execution will only
00299  *        stop after \ maxNumIterations iterations.
00300  *
00301  * @return A composite value:
00302  *  - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
00303  *  - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
00304  *  - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
00305  *    \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
00306  *  - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
00307  *  - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
00308  *  - <tt>odds_ratios FLOAT8[]</tt>: Array of odds ratios,
00309  *    \f$ \mathit{odds}(c_1), \dots, \mathit{odds}(c_k) \f$
00310  *  - <tt>condition_no FLOAT8</tt> - The condition number of matrix
00311  *    \f$ X^T A X \f$ during the iteration immediately <em>preceding</em>
00312  *    convergence (i.e., \f$ A \f$ is computed using the coefficients of the
00313  *    previous iteration)
00314  *  - <tt>num_iterations INTEGER</tt> - The number of iterations before the
00315  *    algorithm terminated
00316  *
00317  * @usage
00318  *  - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
00319  *    statistics:\n
00320  *    <pre>SELECT * FROM mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
00321  *    '<em>numCategories</em>', '<em>independentVariables</em>');</pre>
00322  *  - Get vector of coefficients \f$ \boldsymbol c \f$:\n
00323  *    <pre>SELECT (mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
00324  *    '<em>numCategories</em>', '<em>independentVariables</em>')).coef;</pre>
00325  *  - Get a subset of the output columns, e.g., only the array of coefficients
00326  *    \f$ \boldsymbol c \f$, the log-likelihood of determination
00327  *    \f$ l(\boldsymbol c) \f$, and the array of p-values \f$ \boldsymbol p \f$:
00328  *    <pre>SELECT coef, log_likelihood, p_values
00329  *    FROM mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
00330  *   '<em>numCategories</em>', '<em>independentVariables</em>');</pre>
00331  *
00332  * @note This function starts an iterative algorithm. It is not an aggregate
00333  *       function. Source and column names have to be passed as strings (due to
00334  *       limitations of the SQL syntax).
00335  *
00336  * @internal
00337  * @sa This function is a wrapper for multilogistic::compute_mlogregr(), which
00338  *     sets the default values.
00339  */
00340 CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
00341     "source" VARCHAR,
00342     "depvar" VARCHAR,
00343     "numcategories" INTEGER,
00344     "indepvar" VARCHAR,
00345     "maxnumiterations" INTEGER /*+ DEFAULT 20 */,
00346     "optimizer" VARCHAR /*+ DEFAULT 'irls' */,
00347     "precision" DOUBLE PRECISION /*+ DEFAULT 0.0001 */)
00348 RETURNS MADLIB_SCHEMA.mlogregr_result AS $$
00349 DECLARE
00350     observed_count INTEGER;
00351     theIteration INTEGER;
00352     fnName VARCHAR;
00353     theResult MADLIB_SCHEMA.mlogregr_result;
00354 BEGIN
00355     IF (SELECT atttypid::regtype <> 'INTEGER'::regtype
00356         FROM pg_attribute
00357         WHERE attrelid = source::regclass AND attname = depvar) THEN
00358         RAISE EXCEPTION 'The dependent variable column should be of type INTEGER';
00359     END IF;
00360 
00361     EXECUTE $sql$ SELECT count(DISTINCT $sql$ || depvar || $sql$ )
00362                     FROM $sql$ || textin(regclassout(source))
00363             INTO observed_count;
00364     IF observed_count <> numcategories
00365     THEN
00366         RAISE WARNING 'Results will be undefined, if ''numcategories'' is not
00367          same as the number of distinct categories observed in the training data.';
00368     END IF;
00369 
00370     IF optimizer = 'irls' OR optimizer = 'newton' THEN
00371         fnName := 'internal_mlogregr_irls_result';
00372     ELSE
00373         RAISE EXCEPTION 'Unknown optimizer (''%'')', optimizer;
00374     END IF;
00375 
00376     theIteration := (
00377         SELECT MADLIB_SCHEMA.compute_mlogregr($1, $2, $3, $4, $5, $6, $7)
00378     );
00379     -- Because of Greenplum bug MPP-10050, we have to use dynamic SQL (using
00380     -- EXECUTE) in the following
00381     -- Because of Greenplum bug MPP-6731, we have to hide the tuple-returning
00382     -- function in a subquery
00383     EXECUTE
00384         $sql$
00385         SELECT (result).*
00386         FROM (
00387             SELECT
00388                 MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result
00389                 FROM _madlib_iterative_alg
00390                 WHERE _madlib_iteration = $sql$ || theIteration || $sql$
00391             ) subq
00392         $sql$
00393         INTO theResult;
00394     -- The number of iterations are not updated in the C++ code. We do it here.
00395     IF NOT (theResult IS NULL) THEN
00396         theResult.num_iterations = theIteration;
00397     END IF;
00398     RETURN theResult;
00399 END;
00400 $$ LANGUAGE plpgsql VOLATILE;
00401 
00402 
00403 CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
00404     "source" VARCHAR,
00405     "depvar" VARCHAR,
00406     "numcategories" INTEGER,
00407     "indepvar" VARCHAR)
00408 RETURNS MADLIB_SCHEMA.mlogregr_result AS
00409 $$SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, 20, 'irls', 0.0001);$$
00410 LANGUAGE sql VOLATILE;
00411 
00412 CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
00413     "source" VARCHAR,
00414     "depvar" VARCHAR,
00415     "numcategories" INTEGER,
00416     "indepvar" VARCHAR,
00417     "maxnumiterations" INTEGER)
00418 RETURNS MADLIB_SCHEMA.mlogregr_result AS
00419 $$SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, $5, 'irls', 0.0001);$$
00420 LANGUAGE sql VOLATILE;
00421 
00422 CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
00423     "source" VARCHAR,
00424     "depvar" VARCHAR,
00425     "numcategories" INTEGER,
00426     "indepvar" VARCHAR,
00427     "maxbumiterations" INTEGER,
00428     "optimizer" VARCHAR)
00429 RETURNS MADLIB_SCHEMA.mlogregr_result AS
00430 $$SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, $5, $6, 0.0001);$$
00431 LANGUAGE sql VOLATILE;
00432