MADlib 0.7 User Documentation
/* ----------------------------------------------------------------------- *//**
 *
 * @file multilogistic.sql_in
 *
 * @brief SQL functions for multinomial logistic regression
 * @date July 2012
 *
 * @sa For a brief introduction to multinomial logistic regression, see the
 *     module description \ref grp_mlogreg.
 *
 *//* ----------------------------------------------------------------------- */

m4_include(`SQLCommon.m4')

/**
@addtogroup grp_mlogreg

@about

Multinomial logistic regression is a widely used regression analysis tool that
models the outcomes of categorical dependent random variables (denoted
\f$ Y \in \{ 0, 1, 2, \ldots, k \} \f$). The model assumes that the conditional
mean of the dependent categorical variables is the logistic function of an
affine combination of independent variables (usually denoted
\f$ \boldsymbol x \f$). That is,
\f[
    E[Y \mid \boldsymbol x] = \sigma(\boldsymbol c^T \boldsymbol x)
\f]
for some unknown vector of coefficients \f$ \boldsymbol c \f$ and where
\f$ \sigma(x) = \frac{1}{1 + \exp(-x)} \f$ is the logistic function. Multinomial
logistic regression finds the vector of coefficients \f$ \boldsymbol c \f$ that
maximizes the likelihood of the observations.

Let
- \f$ \boldsymbol y \in \{ 0, 1 \}^{n \times k} \f$ denote the matrix of
  observed dependent variables, with \f$ n \f$ rows and \f$ k \f$ columns,
  containing the observed values of the dependent variable,
- \f$ X \in \mathbf R^{n \times k} \f$ denote the design matrix with \f$ k \f$
  columns and \f$ n \f$ rows, containing all observed vectors of independent
  variables \f$ \boldsymbol x_i \f$ as rows.

By definition,
\f[
    P[Y = y_i \mid \boldsymbol x_i]
    =   \sigma((-1)^{y_i + 1} \cdot \boldsymbol c^T \boldsymbol x_i)
    \,.
\f]
Maximizing the likelihood
\f$ \prod_{i=1}^n \Pr(Y = y_i \mid \boldsymbol x_i) \f$
is equivalent to maximizing the log-likelihood
\f$ \sum_{i=1}^n \log \Pr(Y = y_i \mid \boldsymbol x_i) \f$, which simplifies to
\f[
    l(\boldsymbol c) =
        -\sum_{i=1}^n \log(1 + \exp((-1)^{y_i}
            \cdot \boldsymbol c^T \boldsymbol x_i))
    \,.
\f]
The Hessian of this objective is \f$ H = -X^T A X \f$ where
\f$ A = \text{diag}(a_1, \dots, a_n) \f$ is the diagonal matrix with
\f$
    a_i = \sigma(\boldsymbol c^T \boldsymbol x_i)
          \cdot
          \sigma(-\boldsymbol c^T \boldsymbol x_i)
    \,.
\f$
Since \f$ H \f$ is negative semi-definite, \f$ l(\boldsymbol c) \f$ is concave,
and maximizing it is a convex optimization problem. There are many techniques
for solving such problems. Currently, multinomial logistic regression in MADlib
can use:
- Iteratively Reweighted Least Squares

We estimate the standard error for coefficient \f$ i \f$ as
\f[
    \mathit{se}(c_i) = \sqrt{\left( (X^T A X)^{-1} \right)_{ii}}
    \,.
\f]
The Wald z-statistic is
\f[
    z_i = \frac{c_i}{\mathit{se}(c_i)}
    \,.
\f]

The Wald \f$ p \f$-value for coefficient \f$ i \f$ gives the probability (under
the assumptions inherent in the Wald test) of seeing a value at least as extreme
as the one observed, provided that the null hypothesis (\f$ c_i = 0 \f$) is
true. Letting \f$ F \f$ denote the cumulative distribution function of a
standard normal distribution, the Wald \f$ p \f$-value for coefficient \f$ i \f$
is therefore
\f[
    p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| ))
\f]
where \f$ Z \f$ is a standard normally distributed random variable.

The odds ratio for coefficient \f$ i \f$ is estimated as \f$ \exp(c_i) \f$.
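As a sanity check outside the database, the per-coefficient diagnostics defined
above (Wald z-statistic, two-sided p-value, and odds ratio) can be reproduced in
a few lines of Python. This is an illustrative sketch only; the function names
`sigma` and `wald_stats` are hypothetical, and MADlib computes these statistics
in its C++ implementation.

```python
import math

def sigma(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def wald_stats(coef, se):
    """Given a coefficient c_i and its standard error se(c_i), return the
    Wald z-statistic, the two-sided p-value, and the odds ratio exp(c_i)."""
    z = coef / se
    # F is the standard normal CDF; p = 2 * (1 - F(|z|))
    F = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p = 2.0 * (1.0 - F)
    return z, p, math.exp(coef)

# Example: a coefficient of 2.0 with standard error 1.0
z, p, odds = wald_stats(2.0, 1.0)
```

For these inputs the z-statistic is 2.0 and the p-value is roughly 0.0455, the
familiar "just significant at the 5% level" boundary case.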
The condition number is computed as \f$ \kappa(X^T A X) \f$ during the iteration
immediately <em>preceding</em> convergence (i.e., \f$ A \f$ is computed using
the coefficients of the previous iteration). A large condition number (say, more
than 1000) indicates the presence of significant multicollinearity.


@input

The training data is expected to be of the following form:\n
<pre>{TABLE|VIEW} <em>sourceName</em> (
    ...
    <em>dependentVariable</em> INTEGER,
    <em>numCategories</em> INTEGER,
    <em>independentVariables</em> FLOAT8[],
    ...
)</pre>

@usage
- The number of independent variables cannot exceed 65535.
- Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
  statistics:\n
  <pre>SELECT * FROM \ref mlogregr(
    '<em>sourceName</em>', '<em>dependentVariable</em>', '<em>numCategories</em>', '<em>independentVariables</em>'
    [, <em>numberOfIterations</em> [, '<em>optimizer</em>' [, <em>precision</em> ] ] ]
);</pre>
  Output:
  <pre>coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
-----+----------------+---------+---------+----------+-------------+--------------+---------------
...
</pre>
- Get vector of coefficients \f$ \boldsymbol c \f$:\n
  <pre>SELECT (\ref mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>numCategories</em>', '<em>independentVariables</em>')).coef;</pre>
- Get a subset of the output columns, e.g., only the array of coefficients
  \f$ \boldsymbol c \f$, the log-likelihood \f$ l(\boldsymbol c) \f$, and the
  array of p-values \f$ \boldsymbol p \f$:
  <pre>SELECT coef, log_likelihood, p_values
FROM \ref mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>', '<em>numCategories</em>', '<em>independentVariables</em>');</pre>

Note that the categories are encoded as integers with values from
{0, 1, 2, ..., <em>numCategories</em> - 1}.

@examp

-# Create the sample data set:
@verbatim
sql> SELECT * FROM data;
                     r1                      | val
---------------------------------------------+-----
 {1,3.01789340097457,0.454183579888195}      |   1
 {1,-2.59380532894284,0.602678326424211}     |   0
 {1,-1.30643094424158,0.151587064377964}     |   1
 {1,3.60722299199551,0.963550757616758}      |   1
 {1,-1.52197745628655,0.0782248834148049}    |   1
 {1,-4.8746574902907,0.345104880165309}      |   0
...
@endverbatim
-# Run the multi-logistic regression function:
@verbatim
sql> \x on
Expanded display is on.
sql> SELECT * FROM mlogregr('data', 'val', '2', 'r1', 100, 'irls', 0.001);
-[ RECORD 1 ]--+--------------------------------------------------------------
coef           | {5.59049410898112,2.11077546770772,-0.237276684606453}
log_likelihood | -467.214718489873
std_err        | {0.318943457652178,0.101518723785383,0.294509929481773}
z_stats        | {17.5281667482197,20.7919819024719,-0.805666162169712}
p_values       | {8.73403463417837e-69,5.11539430631541e-96,0.420435365338518}
odds_ratios    | {267.867942976278,8.2546400100702,0.788773016471171}
condition_no   | 179.186118573205
num_iterations | 9
@endverbatim

@literature

A collection of nice write-ups, with valuable pointers into further literature:

[1] Annette J. Dobson: An Introduction to Generalized Linear Models, Second
    Edition. Nov 2001

[2] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 18 November
    2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture-26.pdf

[3] Srikrishna Sridhar, Mark Wellons, Caleb Welton: Multilogistic Regression:
    Notes and References, Jul 12 2012, http://www.cs.wisc.edu/~srikris/mlogit.pdf

[4] Scott A. Czepiel: Maximum Likelihood Estimation of Logistic Regression
    Models: Theory and Implementation, Retrieved Jul 12 2012,
    http://czep.net/stat/mlelr.pdf


@sa File multilogistic.sql_in (documenting the SQL functions)

@internal
@sa Namespace multilogistic (documenting the driver/outer loop implemented in
    Python), Namespace \ref madlib::modules::regress documenting the
    implementation in C++
@endinternal

*/


DROP TYPE IF EXISTS MADLIB_SCHEMA.mlogregr_result;
CREATE TYPE MADLIB_SCHEMA.mlogregr_result AS (
    coef            DOUBLE PRECISION[],
    log_likelihood  DOUBLE PRECISION,
    std_err         DOUBLE PRECISION[],
    z_stats         DOUBLE PRECISION[],
    p_values        DOUBLE PRECISION[],
    odds_ratios     DOUBLE PRECISION[],
    condition_no    DOUBLE PRECISION,
    num_iterations  INTEGER
);


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_irls_step_transition(
    DOUBLE PRECISION[],
    INTEGER,
    INTEGER,
    DOUBLE PRECISION[],
    DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE;


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_irls_step_merge_states(
    state1 DOUBLE PRECISION[],
    state2 DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.mlogregr_irls_step_final(
    state DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION[]
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;


/**
 * @internal
 * @brief Perform one iteration of the iteratively-reweighted-least-squares
 *     method for computing multinomial logistic regression
 */
CREATE AGGREGATE MADLIB_SCHEMA.mlogregr_irls_step(
    /*+ y */ INTEGER,
    /*+ numCategories */ INTEGER,
    /*+ x */ DOUBLE PRECISION[],
    /*+ previous_state */ DOUBLE PRECISION[]) (

    STYPE=DOUBLE PRECISION[],
    SFUNC=MADLIB_SCHEMA.mlogregr_irls_step_transition,
    m4_ifdef(`__GREENPLUM__',`prefunc=MADLIB_SCHEMA.mlogregr_irls_step_merge_states,')
    FINALFUNC=MADLIB_SCHEMA.mlogregr_irls_step_final,
    INITCOND='{0,0,0,0}'
);


CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_mlogregr_irls_step_distance(
    /*+ state1 */ DOUBLE PRECISION[],
    /*+ state2 */ DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION AS
'MODULE_PATHNAME'
LANGUAGE c IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_mlogregr_irls_result(
    /*+ state */ DOUBLE PRECISION[])
RETURNS MADLIB_SCHEMA.mlogregr_result AS
'MODULE_PATHNAME'
LANGUAGE c IMMUTABLE STRICT;


-- We only need to document the last one (unfortunately, in Greenplum we have to
-- use function overloading instead of default arguments).
CREATE FUNCTION MADLIB_SCHEMA.compute_mlogregr(
    "source"            VARCHAR,
    "depvar"            VARCHAR,
    "numcategories"     INTEGER,
    "indepvar"          VARCHAR,
    "maxnumiterations"  INTEGER,
    "optimizer"         VARCHAR,
    "precision"         DOUBLE PRECISION)
RETURNS INTEGER
AS $$PythonFunction(regress, multilogistic, compute_mlogregr)$$
LANGUAGE plpythonu VOLATILE;

/**
 * @brief Compute multinomial logistic-regression coefficients and diagnostic
 *     statistics
 *
 * To include an intercept in the model, set one coordinate in the
 * <tt>independentVariables</tt> array to 1.
 *
 * @param source Name of the source relation containing the training data
 * @param depvar Name of the dependent column (of type INTEGER, with values in
 *     0, ..., <tt>numcategories</tt> - 1)
 * @param numcategories Number of categories of the dependent variable (of type
 *     INTEGER)
 * @param indepvar Name of the independent column (of type DOUBLE
 *     PRECISION[])
 * @param maxnumiterations The maximum number of iterations
 * @param optimizer The optimizer to use (
 *     <tt>'irls'</tt>/<tt>'newton'</tt> for iteratively reweighted least
 *     squares)
 * @param precision The difference between log-likelihood values in successive
 *     iterations that should indicate convergence. Note that a non-positive
 *     value here disables the convergence criterion, and execution will only
 *     stop after \c maxnumiterations iterations.
 *
 * @return A composite value:
 *  - <tt>coef FLOAT8[]</tt> - Array of coefficients, \f$ \boldsymbol c \f$
 *  - <tt>log_likelihood FLOAT8</tt> - Log-likelihood \f$ l(\boldsymbol c) \f$
 *  - <tt>std_err FLOAT8[]</tt> - Array of standard errors,
 *    \f$ \mathit{se}(c_1), \dots, \mathit{se}(c_k) \f$
 *  - <tt>z_stats FLOAT8[]</tt> - Array of Wald z-statistics, \f$ \boldsymbol z \f$
 *  - <tt>p_values FLOAT8[]</tt> - Array of Wald p-values, \f$ \boldsymbol p \f$
 *  - <tt>odds_ratios FLOAT8[]</tt> - Array of odds ratios,
 *    \f$ \mathit{odds}(c_1), \dots, \mathit{odds}(c_k) \f$
 *  - <tt>condition_no FLOAT8</tt> - The condition number of matrix
 *    \f$ X^T A X \f$ during the iteration immediately <em>preceding</em>
 *    convergence (i.e., \f$ A \f$ is computed using the coefficients of the
 *    previous iteration)
 *  - <tt>num_iterations INTEGER</tt> - The number of iterations before the
 *    algorithm terminated
 *
 * @usage
 *  - Get vector of coefficients \f$ \boldsymbol c \f$ and all diagnostic
 *    statistics:\n
 *    <pre>SELECT * FROM mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
 *    '<em>numCategories</em>', '<em>independentVariables</em>');</pre>
 *  - Get vector of coefficients \f$ \boldsymbol c \f$:\n
 *    <pre>SELECT (mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
 *    '<em>numCategories</em>', '<em>independentVariables</em>')).coef;</pre>
 *  - Get a subset of the output columns, e.g., only the array of coefficients
 *    \f$ \boldsymbol c \f$, the log-likelihood \f$ l(\boldsymbol c) \f$, and
 *    the array of p-values \f$ \boldsymbol p \f$:
 *    <pre>SELECT coef, log_likelihood, p_values
 *    FROM mlogregr('<em>sourceName</em>', '<em>dependentVariable</em>',
 *    '<em>numCategories</em>', '<em>independentVariables</em>');</pre>
 *
 * @note This function starts an iterative algorithm. It is not an aggregate
 *     function. Source and column names have to be passed as strings (due to
 *     limitations of the SQL syntax).
 *
 * @internal
 * @sa This function is a wrapper for multilogistic::compute_mlogregr(), which
 *     sets the default values.
 */
CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
    "source"            VARCHAR,
    "depvar"            VARCHAR,
    "numcategories"     INTEGER,
    "indepvar"          VARCHAR,
    "maxnumiterations"  INTEGER /*+ DEFAULT 20 */,
    "optimizer"         VARCHAR /*+ DEFAULT 'irls' */,
    "precision"         DOUBLE PRECISION /*+ DEFAULT 0.0001 */)
RETURNS MADLIB_SCHEMA.mlogregr_result AS $$
DECLARE
    observed_count INTEGER;
    theIteration INTEGER;
    fnName VARCHAR;
    theResult MADLIB_SCHEMA.mlogregr_result;
BEGIN
    IF (SELECT atttypid::regtype <> 'INTEGER'::regtype
        FROM pg_attribute
        WHERE attrelid = source::regclass AND attname = depvar) THEN
        RAISE EXCEPTION 'The dependent variable column should be of type INTEGER';
    END IF;

    EXECUTE $sql$ SELECT count(DISTINCT $sql$ || depvar || $sql$ )
                  FROM $sql$ || textin(regclassout(source))
        INTO observed_count;
    IF observed_count <> numcategories
    THEN
        RAISE WARNING 'Results will be undefined if ''numcategories'' is not the
        same as the number of distinct categories observed in the training data.';
    END IF;

    IF optimizer = 'irls' OR optimizer = 'newton' THEN
        fnName := 'internal_mlogregr_irls_result';
    ELSE
        RAISE EXCEPTION 'Unknown optimizer (''%'')', optimizer;
    END IF;

    theIteration := (
        SELECT MADLIB_SCHEMA.compute_mlogregr($1, $2, $3, $4, $5, $6, $7)
    );
    -- Because of Greenplum bug MPP-10050, we have to use dynamic SQL (using
    -- EXECUTE) in the following
    -- Because of Greenplum bug MPP-6731, we have to hide the tuple-returning
    -- function in a subquery
    EXECUTE
        $sql$
        SELECT (result).*
        FROM (
            SELECT
                MADLIB_SCHEMA.$sql$ || fnName || $sql$(_madlib_state) AS result
            FROM _madlib_iterative_alg
            WHERE _madlib_iteration = $sql$ || theIteration || $sql$
        ) subq
        $sql$
        INTO theResult;
    -- The number of iterations is not updated in the C++ code. We do it here.
    IF NOT (theResult IS NULL) THEN
        theResult.num_iterations = theIteration;
    END IF;
    RETURN theResult;
END;
$$ LANGUAGE plpgsql VOLATILE;


CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
    "source"         VARCHAR,
    "depvar"         VARCHAR,
    "numcategories"  INTEGER,
    "indepvar"       VARCHAR)
RETURNS MADLIB_SCHEMA.mlogregr_result AS
$$SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, 20, 'irls', 0.0001);$$
LANGUAGE sql VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
    "source"            VARCHAR,
    "depvar"            VARCHAR,
    "numcategories"     INTEGER,
    "indepvar"          VARCHAR,
    "maxnumiterations"  INTEGER)
RETURNS MADLIB_SCHEMA.mlogregr_result AS
$$SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, $5, 'irls', 0.0001);$$
LANGUAGE sql VOLATILE;

CREATE FUNCTION MADLIB_SCHEMA.mlogregr(
    "source"            VARCHAR,
    "depvar"            VARCHAR,
    "numcategories"     INTEGER,
    "indepvar"          VARCHAR,
    "maxnumiterations"  INTEGER,
    "optimizer"         VARCHAR)
RETURNS MADLIB_SCHEMA.mlogregr_result AS
$$SELECT MADLIB_SCHEMA.mlogregr($1, $2, $3, $4, $5, $6, 0.0001);$$
LANGUAGE sql VOLATILE;
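-- The driver above repeatedly invokes mlogregr_irls_step until the
-- log-likelihood stops improving. For readers unfamiliar with IRLS, the shape
-- of one such iteration can be sketched outside the database. The Python below
-- is an illustrative re-implementation for the two-category case only (the
-- function name irls_logregr is hypothetical; the real per-iteration work is
-- done by MADlib's C++ transition/final functions, and the multinomial case
-- generalizes this with one coefficient vector per non-reference category):

```python
import numpy as np

def sigma(t):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

def irls_logregr(X, y, max_iter=20, precision=1e-4):
    """Fit two-category logistic regression by Newton/IRLS.

    X: (n, k) design matrix (include a column of ones for an intercept,
       as the documentation above suggests); y: (n,) array of 0/1 labels.
    Returns the coefficient vector and the number of iterations performed."""
    n, k = X.shape
    c = np.zeros(k)
    ll_old = -np.inf
    for it in range(1, max_iter + 1):
        mu = sigma(X @ c)                # E[Y | x] under the current c
        a = mu * (1.0 - mu)              # diagonal entries a_i of A
        H = X.T @ (a[:, None] * X)       # X^T A X  (the negated Hessian)
        g = X.T @ (y - mu)               # gradient of the log-likelihood
        c = c + np.linalg.solve(H, g)    # one Newton/IRLS step
        # log-likelihood l(c), using the +/-1 sign convention from the docs
        ll = np.sum(np.log(sigma((2.0 * y - 1.0) * (X @ c))))
        if abs(ll - ll_old) < precision: # converge on log-likelihood change
            break
        ll_old = ll
    return c, it
```

-- Note the parallel to the SQL driver: `precision` plays the role of the
-- mlogregr `precision` argument (a non-positive value would likewise disable
-- the convergence test), and `max_iter` corresponds to `maxnumiterations`.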