MADlib 1.3 User Documentation
A marginal effect (ME) or partial effect measures the effect on the conditional mean of \( y \) of a change in one of the regressors, say \(X_k\). In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, specialized algorithms are required for calculating ME.
MADlib provides marginal effects regression functions for logistic and multinomial logistic regressions.
margins_logregr( source_table,
                 output_table,
                 dependent_variable,
                 independent_variable,
                 grouping_cols,
                 marginal_variables,
                 max_iter,
                 optimizer,
                 tolerance,
                 verbose_mode
               )

The function produces an output table with the following columns:
| Column | Type and Description |
|---|---|
| margins | DOUBLE PRECISION[]. The marginal effects. |
| coef | DOUBLE PRECISION[]. An array of the regression coefficients. |
| std_err | DOUBLE PRECISION[]. An array of the standard errors, computed using the delta method. |
| t_stats | DOUBLE PRECISION[]. An array of the t-statistics of the marginal effects. |
| p_values | DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects. |
margins_mlogregr( source_table,
                  output_table,
                  dependent_variable,
                  independent_variable,
                  ref_category,
                  grouping_cols,
                  marginal_variables,
                  max_iter,
                  optimizer,
                  tolerance,
                  verbose_mode
                )

The function produces an output table with the following columns:
| Column | Type and Description |
|---|---|
| margins | DOUBLE PRECISION[]. The marginal effects. |
| coef | DOUBLE PRECISION[]. An array of coefficients, \( \boldsymbol c \). |
| std_err | DOUBLE PRECISION[]. An array of the standard errors, computed using the delta method. |
| t_stats | DOUBLE PRECISION[]. An array of the t-statistics of the marginal effects. |
| p_values | DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects. |
View the function usage message:

SELECT madlib.margins_logregr();
View the source data:

SELECT * FROM patients;

Result:

 id | second_attack | treatment | trait_anxiety
----+---------------+-----------+---------------
  1 |             1 |         1 |            70
  3 |             1 |         1 |            50
  5 |             1 |         0 |            40
  7 |             1 |         0 |            75
  9 |             1 |         0 |            70
 11 |             0 |         1 |            65
 13 |             0 |         1 |            45
 15 |             0 |         1 |            40
 17 |             0 |         0 |            55
...
SELECT madlib.margins_logregr( 'patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]' );
\x on
SELECT * FROM result_table;

Result:

margins  | {-0.970665392796,-0.156214190168,0.0181587690137}
coef     | {-6.36346994178179,-1.02410605239327,0.119044916668605}
std_err  | {0.802871454422,0.292691682191,0.0137459874022}
t_stats  | {-1.2089922832,-0.533715850748,1.32102325446}
p_values | {0.243212810329,0.600447858606,0.204000202116}
To compute marginal effects for only a subset of the independent variables, pass their indices in marginal_variables:

SELECT madlib.margins_logregr( 'patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]', NULL, ARRAY[1] );
SELECT * FROM result_table;

Result:

margins  | {-0.970665392796}
coef     | {-6.36346994178179}
std_err  | {0.802871454422}
t_stats  | {-1.2089922832}
p_values | {0.243212810329}
View the function usage message:

SELECT madlib.margins_mlogregr();
View the source data:

SELECT * FROM test_data;

Result:

 id | feature_1 | feature_2 | category
----+-----------+-----------+----------
  1 |         1 |        35 |        1
  3 |         2 |        33 |        0
  5 |         3 |        39 |        1
  7 |         1 |        37 |        1
  9 |         2 |        31 |        1
 11 |         3 |        36 |        1
 13 |         2 |        36 |        1
 15 |         2 |        36 |        0
 17 |         2 |        31 |        5
...
The usage message can also be requested explicitly:

SELECT madlib.margins_mlogregr('help');
SELECT madlib.margins_mlogregr( 'test_data', 'result_table', 'category', 'ARRAY[1, feature_1, feature_2]' );
SELECT * FROM result_table;

Result:

margins  | {0.741613239156,-0.032868883552,-0.0144502990691,-0.972055011831,0.112337273885,0.0172621628253}
std_err  | {0.183172236055,0.044184899499,0.00332608999704,0.263532615748,0.0555196094594,0.00457999429836}
t_stats  | {4.04872078394,-0.743894043547,-4.34453038911,-3.68855676202,2.02338011702,3.76903587663}
p_values | {7.43784735554e-05,0.457840607871,2.24855476205e-05,0.000292799037776,0.0444060346517,0.000217384008015}
To compute marginal effects for only a subset of the independent variables, pass their indices in marginal_variables (here with reference category 0):

SELECT madlib.margins_mlogregr( 'test_data', 'result_table', 'category', 'ARRAY[1, feature_1, feature_2]', 0, NULL, ARRAY[1] );
SELECT * FROM result_table;

Result:

margins  | {0.741613239156,-0.972055011831}
std_err  | {0.183172236055,0.263532615748}
t_stats  | {4.04872078394,-3.68855676202}
p_values | {7.43784735554e-05,0.000292799037776}
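For the multinomial model, the category probabilities follow the softmax form, and the standard result for the marginal effect of regressor \(k\) on category \(j\) is \( P_j \bigl( \beta_{jk} - \sum_l P_l \beta_{lk} \bigr) \), with \( \beta_{0k} = 0 \) for the reference category. The following Python sketch illustrates that formula with made-up coefficients; it is not MADlib's implementation:

```python
import math

def category_probs(betas, x):
    # Softmax probabilities for one observation; reference category 0
    # has a fixed linear score of 0.
    scores = [0.0] + [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_effect(betas, x, j, k):
    # dP_j/dx_k = P_j * (beta_{j,k} - sum_l P_l * beta_{l,k}),
    # with beta_{0,k} = 0 for the reference category.
    P = category_probs(betas, x)
    coef_k = [0.0] + [beta[k] for beta in betas]
    avg = sum(p * c for p, c in zip(P, coef_k))
    return P[j] * (coef_k[j] - avg)

# Illustrative coefficients for two non-reference categories,
# each over [constant, feature_1, feature_2].
betas = [[0.5, -0.2, 0.03], [-1.0, 0.4, 0.01]]
x = [1.0, 2.0, 35.0]

effects = [marginal_effect(betas, x, j, 2) for j in range(3)]
# Since probabilities sum to 1, the marginal effects across all
# categories sum to zero.
assert abs(sum(effects)) < 1e-12
```

The sum-to-zero property is a useful sanity check: raising the probability of one category necessarily lowers the others.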
Logistic Regression Notes
The standard approach to modeling dichotomous/binary variables (so \(y \in \{0, 1\}\)) is to estimate a generalized linear model under the assumption that \( y \) follows some form of Bernoulli distribution. The expected value of \( y \) then becomes
\[ E[y] = G(X' \beta), \]
where \( G \) is the inverse link function mapping the linear predictor to a probability. For logistic regression, \( G \) is the inverse logit function, i.e. the logistic sigmoid.
In logistic regression:
\[ P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_j x_j)}} = \frac{1}{1 + e^{-z}} \implies \frac{\partial P}{\partial x_k} = \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \beta_k \cdot P \cdot (1-P) \]
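The closed-form derivative \( \beta_k P (1-P) \) is easy to verify numerically. A minimal Python sketch, using illustrative coefficient values (not output from MADlib), compares it against a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative coefficients and one observation; values are made up.
beta = [-6.36, -1.02, 0.119]   # beta_0 (intercept), beta_1, beta_2
x = [1.0, 1.0, 60.0]           # constant term, x_1, x_2

z = sum(b * xi for b, xi in zip(beta, x))
P = sigmoid(z)

# Closed-form marginal effect of x_2: beta_2 * P * (1 - P)
me_closed = beta[2] * P * (1.0 - P)

# Numerical check via central finite difference on P w.r.t. x_2
h = 1e-6
x_hi = [x[0], x[1], x[2] + h]
x_lo = [x[0], x[1], x[2] - h]
p_hi = sigmoid(sum(b * xi for b, xi in zip(beta, x_hi)))
p_lo = sigmoid(sum(b * xi for b, xi in zip(beta, x_lo)))
me_numeric = (p_hi - p_lo) / (2.0 * h)

assert abs(me_closed - me_numeric) < 1e-8
```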
There are several methods for calculating the marginal effects for dichotomous dependent variables. This package uses the average of the marginal effects at every sample observation.
This is calculated as follows:
\[ \frac{\partial y}{\partial x_k} = \beta_k \, \frac{\sum_{i=1}^n P(y_i = 1)\bigl(1-P(y_i = 1)\bigr)}{n}, \qquad \text{where } P(y_i=1) = G\bigl(X^{(i)} \beta\bigr) \]
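Averaging \( P(1-P) \) over the observations, as in the formula above, can be sketched in a few lines of Python. The coefficients and data here are illustrative, not the fitted values from the examples:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative fitted coefficients: intercept, treatment, trait_anxiety.
beta = [-6.36, -1.02, 0.119]

# A few sample observations: [1 (constant), treatment, trait_anxiety].
X = [
    [1.0, 1.0, 70.0],
    [1.0, 1.0, 50.0],
    [1.0, 0.0, 40.0],
    [1.0, 0.0, 75.0],
]

def average_marginal_effect(beta, X, k):
    """Average of beta_k * P_i * (1 - P_i) over all observations i."""
    weights = []
    for x in X:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        weights.append(p * (1.0 - p))
    return beta[k] * sum(weights) / len(weights)

ame_anxiety = average_marginal_effect(beta, X, 2)
```

Because \( P(1-P) \le 0.25 \), the average marginal effect is always bounded by \( 0.25\,\beta_k \) in magnitude.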
We use the delta method for calculating standard errors on the marginal effects.
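In outline, the delta method treats the marginal effects as a smooth function of the estimated coefficients and propagates the coefficient covariance matrix through a first-order Taylor expansion:

\[ \widehat{\mathrm{Var}}\bigl(\widehat{ME}\bigr) \approx \frac{\partial \widehat{ME}}{\partial \hat{\beta}} \; \widehat{\mathrm{Var}}\bigl(\hat{\beta}\bigr) \; \left( \frac{\partial \widehat{ME}}{\partial \hat{\beta}} \right)^T \]

The square roots of the diagonal entries give the reported standard errors.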
Literature

[1] mfx function in Stata: http://www.stata.com/help.cgi?mfx_option

Related Topics

File marginal.sql_in documenting the SQL functions.