User Documentation
 All Files Functions Variables Groups

A marginal effect (ME) or partial effect measures the effect on the conditional mean of \( y \) of a change in one of the regressors, say \(X_k\). In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, specialized algorithms are required for calculating ME.

MADlib provides marginal effects regression functions for logistic and multinomial logistic regressions.

Logistic Regression Training Function
margins_logregr( source_table, 
                 output_table, 
                 dependent_variable, 
                 independent_variable,
                 grouping_cols,  
                 marginal_variables,  
                 max_iter,
                 optimizer,  
                 tolerance, 
                 verbose_mode
               )
Arguments
source_table
VARCHAR. Name of data table.
output_table
VARCHAR. Name of result table. Overwrites if it exists. The output table has the following columns.
margins DOUBLE PRECISION[]. The marginal effects.
std_err DOUBLE PRECISION[]. An array of the standard errors, using the delta method.
t_stats DOUBLE PRECISION[]. An array of the t-stats of the marginal effects.
p_values DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.
dependent_variable
VARCHAR. Name of column for dependent variables.
independent_variable
VARCHAR. Name of column for independent variables. Can be any SQL expression that evaluates to an array.
grouping_cols (optional)
VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.
margins_vars (optional)
INTEGER[], default: NULL. Index list (base 1) representing the independent variables to compute marginal effects on. When NULL, computes marginal effects on all variables.
max_iter (optional)
INTEGER, default: 20. Maximum number of iterations for the logistic regression.
optimizer (optional)
VARCHAR, default: 'irls'. Optimizer to use for the logistic regression: newton/irls, cg, or igd.
tolerance (optional)
DOUBLE PRECISION, default: 1e-4. Termination criterion for logistic regression (relative).
verbose_mode (optional)
BOOLEAN, default FALSE. When TRUE, provides verbose output of the results of training.

anchor train_mlogregr

Multinomial Logistic Regression Training Function
margins_mlogregr( source_table, 
                  output_table, 
                  dependent_variable, 
                  independent_variable,
                  ref_category,
                  grouping_cols,  
                  marginal_variables,  
                  max_iter,
                  optimizer,  
                  tolerance, 
                  verbose_mode
                )
Arguments
source_table
VARCHAR. Name of data table.
output_table
VARCHAR. Name of result table. Overwrites if it exists. The output table has the following columns.
margins DOUBLE PRECISION[]. The marginal effects.
coef DOUBLE PRECISION[]. An array of coefficients, \( \boldsymbol c \).
std_err DOUBLE PRECISION[]. An array of the standard errors, using the delta method.
t_stats DOUBLE PRECISION[]. An array of the t-stats of the marginal effects.
p_values DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.
dependent_variable
VARCHAR. Name of column for dependent variables.
independent_variable
VARCHAR. Name of column for independent variables. Can be any SQL expression that evaluates to an array.
ref_category (optional)
INTEGER, default: 0. Reference category for the multinomial logistic regression.
grouping_cols (optional)
VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.
margins_vars (optional)
INTEGER[], default: NULL. Index list (base 1) representing the independent variables to compute marginal effects on. When NULL, computes marginal effects on all variables.
max_iter (optional)
INTEGER, default: 20. Maximum number of iterations for the logistic regression.
optimizer (optional)
VARCHAR, default: 'irls'. Optimizer to use for the logistic regression: newton/irls, cg, or igd.
tolerance (optional)
DOUBLE PRECISION, default: 1e-4. Termination criterion for logistic regression (relative).
verbose_mode (optional)
BOOLEAN, default FALSE. When TRUE, provides verbose output of the results of training.

Examples
  1. View online help for the marginal effects logistic regression function.
    SELECT madlib.margins_logregr();
    
  2. Create the sample data set.
    SELECT * FROM data;
    
    Result:
     id | second_attack | treatment | trait_anxiety
    ----+---------------+-----------+---------------
      1 |             1 |         1 |            70
      3 |             1 |         1 |            50
      5 |             1 |         0 |            40
      7 |             1 |         0 |            75
      9 |             1 |         0 |            70
     11 |             0 |         1 |            65
     13 |             0 |         1 |            45
     15 |             0 |         1 |            40
     17 |             0 |         0 |            55
    ...
    
  3. Run the logistic regression function and then compute the marginal effects of all variables in the regression.
    SELECT madlib.margins_logregr( 'patients', 
                                   'result_table', 
                                   'second_attack', 
                                   'ARRAY[1, treatment, trait_anxiety]'
                                );
    
  4. View the regression results.
    \x ON
    SELECT * FROM result_table;
    
    Result:
    margins  | {-0.970665392796,-0.156214190168,0.0181587690137}
    coef     | {-6.36346994178179,-1.02410605239327,0.119044916668605}
    std_err  | {0.802871454422,0.292691682191,0.0137459874022}
    t_stats  | {-1.2089922832,-0.533715850748,1.32102325446}
    p_values | {0.243212810329,0.600447858606,0.204000202116}
    
  5. Run the logistic regression function and then compute the marginal effects of the first variable in the regression.
    SELECT madlib.margins_logregr( 'patients', 
                                   'result_table', 
                                   'second_attack', 
                                   'ARRAY[1, treatment, trait_anxiety]', 
                                   NULL, 
                                   ARRAY[1]
                                 );
    SELECT * FROM result_table;
    
    Result:
    margins  | {-0.970665392796}
    coef     | {-6.36346994178179}
    std_err  | {0.802871454422}
    t_stats  | {-1.2089922832}
    p_values | {0.243212810329}
    
  6. View online help for marginal effects multinomial logistic regression.
    SELECT madlib.margins_mlogregr();
    
  7. Create a sample data set for multinomial logistic regression. (The full dataset has three categories.)
    SELECT * FROM data;
    
    Result:
     id |  feature_1    | feature_2 | category     
    ----+---------------+-----------+---------------
      1 |             1 |        35 |            1
      3 |             2 |        33 |            0
      5 |             3 |        39 |            1
      7 |             1 |        37 |            1
      9 |             2 |        31 |            1
     11 |             3 |        36 |            1
     13 |             2 |        36 |            1
     15 |             2 |        36 |            0
     17 |             2 |        31 |            5
    ...
    
  8. Get help for the marginal effects multinomial logistic regression function.
    SELECT margins_mlogregr('help');
    
  9. Run the regression function and then compute the marginal effects of all variables in the regression.
    SELECT madlib.margins_mlogregr( 'test_data', 
                                    'result_table', 
                                    'category', 
                                    'ARRAY[1, feature_1, feature_2]'
                                  );
    SELECT * FROM result_table;
    
    Result:
    margins  | {0.741613239156,-0.032868883552,-0.0144502990691,-0.972055011831,0.112337273885,0.0172621628253}
    std_err  | {0.183172236055,0.044184899499,0.00332608999704,0.263532615748,0.0555196094594,0.00457999429836}
    t_stats  | {4.04872078394,-0.743894043547,-4.34453038911,-3.68855676202,2.02338011702,3.76903587663}
    p_values | {7.43784735554e-05,0.457840607871,2.24855476205e-05,0.000292799037776,0.0444060346517,0.000217384008015}
    
  10. Run the regression and compute the marginal effects for the first dependent variable (all categories).
    SELECT madlib.margins_mlogregr( 'test_data', 
                                    'result_table', 
                                    'category', 
                                    'ARRAY[1, feature_1, feature_2]', 
                                    0, 
                                    NULL, 
                                    ARRAY[1]
                                  );
    SELECT * FROM result_table;
    
    Result:
    margins  | {0.741613239156,-0.972055011831}
    std_err  | {0.183172236055,0.263532615748}
    t_stats  | {4.04872078394,-3.68855676202}
    p_values | {7.43784735554e-05,0.000292799037776}
    

Notes

Multinomial-Logistic Regression Notes

Technical Background

The standard approach to modeling dichotomous/binary variables (so \(y \in \{0, 1\} \)) is to estimate a generalized linear model under the assumption that \( y \) follows some form of Bernoulli distribution. Thus the expected value of \( y \) becomes,

\[ y = G(X' \beta), \]

where G is the specified binomial distribution. For logistic regression, the function \( G \) represents the inverse logit function.

In logistic regression:

\[ P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots \beta_j x_j)}} = \frac{1}{1 + e^{-z}} \implies \frac{\partial P}{\partial X_k} = \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ = \beta_k \cdot P \cdot (1-P) \]

There are several methods for calculating the marginal effects for dichotomous dependent variables. This package uses the average of the marginal effects at every sample observation.

This is calculated as follows:

\[ \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^n P(y_i = 1)(1-P(y_i = 1))}{n}, \\ \text{where}, P(y_i=1) = g(X^{(i)}\beta) \]

We use the delta method for calculating standard errors on the marginal effects.

Literature

[1] mfx function in STATA: http://www.stata.com/help.cgi?mfx_option

Related Topics

File marginal.sql_in documenting the SQL functions.