Contents

A marginal effect (ME) or partial effect measures the effect on the conditional mean of \( y \) of a change in one of the regressors, say \(X_k\). In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, specialized algorithms are required for calculating ME.

MADlib provides marginal effects regression functions for logistic and multinomial logistic regressions.

Logistic Regression Training Function

margins_logregr( source_table, 
                 output_table, 
                 dependent_variable, 
                 independent_variable,
                 grouping_cols,  
                 marginal_variables,  
                 max_iter,
                 optimizer,  
                 tolerance, 
                 verbose_mode
               )

Arguments

source_table

VARCHAR. Name of data table.

output_table

VARCHAR. Name of result table. Overwrites if it exists. The output table has the following columns.

margins	DOUBLE PRECISION[]. The marginal effects.
std_err	DOUBLE PRECISION[]. An array of the standard errors, using the delta method.
t_stats	DOUBLE PRECISION[]. An array of the t-stats of the marginal effects.
p_values	DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.

dependent_variable

VARCHAR. Name of column for dependent variables.

independent_variable

VARCHAR. Name of column for independent variables. Can be any SQL expression that evaluates to an array.

grouping_cols (optional)

VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.

margins_vars (optional)

INTEGER[], default: NULL. Index list (base 1) representing the independent variables to compute marginal effects on. When NULL, computes marginal effects on all variables.

max_iter (optional)

INTEGER, default: 20. Maximum number of iterations for the logistic regression.

optimizer (optional)

VARCHAR, default: 'irls'. Optimizer to use for the logistic regression: newton/irls, cg, or igd.

tolerance (optional)

DOUBLE PRECISION, default: 1e-4. Termination criterion for logistic regression (relative).

verbose_mode (optional)

BOOLEAN, default FALSE. When TRUE, provides verbose output of the results of training.

anchor train_mlogregr

Multinomial Logistic Regression Training Function

margins_mlogregr( source_table, 
                  output_table, 
                  dependent_variable, 
                  independent_variable,
                  ref_category,
                  grouping_cols,  
                  marginal_variables,  
                  max_iter,
                  optimizer,  
                  tolerance, 
                  verbose_mode
                )

Arguments

source_table

VARCHAR. Name of data table.

output_table

VARCHAR. Name of result table. Overwrites if it exists. The output table has the following columns.

margins	DOUBLE PRECISION[]. The marginal effects.
coef	DOUBLE PRECISION[]. An array of coefficients, \( \boldsymbol c \).
std_err	DOUBLE PRECISION[]. An array of the standard errors, using the delta method.
t_stats	DOUBLE PRECISION[]. An array of the t-stats of the marginal effects.
p_values	DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.

dependent_variable

VARCHAR. Name of column for dependent variables.

independent_variable

VARCHAR. Name of column for independent variables. Can be any SQL expression that evaluates to an array.

ref_category (optional)

INTEGER, default: 0. Reference category for the multinomial logistic regression.

grouping_cols (optional)

VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.

margins_vars (optional)

INTEGER[], default: NULL. Index list (base 1) representing the independent variables to compute marginal effects on. When NULL, computes marginal effects on all variables.

max_iter (optional)

INTEGER, default: 20. Maximum number of iterations for the logistic regression.

optimizer (optional)

VARCHAR, default: 'irls'. Optimizer to use for the logistic regression: newton/irls, cg, or igd.

tolerance (optional)

DOUBLE PRECISION, default: 1e-4. Termination criterion for logistic regression (relative).

verbose_mode (optional)

BOOLEAN, default FALSE. When TRUE, provides verbose output of the results of training.

Examples

View online help for the marginal effects logistic regression function.
```
SELECT madlib.margins_logregr();
```

Create the sample data set.

SELECT * FROM data;

Result:

 id | second_attack | treatment | trait_anxiety
----+---------------+-----------+---------------
  1 |             1 |         1 |            70
  3 |             1 |         1 |            50
  5 |             1 |         0 |            40
  7 |             1 |         0 |            75
  9 |             1 |         0 |            70
 11 |             0 |         1 |            65
 13 |             0 |         1 |            45
 15 |             0 |         1 |            40
 17 |             0 |         0 |            55
...

Run the logistic regression function and then compute the marginal effects of all variables in the regression.

SELECT madlib.margins_logregr( 'patients', 
                               'result_table', 
                               'second_attack', 
                               'ARRAY[1, treatment, trait_anxiety]'
                            );

View the regression results.

\x ON
SELECT * FROM result_table;

Result:

margins  | {-0.970665392796,-0.156214190168,0.0181587690137}
coef     | {-6.36346994178179,-1.02410605239327,0.119044916668605}
std_err  | {0.802871454422,0.292691682191,0.0137459874022}
t_stats  | {-1.2089922832,-0.533715850748,1.32102325446}
p_values | {0.243212810329,0.600447858606,0.204000202116}

Run the logistic regression function and then compute the marginal effects of the first variable in the regression.

SELECT madlib.margins_logregr( 'patients', 
                               'result_table', 
                               'second_attack', 
                               'ARRAY[1, treatment, trait_anxiety]', 
                               NULL, 
                               ARRAY[1]
                             );
SELECT * FROM result_table;

Result:

margins  | {-0.970665392796}
coef     | {-6.36346994178179}
std_err  | {0.802871454422}
t_stats  | {-1.2089922832}
p_values | {0.243212810329}

View online help for marginal effects multinomial logistic regression.
```
SELECT madlib.margins_mlogregr();
```

Create a sample data set for multinomial logistic regression. (The full dataset has three categories.)

SELECT * FROM data;

Result:

 id |  feature_1    | feature_2 | category     
----+---------------+-----------+---------------
  1 |             1 |        35 |            1
  3 |             2 |        33 |            0
  5 |             3 |        39 |            1
  7 |             1 |        37 |            1
  9 |             2 |        31 |            1
 11 |             3 |        36 |            1
 13 |             2 |        36 |            1
 15 |             2 |        36 |            0
 17 |             2 |        31 |            5
...

Get help for the marginal effects multinomial logistic regression function.
```
SELECT margins_mlogregr('help');
```

Run the regression function and then compute the marginal effects of all variables in the regression.

SELECT madlib.margins_mlogregr( 'test_data', 
                                'result_table', 
                                'category', 
                                'ARRAY[1, feature_1, feature_2]'
                              );
SELECT * FROM result_table;

Result:

margins  | {0.741613239156,-0.032868883552,-0.0144502990691,-0.972055011831,0.112337273885,0.0172621628253}
std_err  | {0.183172236055,0.044184899499,0.00332608999704,0.263532615748,0.0555196094594,0.00457999429836}
t_stats  | {4.04872078394,-0.743894043547,-4.34453038911,-3.68855676202,2.02338011702,3.76903587663}
p_values | {7.43784735554e-05,0.457840607871,2.24855476205e-05,0.000292799037776,0.0444060346517,0.000217384008015}

Run the regression and compute the marginal effects for the first dependent variable (all categories).

SELECT madlib.margins_mlogregr( 'test_data', 
                                'result_table', 
                                'category', 
                                'ARRAY[1, feature_1, feature_2]', 
                                0, 
                                NULL, 
                                ARRAY[1]
                              );
SELECT * FROM result_table;

Result:

margins  | {0.741613239156,-0.972055011831}
std_err  | {0.183172236055,0.263532615748}
t_stats  | {4.04872078394,-3.68855676202}
p_values | {7.43784735554e-05,0.000292799037776}

Notes

Multinomial-Logistic Regression Notes

The reference category ranges from [0, numCategories-1]. The default reference category is zero.
The marginal effects and supporting statistics are output in the following order. For a problem with K dependent variables (1, ..., K) and J categories (0,..., J-1). The output is \( \{ m_{k_1, j_0} \ldots m_{k_K, j_{0}}, m_{k_1, j_1} \ldots m_{k_K, j_{1}}, m_{k_1, j_{J-1}} \ldots m_{k_K, j_{J-1}} \} \). The order is NOT CONSISTENT with mlogregr, robust_variance_mlogregr and clustered_variance_mlogregr. This is deliberate; the interfaces of the other functions will be changed to match that used in the marginal effects function.
Selectively picking the marginal effects of the independent variables will return the marginal effects of all categories associated with the independent variable.
The independent_varname argument can be the name of a column that contains an array of numeric values. It can also be a string with the format 'ARRAY[1, x1, x2, x3]', where x1, x2, and x3 are column names.
The vars argument is an index list (base 1) with the set of indices amongst the independent variables that must be selected. Use NULL to use all independent variables.

Technical Background

The standard approach to modeling dichotomous/binary variables (so \(y \in \{0, 1\} \)) is to estimate a generalized linear model under the assumption that \( y \) follows some form of Bernoulli distribution. Thus the expected value of \( y \) becomes,

\[ y = G(X' \beta), \]

where G is the specified binomial distribution. For logistic regression, the function \( G \) represents the inverse logit function.

In logistic regression:

\[ P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots \beta_j x_j)}} = \frac{1}{1 + e^{-z}} \implies \frac{\partial P}{\partial X_k} = \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ = \beta_k \cdot P \cdot (1-P) \]

There are several methods for calculating the marginal effects for dichotomous dependent variables. This package uses the average of the marginal effects at every sample observation.

This is calculated as follows:

\[ \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^n P(y_i = 1)(1-P(y_i = 1))}{n}, \\ \text{where}, P(y_i=1) = g(X^{(i)}\beta) \]

We use the delta method for calculating standard errors on the marginal effects.

Literature

[1] mfx function in STATA: http://www.stata.com/help.cgi?mfx_option

Related Topics

File marginal.sql_in documenting the SQL functions.