Collaboration diagram for Marginal Effects:

About:

A marginal effect (ME) or partial effect measures the effect on the conditional mean of \( y \) of a change in one of the regressors, say \(X_k\). In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, we require specialized algorithms for calculating ME.

The standard approach to modeling dichotomous/binary variables (so \(y \in \{0, 1\} \)) is to estimate a generalized linear model under the assumption that \( y \) follows some form of Bernoulli distribution. Thus the expected value of \( y \) becomes,

\[ y = G(X' \beta), \]

where G is the specified binomial distribution. For logistic regression, the funciton \( G \) represents the inverse logit function.

In logistic regression:

\[ P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots \beta_j x_j)}} = \frac{1}{1 + e^{-z}} \implies \frac{\partial P}{\partial X_k} = \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ = \beta_k \cdot P \cdot (1-P) \]

There are several methods for calculating the marginal effects for dichotomous dependent variables. This package uses the average of the marginal effects at every sample observation.

This is calculated as follows:

\[ \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^n P(y_i = 1)(1-P(y_i = 1))}{n}, \\ \text{where}, P(y_i=1) = g(X^{(i)}\beta) \]

We use the delta method for calculating standard errors on the marginal effects.

Input:

The input parameters are expected to be of the following form:

 margins_logregr (
    sourceTable VARCHAR,
    outputTable VARCHAR,
    dependentVariable VARCHAR,
    independentVariable VARCHAR,
    [groupingCol VARCHAR,
    marginal_vars INTEGER[],
    max_iter INTEGER,
    optimizer VARCHAR,
    tolerance DOUBLE PRECISION
    ]
)

For multinomial logistic regression the input parameters are expected to be of the following form:

 margins_mlogregr (
    sourceTable VARCHAR,
    outputTable VARCHAR,
    dependentVariable VARCHAR,
    independentVariable VARCHAR,
    [groupingCol VARCHAR,
    referenceCategory INTEGER,
    marginal_vars INTEGER[],
    max_iter INTEGER,
    optimizer VARCHAR,
    tolerance DOUBLE PRECISION
    ]
)

Warning: The 'groupingCol' input parameter for margins_logregr and margins_mlogregr is a placeholder in the Madlib V1.0. These inputs will be implemented in a future release.

Usage:

The Interface

For logistic regression, one can call the following function

SELECT madlib.margins_logregr(
    'source_table',              -- name of input table, VARCHAR
    'out_table',                 -- name of output table, VARCHAR
    'dependent_varname',         -- dependent variable, VARCHAR
    'independent_varname',       -- independent variable, VARCHAR
    [ 'grouping_cols',           -- comma separated list of grouping vars, VARCHAR (Default NULL)
    'marginal_effect_variables', -- Index list (base 1) with positions representing which marginal variable to calculate, INTEGER[] (Default NULL)
     max_iter,                   -- Maximum number of iterations to run the logistic regression INTEGER (Default 20)
    'optimizer',                 -- Optimizer used for logistic regression VARCHAR (Default & recommended 'irls')
    tolerance,                   -- Tolerance for the logistic regression DOUBLE PRECISION (default 1e-4)
    ]
);

For multinomial logistic regression, one can call the following function

SELECT madlib.(
    'source_table',              -- name of input table, VARCHAR
    'out_table',                 -- name of output table, VARCHAR
    'dependent_varname',         -- dependent variable, VARCHAR
    'independent_varname',       -- independent variable, VARCHAR
    [ 'reference_category',      -- Reference category for multinomial logistic regression INTEGER
     'grouping_cols',           -- comma separated list of grouping vars, VARCHAR (Default NULL)
    'marginal_effect_variables', -- Index list (base 1) with positions representing which marginal variable to calculate, INTEGER[] (Default NULL)
     max_iter,                   -- Maximum number of iterations to run the logistic regression INTEGER (Default 20)
    'optimizer',                 -- Optimizer used for logistic regression VARCHAR (Default & recommended 'irls')
    tolerance,                   -- Tolerance for the logistic regression DOUBLE PRECISION (default 1e-4)
    ]
);

Output is stored in the out_table:

[ margins | std_err | t_stats | p_values |
+------+---------+-------+----------+

Multinomial-Logistic Regression Notes

The reference category ranges from [0, numCategories-1]. The default reference category is zero.
The marginal effects and supporting statistics are output in the following order. For a problem with K dependent variables (1, ..., K) and J categories (0,..., J-1). The output is \( \{ m_{k_1, j_0} \ldots m_{k_K, j_{0}}, m_{k_1, j_1} \ldots m_{k_K, j_{1}}, m_{k_1, j_{J-1}} \ldots m_{k_K, j_{J-1}} \} \). The order is NOT CONSISTENT with mlogregr, robust_variance_mlogregr and clustered_variance_mlogregr. This is deliberate because the interfaces of all the other functions will be moved to match that used in marginal.
Selectively picking the marginal effects of the independent variables will return the marginal effects of ALL categories associated with the independent variable.
Here the 'independent_varname' can be the name of a column, which contains array of numeric values. It can also have a format of string 'array[1, x1, x2, x3]', where x1, x2 and x3 are all column names.
Here the 'vars' is an index list (base 1) with the set of indices amongst the independent variables that must be selected. One can also chose NULL< > to pick all indepependent variables.

Examples:

Create the sample data set:
sql> SELECT * FROM data; id | second_attack | treatment | trait_anxiety ----+---------------+-----------+--------------- 1 | 1 | 1 | 70 3 | 1 | 1 | 50 5 | 1 | 0 | 40 7 | 1 | 0 | 75 9 | 1 | 0 | 70 11 | 0 | 1 | 65 13 | 0 | 1 | 45 15 | 0 | 1 | 40 17 | 0 | 0 | 55 ...

For function summary information. Run
sql> select margins_logregr('help'); OR sql> select margins_logregr(); OR sql> select margins_logregr('');

Run the logistic regression function and then compute the marginal effects of all variables in the regression:
sql> select margins_logregr('patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]'); sql> select * from result_table; margins | {-0.970665392796,-0.156214190168,0.0181587690137} coef | {-6.36346994178179,-1.02410605239327,0.119044916668605} std_err | {0.802871454422,0.292691682191,0.0137459874022} t_stats | {-1.2089922832,-0.533715850748,1.32102325446} p_values | {0.243212810329,0.600447858606,0.204000202116} -# Alternate Syntax: Run the logistic regression function and then compute the marginal effects if all variables in the regression: sql> select margins_logregr('patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]', NULL, NULL); sql> select * from result_table; margins | {-0.970665392796,-0.156214190168,0.0181587690137} coef | {-6.36346994178179,-1.02410605239327,0.119044916668605} std_err | {0.802871454422,0.292691682191,0.0137459874022} t_stats | {-1.2089922832,-0.533715850748,1.32102325446} p_values | {0.243212810329,0.600447858606,0.204000202116} -# Run the logistic regression function and then compute the marginal effects of the first variable in the regression sql> select margins_logregr('patients', 'result_table', 'second_attack', 'ARRAY[1, treatment, trait_anxiety]', NULL, ARRAY[1]); sql> select * from result_table; margins | {-0.970665392796} coef | {-6.36346994178179} std_err | {0.802871454422} t_stats | {-1.2089922832} p_values | {0.243212810329}

Examples:

Create the sample data set (the full dataset has 3 categories):
sql> SELECT * FROM data; id | feature_1 | feature_2 | category ----+---------------+-----------+--------------- 1 | 1 | 35 | 1 3 | 2 | 33 | 0 5 | 3 | 39 | 1 7 | 1 | 37 | 1 9 | 2 | 31 | 1 11 | 3 | 36 | 1 13 | 2 | 36 | 1 15 | 2 | 36 | 0 17 | 2 | 31 | 5 ...

For function summary information. Run
sql> select margins_mlogregr('help'); OR sql> select margins_mlogregr(); OR sql> select margins_mlogregr('');

Run the regression function and then compute the marginal effects of all variables in the regression (see docs for detailed order)
sql> select margins_mlogregr('test_data', 'result_table', 'category', 'ARRAY[1, feature_1, feature_2]'); sql> select * from result_table; margins | {0.741613239156,-0.032868883552,-0.0144502990691,-0.972055011831,0.112337273885,0.0172621628253} std_err | {0.183172236055,0.044184899499,0.00332608999704,0.263532615748,0.0555196094594,0.00457999429836} t_stats | {4.04872078394,-0.743894043547,-4.34453038911,-3.68855676202,2.02338011702,3.76903587663} p_values | {7.43784735554e-05,0.457840607871,2.24855476205e-05,0.000292799037776,0.0444060346517,0.000217384008015}

Run the regression and compute the marginals effects for the first dependent variable (all categories)
sql> select margins_mlogregr('test_data', 'result_table', 'category', 'ARRAY[1, feature_1, feature_2]', 0, NULL, ARRAY[1]); sql> select * from result_table; margins | {0.741613239156,-0.972055011831} std_err | {0.183172236055,0.263532615748} t_stats | {4.04872078394,-3.68855676202} p_values | {7.43784735554e-05,0.000292799037776}

-# For function usage information. sql> select margins_mlogregr('usage');

Literature:

[1] mfx function in STATA: http://www.stata.com/help.cgi?mfx_option

See Also
File marginal.sql_in documenting the SQL functions.