2.1.0
User Documentation for Apache MADlib

A marginal effect (ME) or partial effect measures the effect on the conditional mean of \( y \) for a change in one of the regressors, say \(X_k\). In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, specialized algorithms are required for calculating ME. The marginal effect computed is the average of the marginal effect at every data point present in the source table.

MADlib provides marginal effects regression functions for linear, logistic and multinomial logistic regressions. The implementation is similar to reference [1].

Marginal Effects with Interaction Terms
margins( model_table,
         output_table,
         x_design,
         source_table,
         marginal_vars
       )
Arguments
model_table
VARCHAR. The name of the model table, which is the output of logregr_train() or mlogregr_train().
output_table
VARCHAR. The name of the result table. The output table has the following columns.
variables INTEGER[]. The indices of the basis variables.
margins DOUBLE PRECISION[]. The marginal effects.
std_err DOUBLE PRECISION[]. An array of the standard errors, computed using the delta method.
z_stats DOUBLE PRECISION[]. An array of the z-stats of the marginal effects.
p_values DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.
x_design (optional)

VARCHAR, default: NULL. The design of independent variables, necessary only if interaction term or indicator (categorical) terms are present. This parameter is necessary since the independent variables in the underlying regression is not parsed to extract the relationship between variables.

Example: The independent_varname in the regression method can be specified in either of the following ways:

  • ‘array[1, color_blue, color_green, gender_female, gpa, gpa^2, gender_female*gpa, gender_female*gpa^2, weight]’
  • ‘x’

In the second version, the column x is an array containing data identical to that expressed in the first version, computed in a prior data preparation step. Supply an x_design argument to the margins() function in the following way:

  • ‘1, i.color_blue.color, i.color_green.color, i.gender_female, gpa, gpa^2, gender_female*gpa, gender_female*gpa^2, weight’

The variable names ('gpa', 'weight', ...), referred to here as identifiers, should be unique for each basis variable and need not be the same as the original variable name in independent_varname. They should, however, be in the same order as the corresponding variables in independent_varname. The length of x_design is expected to be the same as the length of independent_varname. Each identifier name can contain only alphanumeric characters and the underscore.

Indicator (dummy) variables are prefixed with an 'i.' (This is only necessary for the basis term; it is not needed in the interaction terms.) Indicator variables that are obtained from the same categorical variable (for example, 'color_blue' and 'color_green') need to have a common and unique suffix (for example, '.color'). The '.' is used to add the prefix and suffix. If a reference indicator variable is present, it should contain the prefix 'ir.'.

An identifier may contain alphanumeric characters and underscores. To include other characters, the string must be double-quoted. Escape-characters are not currently supported.

source_table (optional)

VARCHAR, default: NULL. Name of the data table to apply marginal effects on. If not provided or NULL then the marginal effects are computed on the training table.

marginal_vars (optional)
VARCHAR, default: NULL. Comma-separated string containing specific variable identifiers to calculate marginal effects on. When NULL, marginal effects for all variables are returned.
Note
No output will be provided for the reference indicator variable, since the marginal effect for that variable is undefined. If a reference variable is included in the independent variables and marginal_vars, the margins() function will ignore that variable for the output. The variable can still be included in the regression and margins, since it will affect the values for other related indicator variables.

Marginal Effects for Logistic Regression
Warning
This function has been deprecated in favor of the margins() function.
margins_logregr( source_table,
                 output_table,
                 dependent_variable,
                 independent_variable,
                 grouping_cols,
                 marginal_vars,
                 max_iter,
                 optimizer,
                 tolerance,
                 verbose_mode
               )

Arguments

source_table
VARCHAR. The name of the data table.
output_table

VARCHAR. The name of the result table. The output table has the following columns.

margins DOUBLE PRECISION[]. The marginal effects.
std_err DOUBLE PRECISION[]. An array of the standard errors, using the delta method.
z_stats DOUBLE PRECISION[]. An array of the z-stats of the marginal effects.
p_values DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.

A summary table named <output_table>_summary is also created, which is the same as the summary table created by logregr_train() function. Refer to the documentation for logistic regression for details.

dependent_variable
VARCHAR. The name of the column for dependent variables.
independent_variable
VARCHAR. The name of the column for independent variables. Can be any SQL expression that evaluates to an array.
grouping_cols (optional)
VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.
marginal_vars (optional)
INTEGER[], default: NULL. An index list (base 1) representing the independent variables to compute marginal effects on. When NULL, computes marginal effects on all variables.
max_iter (optional)
INTEGER, default: 20. The maximum number of iterations for the logistic regression.
optimizer (optional)
VARCHAR, default: 'irls'. The optimizer to use for the logistic regression: newton/irls, cg, or igd.
tolerance (optional)
DOUBLE PRECISION, default: 1e-4. Termination criterion for logistic regression (relative).
verbose_mode (optional)
BOOLEAN, default FALSE. When TRUE, provides verbose output of the results of training.

Marginal Effects for Multinomial Logistic Regression
Warning
This function has been deprecated in favor of the margins() function.
margins_mlogregr( source_table,
                  out_table,
                  dependent_varname,
                  independent_varname,
                  ref_category,
                  grouping_cols,
                  marginal_vars,
                  optimizer_params,
                  verbose_mode
                )

Arguments

source_table
VARCHAR. The name of data table.
out_table

VARCHAR. The name of result table. The output table has the following columns.

category The category.
ref_category The refererence category used for modeling.
margins DOUBLE PRECISION[]. The marginal effects.
std_err DOUBLE PRECISION[]. An array of the standard errors, using the delta method.
z_stats DOUBLE PRECISION[]. An array of the z-stats of the marginal effects.
p_values DOUBLE PRECISION[]. An array of the Wald p-values of the marginal effects.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by mlogregr_train() function. Refer to the documentation for multinomial logistic regression for details.

dependent_varname
VARCHAR. The name of the column for dependent variables.
independent_varname
VARCHAR. The name of the column for independent variables. Can be any SQL expression that evaluates to an array.
ref_category (optional)
INTEGER, default: 0. Reference category for the multinomial logistic regression.
grouping_cols (optional)
VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.
marginal_vars(optional)
INTEGER[], default: NULL. An index list (base 1) representing the independent variables to compute marginal effects on. When NULL, computes marginal effects on all variables.
optimizer_params (optional)
TEXT, default: NULL, which uses the default values of optimizer parameters: max_iter=20, optimizer='newton', tolerance=1e-4. It should be a string that contains 'key=value' pairs separated by commas.
verbose_mode (optional)
BOOLEAN, default FALSE. When TRUE, provides verbose output of the results of training.

Examples
  1. Create the sample data set. Use the patients dataset from the Logistic Regression examples.
    DROP TABLE IF EXISTS patients;
    CREATE TABLE patients( id INTEGER NOT NULL,
                           second_attack INTEGER,
                           treatment INTEGER,
                           trait_anxiety INTEGER);
    INSERT INTO patients VALUES
    (1,  1, 1, 70),
    (2,  1, 1, 80),
    (3,  1, 1, 50),
    (4,  1, 0, 60),
    (5,  1, 0, 40),
    (6,  1, 0, 65),
    (7,  1, 0, 75),
    (8,  1, 0, 80),
    (9,  1, 0, 70),
    (10, 1, 0, 60),
    (11, 0, 1, 65),
    (12, 0, 1, 50),
    (13, 0, 1, 45),
    (14, 0, 1, 35),
    (15, 0, 1, 40),
    (16, 0, 1, 50),
    (17, 0, 0, 55),
    (18, 0, 0, 45),
    (19, 0, 0, 50),
    (20, 0, 0, 60);
    
  2. Run logistic regression to get the model, compute the marginal effects of all variables, and view the results.
    DROP TABLE IF EXISTS model_table;
    DROP TABLE IF EXISTS model_table_summary;
    DROP TABLE IF EXISTS margins_table;
    SELECT madlib.logregr_train( 'patients',
                                 'model_table',
                                 'second_attack',
                                 'ARRAY[1, treatment, trait_anxiety, treatment * trait_anxiety]'
                               );
    SELECT madlib.margins( 'model_table',
                           'margins_table',
                           'intercept, treatment, trait_anxiety, treatment*trait_anxiety',
                           NULL,
                           NULL
                         );
    \x ON
    SELECT * FROM margins_table;
    
    Result:
    variables | {intercept, treatment, trait_anxiety}
    margins   | {-0.876046514609573,-0.0648833521465306,0.0177196513589633}
    std_err   | {0.551714275062467,0.373592457067442,0.00458001207971933}
    z_stats   | {-1.58786269307674,-0.173674149247659,3.86890930646828}
    p_values  | {0.112317391159946,0.862121554662231,0.000109323294026272}
    
  3. Compute the marginal effects of the first variable using the previous model and view the results (using different names in 'x_design').
    DROP TABLE IF EXISTS result_table;
    SELECT madlib.margins( 'model_table',
                           'result_table',
                           'i, tre, tra, tre*tra',
                           NULL,
                           'tre'
                         );
    SELECT * FROM result_table;
    
    Result:
    -[ RECORD 1 ]-------------------
    variables | {tre}
    margins   | {-0.110453283517281}
    std_err   | {0.228981529064089}
    z_stats   | {-0.482367656329023}
    p_values  | {0.629544793219806}
    
  4. Create a sample data set for multinomial logistic regression. (The full dataset has three categories.) Use the dataset from the Multinomial Regression example.
    \x OFF
    DROP TABLE IF EXISTS test3;
    CREATE TABLE test3 (
        feat1 INTEGER,
        feat2 INTEGER,
        cat INTEGER
    );
    INSERT INTO test3(feat1, feat2, cat) VALUES
    (1,35,1),
    (2,33,0),
    (3,39,1),
    (1,37,1),
    (2,31,1),
    (3,36,0),
    (2,36,1),
    (2,31,1),
    (2,41,1),
    (2,37,1),
    (1,44,1),
    (3,33,2),
    (1,31,1),
    (2,44,1),
    (1,35,1),
    (1,44,0),
    (1,46,0),
    (2,46,1),
    (2,46,2),
    (3,49,1),
    (2,39,0),
    (2,44,1),
    (1,47,1),
    (1,44,1),
    (1,37,2),
    (3,38,2),
    (1,49,0),
    (2,44,0),
    (3,61,2),
    (1,65,2),
    (3,67,1),
    (3,65,2),
    (1,65,2),
    (2,67,2),
    (1,65,2),
    (1,62,2),
    (3,52,2),
    (3,63,2),
    (2,59,2),
    (3,65,2),
    (2,59,0),
    (3,67,2),
    (3,67,2),
    (3,60,2),
    (3,67,2),
    (3,62,2),
    (2,54,2),
    (3,65,2),
    (3,62,2),
    (2,59,2),
    (3,60,2),
    (3,63,2),
    (3,65,2),
    (2,63,1),
    (2,67,2),
    (2,65,2),
    (2,62,2);
    
  5. Run the regression function and then compute the marginal effects of all variables in the regression.
    DROP TABLE IF EXISTS model_table;
    DROP TABLE IF EXISTS model_table_summary;
    DROP TABLE IF EXISTS result_table;
    SELECT madlib.mlogregr_train('test3', 'model_table', 'cat',
                                 'ARRAY[1, feat1, feat2, feat1*feat2]',
                                 0);
    SELECT madlib.margins('model_table',
                          'result_table',
                          'intercept, feat1, feat2, feat1*feat2');
    \x ON
    SELECT * FROM result_table;
    
    Result:
    -[ RECORD 1 ]+-------------------------------------------------------------
    category     | 1
    ref_category | 0
    variables    | {intercept,feat1,feat2}
    margins      | {2.38176571752675,-0.0545733108729351,-0.0147264917310351}
    std_err      | {0.851299967007829,0.0697049196489632,0.00374946341567828}
    z_stats      | {2.79779843748643,-0.782919070099622,-3.92762646235104}
    p_values     | {0.00514522099923651,0.43367463815468,8.57883141882439e-05}
    -[ RECORD 2 ]+-------------------------------------------------------------
    category     | 2
    ref_category | 0
    variables    | {intercept,feat1,feat2}
    margins      | {-1.99279068434949,0.0922540608068343,0.0168049205501686}
    std_err      | {0.742790306495022,0.0690712705200096,0.00202015384479213}
    z_stats      | {-2.68284422524683,1.33563578767686,8.31863404536785}
    p_values     | {0.00729989838349161,0.181668346802398,8.89828265128986e-17}
    

Note
The marginal_vars argument is a list with the names matching those in 'x_design'. If no 'x_design' is present (i.e. no interaction and no indicator variables), then marginal_vars must be the indices (base 1) of variables in 'independent_varname'. Use NULL to use all independent variables. It is important to note that the independent_varname array in the underlying regression is assumed to start with a lower bound index of 1. Arrays that don't follow this would result in an incorrect solution.

Technical Background

The standard approach to modeling dichotomous/binary variables (so \(y \in \{0, 1\} \)) is to estimate a generalized linear model under the assumption that \( y \) follows some form of Bernoulli distribution. Thus the expected value of \( y \) becomes,

\[ y = G(X' \beta), \]

where G is the specified binomial distribution. For logistic regression, the function \( G \) represents the inverse logit function.

In logistic regression:

\[ P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots \beta_j x_j)}} = \frac{1}{1 + e^{-z}} \implies \frac{\partial P}{\partial X_k} = \beta_k \cdot \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ = \beta_k \cdot P \cdot (1-P) \]

There are several methods for calculating the marginal effects for dichotomous dependent variables. This package uses the average of the marginal effects at every sample observation.

This is calculated as follows:

\[ \frac{\partial y}{\partial x_k} = \beta_k \frac{\sum_{i=1}^n P(y_i = 1)(1-P(y_i = 1))}{n}, \\ \text{where}, P(y_i=1) = g(X^{(i)}\beta) \]

We use the delta method for calculating standard errors on the marginal effects.

Literature

[1] Marginal effects in Stata: https://www.stata.com/

Related Topics

File marginal.sql_in documenting the SQL functions.