MADlib
1.3
User Documentation
The functions in this module calculate robust variance (Huber-White estimates) for linear regression, logistic regression, and multinomial logistic regression. They are useful for estimating coefficient variances in a dataset with potentially noisy outliers. The Huber-White estimator implemented here is identical to the "HC0" sandwich estimator in the R package "sandwich".
The interfaces for robust linear, logistic, and multinomial logistic regression are similar. Each regression type has its own training function. The regression results are saved in an output table whose columns differ slightly depending on the regression type.
The robust_variance_linregr() function has the following syntax:
robust_variance_linregr( source_table, out_table, dependent_varname, independent_varname, grouping_cols )
{TABLE|VIEW} sourceName ( outputTable VARCHAR, regressionType VARCHAR, dependentVariable VARCHAR, independentVariable VARCHAR )
Column | Description |
---|---|
coef | DOUBLE PRECISION[]. Vector of the coefficients of the regression. |
std_err | DOUBLE PRECISION[]. Vector of the standard error of the coefficients. |
t_stats | DOUBLE PRECISION[]. Vector of the t-stats of the coefficients. |
p_values | DOUBLE PRECISION[]. Vector of the p-values of the coefficients. |
The robust_variance_logregr() function has the following syntax:
robust_variance_logregr( source_table, out_table, dependent_varname, independent_varname, grouping_cols, max_iter, optimizer, tolerance, print_warnings )
Column | Description |
---|---|
coef | Vector of the coefficients of the regression. |
std_err | Vector of the standard error of the coefficients. |
z_stats | Vector of the z-stats of the coefficients. |
p_values | Vector of the p-values of the coefficients. |
The robust_variance_mlogregr() function has the following syntax:
robust_variance_mlogregr( source_table, out_table, dependent_varname, independent_varname, ref_category, grouping_cols, max_iter, optimizer, tolerance, print_warnings )
Column | Description |
---|---|
ref_category | The reference category used for modeling. |
coef | Vector of the coefficients of the regression. |
std_err | Vector of the standard error of the coefficients. |
z_stats | Vector of the z-stats of the coefficients. |
p_values | Vector of the p-values of the coefficients. |
SELECT madlib.robust_variance_logregr();
DROP TABLE IF EXISTS patients;
CREATE TABLE patients (
    id INTEGER NOT NULL,
    second_attack INTEGER,
    treatment INTEGER,
    trait_anxiety INTEGER
);
COPY patients FROM STDIN WITH DELIMITER '|';
 1 | 1 | 1 | 70
 3 | 1 | 1 | 50
 5 | 1 | 0 | 40
 7 | 1 | 0 | 75
 9 | 1 | 0 | 70
11 | 0 | 1 | 65
13 | 0 | 1 | 45
15 | 0 | 1 | 40
17 | 0 | 0 | 55
19 | 0 | 0 | 50
 2 | 1 | 1 | 80
 4 | 1 | 0 | 60
 6 | 1 | 0 | 65
 8 | 1 | 0 | 80
10 | 1 | 0 | 60
12 | 0 | 1 | 50
14 | 0 | 1 | 35
16 | 0 | 1 | 50
18 | 0 | 0 | 45
20 | 0 | 0 | 60
\.
DROP TABLE IF EXISTS patients_logregr;
SELECT madlib.robust_variance_logregr(
    'patients',
    'patients_logregr',
    'second_attack',
    'ARRAY[1, treatment, trait_anxiety]'
);
\x on
Expanded display is on.
SELECT * FROM patients_logregr;
Result:
-[ RECORD 1 ]-------------------------------------------------------
coef     | {-6.36346994178179,-1.02410605239327,0.119044916668605}
std_err  | {3.45872062333648,1.1716192578234,0.0534328864185018}
z_stats  | {-1.83983346294192,-0.874094587943036,2.22793348156809}
p_values | {0.0657926909738889,0.382066744585541,0.0258849510757339}
Alternatively, unnest the arrays in the results for easier reading of output.
\x off
SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) AS attribute,
       unnest(coef) AS coefficient,
       unnest(std_err) AS standard_error,
       unnest(z_stats) AS z_stat,
       unnest(p_values) AS pvalue
FROM patients_logregr;
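The columns in the example output can be cross-checked outside the database. The sketch below is a standalone Python check, not part of MADlib. It assumes (the page does not state this explicitly) that each z-statistic is the coefficient divided by its robust standard error and that each p-value is the two-sided tail probability of a standard normal distribution. The helper name `wald_z_and_p` is hypothetical.

```python
import math

# Values copied from the patients_logregr output shown above.
coef = [-6.36346994178179, -1.02410605239327, 0.119044916668605]
std_err = [3.45872062333648, 1.1716192578234, 0.0534328864185018]


def wald_z_and_p(coefs, std_errs):
    """Return (z, p) pairs, assuming z = coef / std_err and a two-sided normal p-value."""
    pairs = []
    for c, s in zip(coefs, std_errs):
        z = c / s
        # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
        p = math.erfc(abs(z) / math.sqrt(2.0))
        pairs.append((z, p))
    return pairs


names = ["intercept", "treatment", "trait_anxiety"]
for name, (z, p) in zip(names, wald_z_and_p(coef, std_err)):
    print(f"{name:>14}  z = {z: .6f}  p = {p:.6f}")
```

Under these assumptions the printed values can be compared directly with the z_stats and p_values columns above. For robust_variance_linregr() the analogous check would need a t distribution, which this sketch does not cover.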
When doing regression analysis, we are sometimes interested in the variance of the computed coefficients \( \boldsymbol c \). While the built-in regression functions provide variance estimates, we may prefer a robust variance estimate.
The robust variance calculation can be expressed in sandwich form:
\[ S( \boldsymbol c) = B( \boldsymbol c) M( \boldsymbol c) B( \boldsymbol c) \]
where \( B( \boldsymbol c)\) and \( M( \boldsymbol c)\) are matrices. The \( B( \boldsymbol c) \) matrix, also known as the bread, is relatively straightforward and can be computed as
\[ B( \boldsymbol c) = n\left(\sum_{i=1}^n -H(y_i, x_i, \boldsymbol c) \right)^{-1} \]
where \( H \) is the Hessian matrix.
The \( M( \boldsymbol c)\) matrix has several variations, each with different robustness properties. The form implemented here is the Huber-White sandwich operator, which takes the form
\[ M_{H} =\frac{1}{n} \sum_{i=1}^n \psi(y_i,x_i, \boldsymbol c)^T \psi(y_i,x_i, \boldsymbol c). \]
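As a worked special case, consider linear regression. This is an illustrative sketch under stated assumptions, not taken from the MADlib source. Assume the per-observation objective is \( -\tfrac{1}{2}(y_i - x_i^T \boldsymbol c)^2 \) with \( x_i \) a column vector, and, following the convention in [3], that \( \psi \) is the derivative of the objective with respect to \( \boldsymbol c \), written as a row vector so that \( \psi^T \psi \) is a matrix. Then

\[ \psi(y_i, x_i, \boldsymbol c) = (y_i - x_i^T \boldsymbol c)\, x_i^T, \qquad H(y_i, x_i, \boldsymbol c) = -x_i x_i^T, \]

so that

\[ B( \boldsymbol c) = n\left(\sum_{i=1}^n x_i x_i^T \right)^{-1}, \qquad M_{H} = \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \boldsymbol c)^2\, x_i x_i^T. \]

Dividing the sandwich by \( n \), which is the scaling used by the R package "sandwich" [3] and is assumed here rather than stated on this page, gives

\[ \frac{1}{n} S( \boldsymbol c) = \left(\sum_{i=1}^n x_i x_i^T \right)^{-1} \left(\sum_{i=1}^n (y_i - x_i^T \boldsymbol c)^2\, x_i x_i^T \right) \left(\sum_{i=1}^n x_i x_i^T \right)^{-1}, \]

which is the familiar HC0 covariance estimate for the coefficients. The robust standard errors are the square roots of its diagonal entries.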
The above method for calculating robust variance (Huber-White estimates) is implemented for linear regression, logistic regression, and multinomial logistic regression. It is useful for estimating coefficient variances in a dataset with potentially noisy outliers. The Huber-White estimator implemented here is identical to the "HC0" sandwich estimator in the R package "sandwich".
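The bread and meat formulas can be tried on the patients example with a short, dependency-free Python sketch. It is not MADlib's implementation. It assumes a plain Newton iteration for the logistic fit, takes \( \psi \) to be the per-row gradient of the log-likelihood, and divides the sandwich by \( n \) to obtain the covariance (the scaling used by the R package "sandwich"). All function names are hypothetical.

```python
import math

# (second_attack, treatment, trait_anxiety) for the 20 rows of the patients table.
ROWS = [
    (1, 1, 70), (1, 1, 50), (1, 0, 40), (1, 0, 75), (1, 0, 70),
    (0, 1, 65), (0, 1, 45), (0, 1, 40), (0, 0, 55), (0, 0, 50),
    (1, 1, 80), (1, 0, 60), (1, 0, 65), (1, 0, 80), (1, 0, 60),
    (0, 1, 50), (0, 1, 35), (0, 1, 50), (0, 0, 45), (0, 0, 60),
]
X = [[1.0, float(t), float(a)] for _, t, a in ROWS]  # ARRAY[1, treatment, trait_anxiety]
Y = [float(y) for y, _, _ in ROWS]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def inverse(a):
    """Invert a small dense matrix by solving against each unit vector."""
    n = len(a)
    cols = [solve(a, [float(i == j) for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def accumulate(x, y, c):
    """Return (gradient, sum of -H, sum of psi^T psi) of the log-likelihood at c."""
    d = len(c)
    grad = [0.0] * d
    neg_hess = [[0.0] * d for _ in range(d)]
    outer = [[0.0] * d for _ in range(d)]
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-sum(u * v for u, v in zip(xi, c))))
        psi = [(yi - p) * u for u in xi]  # per-row score vector
        for a in range(d):
            grad[a] += psi[a]
            for b in range(d):
                neg_hess[a][b] += p * (1.0 - p) * xi[a] * xi[b]
                outer[a][b] += psi[a] * psi[b]
    return grad, neg_hess, outer


def fit_logistic(x, y, iterations=25):
    """Newton's method started from zero (an assumed optimizer for this sketch)."""
    c = [0.0] * len(x[0])
    for _ in range(iterations):
        grad, neg_hess, _unused = accumulate(x, y, c)
        c = [ci + si for ci, si in zip(c, solve(neg_hess, grad))]
    return c


def robust_std_err(x, y, c):
    """Square roots of the diagonal of B M B / n."""
    n = len(x)
    _grad, neg_hess, outer = accumulate(x, y, c)
    bread = [[n * v for v in row] for row in inverse(neg_hess)]  # B(c)
    meat = [[v / n for v in row] for row in outer]               # M_H
    s = matmul(matmul(bread, meat), bread)                       # S(c)
    return [math.sqrt(s[a][a] / n) for a in range(len(c))]


coef = fit_logistic(X, Y)
std_err = robust_std_err(X, Y, coef)
print("coef   ", coef)
print("std_err", std_err)
```

If the estimator is HC0 as described, the printed values should be close to the coef and std_err columns in the example output.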
When multinomial logistic regression is computed before the multinomial robust regression, it uses a default reference category of zero, and the regression coefficients are included in the output table. The regression coefficients in the output are in the same order as in the multinomial logistic regression function, which is described below. For a problem with \( K \) independent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( m_{k,j} \) denote the coefficient for independent variable \( k \) and category \( j \). The output is \( \{m_{k_1, j_0}, m_{k_1, j_1}, \ldots, m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1}, \ldots, m_{k_K, j_{J-1}}\} \). This order is NOT CONSISTENT with the order used by the multinomial regression marginal effects function marginal_mlogregr. This is deliberate, because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.
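The ordering described above amounts to a variable-major layout: all categories for variable 1, then all categories for variable 2, and so on. The Python sketch below shows how a position in the flat coef array could be mapped back to a (variable, category) pair. It takes the description literally, with \( J \) entries per variable. If the output in practice omits the reference category, the per-variable count would differ. Both helper names are hypothetical.

```python
def coef_position(k, j, num_categories):
    """0-based position of m_{k,j} in the flat array; k is 1-based, j is 0-based."""
    if k < 1 or not 0 <= j < num_categories:
        raise ValueError("expected k >= 1 and 0 <= j < num_categories")
    return (k - 1) * num_categories + j


def coef_labels(num_variables, num_categories):
    """(k, j) pairs in the order the coefficients are described as being stored."""
    return [(k, j)
            for k in range(1, num_variables + 1)
            for j in range(num_categories)]


# Example: K = 2 variables, J = 3 categories.
for position, (k, j) in enumerate(coef_labels(2, 3)):
    print(position, f"m_{{{k},{j}}}")
```

Keeping the mapping in one helper makes it easier to adapt if the ordering is later changed to match marginal_mlogregr.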
[1] vce(cluster) function in STATA: http://www.stata.com/help.cgi?vce_option
[2] clustered estimators in R: http://people.su.se/~ma/clustering.pdf
[3] Achim Zeileis: Object-oriented Computation of Sandwich Estimators. Research Report Series / Department of Statistics and Mathematics, 37. Department of Statistics and Mathematics, WU Vienna University of Economics and Business, Vienna. http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf