Collaboration diagram for Clustered Variance:

About:

Adjusting standard errors for clustering can be important. For example, replicating a dataset 100 times should not increase the precision of parameter estimates. However, performing this procedure with the IID assumption will actually do this. Another example is in economics of education research, it is reasonable to expect that the error terms for children in the same class are not independent. Clustering standard errors can correct for this.

Assume that the data can be separated into \(m\) clusters. Usually this can be done by grouping the data table according to one or multiple columns.

The estimator has a similar form to the usual sandwich estimator

\[ S(\vec{c}) = B(\vec{c}) M(\vec{c}) B(\vec{c}) \]

The bread part is the same as Huber-White sandwich estimator

\begin{eqnarray} B(\vec{c}) & = & \left(-\sum_{i=1}^{n} H(y_i, \vec{x}_i, \vec{c})\right)^{-1}\\ & = & \left(-\sum_{i=1}^{n}\frac{\partial^2 l(y_i, \vec{x}_i, \vec{c})}{\partial c_\alpha \partial c_\beta}\right)^{-1} \end{eqnarray}

where \(H\) is the hessian matrix, which is the second derivative of the target function

\[ L(\vec{c}) = \sum_{i=1}^n l(y_i, \vec{x}_i, \vec{c})\ . \]

The meat part is different

\[ M(\vec{c}) = \bf{A}^T\bf{A} \]

where the \(m\)-th row of \(\bf{A}\) is

\[ A_m = \sum_{i\in G_m}\frac{\partial l(y_i,\vec{x}_i,\vec{c})}{\partial \vec{c}} \]

where \(G_m\) is the set of rows that belong to the same cluster.

We can compute the quantities of \(B\) and \(A\) for each cluster during one scan through the data table in an aggregate function. Then sum over all clusters to the full \(B\) and \(A\) in the outside of the aggregate function. At last, the matrix mulplitications are done in a separate function on the master node.

When multinomial logistic regression is computed before the multinomial clustered variance calculation, it uses a default reference category of zero and the regression coefficients are included in the output table. The regression coefficients in the output are in the same order as multinomial logistic regression function, which is described below. For a problem with \( K \) dependent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( {m_{k,j}} \) denote the coefficient for dependent variable \( k \) and category \( j \). The output is \( {m_{k_1, j_0}, m_{k_1, j_1} \ldots m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1} \ldots m_{k_K, j_{J-1}}} \). The order is NOT CONSISTENT with the multinomial regression marginal effect calculation with function marginal_mlogregr. This is deliberate because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.

Usage:

Warning: The 'groupingCol' input parameter for all clustered functions are a placeholder, and the 'verbose' parameter is a placeholder for clustered_variance_mlogregr. These inputs will be implemented in a future release.

The clustered standard errors for linear regression

For a quick help message, run the following commands for linear regression

select madlib.clustered_variance_linregr();
select madlib.clustered_variance_linregr('help');
select madlib.clustered_variance_linregr('?');
select madlib.clustered_variance_linregr('usage');

For logistic regression, run the following commands to get short help messages inside psql

select madlib.clustered_variance_logregr();
select madlib.clustered_variance_logregr('help');
select madlib.clustered_variance_logregr('?');
select madlib.clustered_variance_logregr('usage');

For multinomial logistic regression, run the following commands to get short help messages inside psql

select madlib.clustered_variance_mlogregr();
select madlib.clustered_variance_mlogregr('help');
select madlib.clustered_variance_mlogregr('?');
select madlib.clustered_variance_mlogregr('usage');

SELECT madlib.clustered_variance_linregr (
    'tbl_data',    -- Data table name
    'tbl_output',  -- The result table
    'depvar',      -- An expression used as dependent variable
    'indvar',      -- An expression used as independent variable
    'clustervar',  -- The columns used as the cluster variables, separated by comma
    'groupingvar'  -- The columns used as the grouping variables, separated by comma
);

SELECT madlib.clustered_variance_logregr (
    'tbl_data',    -- Data table name
    'tbl_output',  -- The result table
    'depvar',      -- An expression used as dependent variable
    'indvar',      -- An expression used as independent variable
    'clustervar',  -- The columns used as the cluster variables, separated by comma
    'groupingvar', -- The columns used as the grouping variables, separated by comma
    max_iter,      -- Maximum iteration number for logistic regression, default 20
    'optimizer',   -- Optimization method for logistic regression, default 'irls'
    tolerance,     -- When difference of likelihoods in two consecutive iterations smaller than
                            -- this value, stops the computation. Default 0.0001
    verbose        -- Whether print detailed information when computing logistic regression,
                            -- default is False
);

SELECT madlib.clustered_variance_mlogregr (
    'tbl_data',    -- Data table name
    'tbl_output',  -- The result table
    'depvar',      -- An expression used as dependent variable
    'indvar',      -- An expression used as independent variable
    'clustervar',  -- The columns used as the cluster variables, separated by comma
    ref_category,  -- Reference category in the range of [0, num_category)
    'groupingvar', -- The columns used as the grouping variables, separated by comma
    max_iter,      -- Maximum iteration number for logistic regression, default 20
    'optimizer',   -- Optimization method for logistic regression, default 'irls'
    tolerance,     -- When difference of likelihoods in two consecutive iterations smaller than
                            -- this value, stops the computation. Default 0.0001
    verbose        -- Whether print detailed information when computing logistic regression,
                            -- default is False
);

Examples:

Note that we need to manually include an intercept term in the independent variable expression. The NULL value of groupingvar means that there is no grouping in the calculation.

sql> drop table if exists tbl_output;
sql> select madlib.clustered_variance_linregr ('abalone', 'tbl_output', 'rings', 'array[1, diameter, length, width]', 'sex', NULL);
sql> select * from tbl_output;
sql> ----------------------------------------------
sql> drop table if exists tbl_output;
sql> select madlib.clustered_variance_logregr ('abalone', 'tbl_output', 'rings < 10', 'array[1, diameter, length, width]', 'sex');
sql> select * from tbl_output;
sql> ----------------------------------------------
sql> drop table if exists tbl_output;
sql> select madlib.clustered_variance_mlogregr ('abalone', 'tbl_output', 'case when rings < 10 then 1 else 0 end', 'array[1, diameter, length, width]',  'sex', 0);
sql> select * from tbl_output;

Literature:

[1] Standard, Robust, and Clustered Standard Errors Computed in R, http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/

See Also: File clustered_variance.sql_in documenting the SQL function