MADlib 1.3 User Documentation
The Clustered Variance module adjusts standard errors to account for clustering in the data. For example, replicating a dataset 100 times should not increase the precision of parameter estimates, yet performing this procedure under the IID assumption will do exactly that. As another example, in economics-of-education research it is reasonable to expect that the error terms for children in the same class are not independent; clustering the standard errors can correct for this.
The MADlib Clustered Variance module includes functions to compute clustered variance estimates for linear, logistic, and multinomial logistic regression.
The clustered variance linear regression training function has the following syntax.
clustered_variance_linregr ( tbl_data, tbl_output, depvar, indvar, clustervar, groupingvar )
Arguments
The clustered variance logistic regression training function has the following syntax.
clustered_variance_logregr( tbl_data, tbl_output, depvar, indvar, clustervar, groupingvar, max_iter, optimizer, tolerance, verbose )
Arguments
The clustered variance multinomial logistic regression training function has the following syntax.
clustered_variance_mlogregr( tbl_data, tbl_output, depvar, indvar, clustervar, ref_category, groupingvar, max_iter, optimizer, tolerance, verbose )
Arguments
SELECT madlib.clustered_variance_linregr();
DROP TABLE IF EXISTS tbl_output;
SELECT madlib.clustered_variance_linregr( 'abalone',
                                          'tbl_output',
                                          'rings',
                                          'ARRAY[1, diameter, length, width]',
                                          'sex',
                                          NULL );
SELECT * FROM tbl_output;
SELECT madlib.clustered_variance_logregr();
DROP TABLE IF EXISTS tbl_output;
SELECT madlib.clustered_variance_logregr( 'abalone',
                                          'tbl_output',
                                          'rings < 10',
                                          'ARRAY[1, diameter, length, width]',
                                          'sex' );
SELECT * FROM tbl_output;
Assume that the data can be separated into \(m\) clusters. Usually this can be done by grouping the data table according to one or more columns.
The estimator has a similar form to the usual sandwich estimator
\[ S(\vec{c}) = B(\vec{c}) M(\vec{c}) B(\vec{c}) \]
The bread part is the same as in the Huber-White sandwich estimator
\begin{eqnarray} B(\vec{c}) & = & \left(-\sum_{i=1}^{n} H(y_i, \vec{x}_i, \vec{c})\right)^{-1}\\ & = & \left(-\sum_{i=1}^{n}\frac{\partial^2 l(y_i, \vec{x}_i, \vec{c})}{\partial c_\alpha \partial c_\beta}\right)^{-1} \end{eqnarray}
where \(H\) is the Hessian matrix, which is the second derivative of the target function
\[ L(\vec{c}) = \sum_{i=1}^n l(y_i, \vec{x}_i, \vec{c})\ . \]
The meat part is different
\[ M(\vec{c}) = \bf{A}^T\bf{A} \]
where the \(m\)-th row of \(\bf{A}\) is
\[ A_m = \sum_{i\in G_m}\frac{\partial l(y_i,\vec{x}_i,\vec{c})}{\partial \vec{c}} \]
where \(G_m\) is the set of rows that belong to the \(m\)-th cluster.
We can compute the per-cluster contributions to \(B\) and \(A\) in a single scan through the data table using an aggregate function. These contributions are then summed over all clusters, outside of the aggregate function, to form the full \(B\) and \(A\). Finally, the matrix multiplications are performed in a separate function on the master node.
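To make the computation concrete, the following is a minimal pure-Python sketch of the clustered sandwich estimator for ordinary linear regression, where the per-row score is \(\vec{x}_i e_i\) and the bread is \((X^TX)^{-1}\). The toy data, cluster assignment, and variable names are hypothetical; MADlib performs the equivalent accumulation over database tables in parallel.

```python
import random

random.seed(0)

# Toy data: 3 clusters; model y = c0 + c1*x + noise (hypothetical data)
n = 90
xs = [random.gauss(0, 1) for _ in range(n)]
cluster = [i % 3 for i in range(n)]
ys = [1.0 + 2.0 * x + random.gauss(0, 1) for x in xs]

# Design rows are (1, x); accumulate X'X and X'y in one scan
sxx = [[0.0, 0.0], [0.0, 0.0]]
sxy = [0.0, 0.0]
for x, y in zip(xs, ys):
    row = (1.0, x)
    for a in range(2):
        sxy[a] += row[a] * y
        for b in range(2):
            sxx[a][b] += row[a] * row[b]

# Bread: B = (X'X)^{-1}, via the 2x2 inverse formula
det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
B = [[ sxx[1][1] / det, -sxx[0][1] / det],
     [-sxx[1][0] / det,  sxx[0][0] / det]]

# OLS coefficients c = B @ X'y
c = [B[0][0] * sxy[0] + B[0][1] * sxy[1],
     B[1][0] * sxy[0] + B[1][1] * sxy[1]]

# Meat: row A_m sums the score x_i * e_i within cluster m
A = [[0.0, 0.0] for _ in range(3)]
for x, y, g in zip(xs, ys, cluster):
    e = y - (c[0] + c[1] * x)  # residual
    A[g][0] += 1.0 * e
    A[g][1] += x * e

# M = A'A, then sandwich S = B M B
M = [[sum(A[g][a] * A[g][b] for g in range(3)) for b in range(2)]
     for a in range(2)]
BM = [[sum(B[a][k] * M[k][b] for k in range(2)) for b in range(2)]
      for a in range(2)]
S = [[sum(BM[a][k] * B[k][b] for k in range(2)) for b in range(2)]
     for a in range(2)]
clustered_se = [S[0][0] ** 0.5, S[1][1] ** 0.5]
```

The per-cluster loop filling `A` mirrors the aggregate step described above, and the final three matrix products correspond to the work done on the master node.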
When multinomial logistic regression is computed before the multinomial clustered variance calculation, it uses a default reference category of zero, and the regression coefficients are included in the output table. The regression coefficients in the output appear in the same order as in the multinomial logistic regression function, which is described below. For a problem with \( K \) independent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( {m_{k,j}} \) denote the coefficient for independent variable \( k \) and category \( j \). The output is \( {m_{k_1, j_0}, m_{k_1, j_1} \ldots m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1} \ldots m_{k_K, j_{J-1}}} \).

The order is NOT CONSISTENT with the multinomial regression marginal effect calculation with function marginal_mlogregr. This is deliberate because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.
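As a quick illustration of this ordering (a sketch; the coefficient labels are purely illustrative), the flattened output runs over categories fastest and variables slowest:

```python
# Coefficient ordering for K independent variables and J categories:
# the variable index k varies slowest, the category index j fastest.
K, J = 2, 3
order = [f"m[k={k},j={j}]" for k in range(1, K + 1) for j in range(J)]
# All J coefficients for variable 1 come first, then those for variable 2.
```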