Linear Regression

Ordinary least-squares (OLS) linear regression refers to a stochastic model in which the conditional mean of the dependent variable (usually denoted $$Y$$) is an affine function of the vector of independent variables (usually denoted $$\boldsymbol x$$). That is,

$E[Y \mid \boldsymbol x] = \boldsymbol c^T \boldsymbol x$

for some unknown vector of coefficients $$\boldsymbol c$$. The residuals are assumed to be independent, identically distributed Gaussians. That is, the (conditional) probability density of $$Y$$ is given by

$f(y \mid \boldsymbol x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \exp\left(-\frac{1}{2 \sigma^2} \cdot (y - \boldsymbol x^T \boldsymbol c)^2 \right) \,.$

OLS linear regression finds the vector of coefficients $$\boldsymbol c$$ that maximizes the likelihood of the observations.

Let

• $$\boldsymbol y \in \mathbf R^n$$ denote the vector of the $$n$$ observed values of the dependent variable,
• $$X \in \mathbf R^{n \times k}$$ denote the design matrix with $$n$$ rows and $$k$$ columns, containing the observed vectors of independent variables $$\boldsymbol x_i$$ as rows,
• $$X^T$$ denote the transpose of $$X$$,
• $$X^+$$ denote the pseudo-inverse of $$X$$.

Maximizing the likelihood is equivalent to maximizing the log-likelihood $$\sum_{i=1}^n \log f(y_i \mid \boldsymbol x_i)$$, which simplifies to minimizing the residual sum of squares $$RSS$$ (also called sum of squared residuals or sum of squared errors of prediction),

$RSS = \sum_{i=1}^n ( y_i - \boldsymbol c^T \boldsymbol x_i )^2 = (\boldsymbol y - X \boldsymbol c)^T (\boldsymbol y - X \boldsymbol c) \,.$
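
Spelling out this step: by the density above, the log-likelihood of the sample is

$\sum_{i=1}^n \log f(y_i \mid \boldsymbol x_i) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n ( y_i - \boldsymbol c^T \boldsymbol x_i )^2 \,,$

and the first term does not depend on $$\boldsymbol c$$, so maximizing over $$\boldsymbol c$$ is the same as minimizing the $$RSS$$.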

The first-order conditions yield that the $$RSS$$ is minimized at

$\boldsymbol c = (X^T X)^+ X^T \boldsymbol y \,.$
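
For reference, these first-order conditions are the normal equations, obtained by setting the gradient of the $$RSS$$ with respect to $$\boldsymbol c$$ to zero:

$\nabla_{\boldsymbol c} \, RSS = -2 X^T (\boldsymbol y - X \boldsymbol c) = \boldsymbol 0 \quad \Longleftrightarrow \quad X^T X \boldsymbol c = X^T \boldsymbol y \,.$

The stated $$\boldsymbol c = (X^T X)^+ X^T \boldsymbol y$$ is the minimum-norm solution of this system; when $$X^T X$$ is invertible, it is the unique solution $$(X^T X)^{-1} X^T \boldsymbol y$$.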

Computing the total sum of squares $$TSS$$, the explained sum of squares $$ESS$$ (also called the regression sum of squares), and the coefficient of determination $$R^2$$ is done according to the following formulas:

\begin{align*} ESS & = \boldsymbol y^T X \boldsymbol c - \frac{ \left( \sum_{i=1}^n y_i \right)^2 }{n} \\ TSS & = \sum_{i=1}^n y_i^2 - \frac{ \left( \sum_{i=1}^n y_i \right)^2 }{n} \\ R^2 & = \frac{ESS}{TSS} \end{align*}

Note: The last equality follows from the definition $$R^2 = 1 - \frac{RSS}{TSS}$$ and the fact that for linear regression $$TSS = RSS + ESS$$. A proof of the latter can be found, e.g., at: http://en.wikipedia.org/wiki/Sum_of_squares
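
A short self-contained sketch of that decomposition (assuming, as in the examples below, that the design matrix contains a constant intercept column $$\boldsymbol 1$$): writing $$\boldsymbol e = \boldsymbol y - X \boldsymbol c$$ for the residuals and $$\bar y$$ for the mean of the $$y_i$$,

$TSS = \| \boldsymbol y - \bar y \boldsymbol 1 \|^2 = \| X \boldsymbol c - \bar y \boldsymbol 1 \|^2 + \| \boldsymbol e \|^2 = ESS + RSS \,,$

where the cross term $$2 \, \boldsymbol e^T (X \boldsymbol c - \bar y \boldsymbol 1)$$ vanishes because the normal equations give $$X^T \boldsymbol e = \boldsymbol 0$$ and the intercept column gives $$\boldsymbol 1^T \boldsymbol e = 0$$.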

We estimate the variance $$Var[Y - \boldsymbol c^T \boldsymbol x \mid \boldsymbol x]$$ as

$\sigma^2 = \frac{RSS}{n - k}$

and compute the t-statistic for coefficient $$i$$ as

$t_i = \frac{c_i}{\sqrt{\sigma^2 \cdot \left( (X^T X)^{-1} \right)_{ii} }} \,.$

The $$p$$-value for coefficient $$i$$ gives the probability of seeing a value at least as extreme as the one observed, provided that the null hypothesis ($$c_i = 0$$) is true. Letting $$F_\nu$$ denote the cumulative distribution function of Student's t with $$\nu$$ degrees of freedom, the $$p$$-value for coefficient $$i$$ is therefore

$p_i = \Pr(|T| \geq |t_i|) = 2 \cdot (1 - F_{n - k}( |t_i| ))$

where $$T$$ is a Student's t-distributed random variable with $$n - k$$ degrees of freedom.
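
As a concrete check against the houses example below: there $$n = 15$$ observations and $$k = 4$$ coefficients leave $$n - k = 11$$ degrees of freedom, so the size coefficient with $$t \approx 3.612$$ yields

$p = 2 \cdot (1 - F_{11}(3.612)) \approx 0.0041 \,,$

matching the last entry of the p_values array in the example output.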

The condition number [2] $$\kappa(X) = \|X\|_2\cdot\|X^{+}\|_2$$ is computed as the product of two spectral norms [3] (the pseudo-inverse $$X^{+}$$ takes the place of $$X^{-1}$$, since the design matrix is in general rectangular). The spectral norm of a matrix $$X$$ is the largest singular value of $$X$$, i.e., the square root of the largest eigenvalue of the positive-semidefinite matrix $$X^{*}X$$:

$\|X\|_2 = \sqrt{\lambda_{\max}\left(X^{*}X\right)}\ ,$

where $$X^{*}$$ is the conjugate transpose of $$X$$. The condition number of a linear regression problem is a worst-case measure of how sensitive the result is to small perturbations of the input. A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
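
Equivalently, for a design matrix of full column rank the condition number is the ratio of the extreme singular values,

$\kappa(X) = \frac{\sigma_{\max}(X)}{\sigma_{\min}(X)} \,.$

In the houses example below, condition_no comes out on the order of $$10^4$$, above the rule-of-thumb threshold of 1000, so those coefficient estimates should be read with multicollinearity in mind.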

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} sourceName (
...
dependentVariable FLOAT8,
independentVariables FLOAT8[],
...
)
Usage:

(1) The Simple Interface

• Get vector of coefficients $$\boldsymbol c$$ and all diagnostic statistics:
SELECT (madlib.linregr(dependentVariable,
independentVariables)).*
FROM sourceName;
Output:
coef | r2 | std_err | t_stats | p_values | condition_no
-----+----+---------+---------+----------+-------------
...

• Get vector of coefficients $$\boldsymbol c$$:
SELECT (madlib.linregr(dependentVariable,
independentVariables)).coef
FROM sourceName;
• Get a subset of the output columns, e.g., only the array of coefficients $$\boldsymbol c$$, the coefficient of determination $$R^2$$, and the array of p-values $$\boldsymbol p$$:
SELECT (lr).coef, (lr).r2, (lr).p_values
FROM (
    SELECT madlib.linregr(dependentVariable,
                          independentVariables) AS lr
    FROM sourceName
) AS subq;
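
• If you prefer one output row per coefficient instead of arrays, the following sketch pairs each coefficient with its p-value. It relies on plain PostgreSQL unnest() rather than on the module's interface; since both arrays have equal length, the two calls expand in lockstep:
SELECT unnest((lr).coef)     AS coef,
       unnest((lr).p_values) AS p_value
FROM (
    SELECT madlib.linregr(dependentVariable,
                          independentVariables) AS lr
    FROM sourceName
) AS subq;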

(2) The Full Interface

The full interface additionally supports analysis of the heteroskedasticity of the linear fit.

SELECT madlib.linregr_train (
'source_table',        -- name of input table, VARCHAR
'out_table',           -- name of output table, VARCHAR
'dependent_varname',   -- dependent variable, VARCHAR
'independent_varname', -- independent variable, VARCHAR
[group_cols,           -- names of columns to group by, VARCHAR[].
-- Default value: Null
[heteroskedasticity_option]] -- whether to analyze
--    heteroskedasticity,
-- BOOLEAN. Default value: False
);


Here, 'independent_varname' can be the name of a column that contains an array of numeric values. It can also be a string of the form 'array[1, x1, x2, x3]', where x1, x2, and x3 are column names.
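
For illustration, both forms below are valid calls (the table and variable names are the placeholders from the signature above, and out_table_a / out_table_b are hypothetical output table names):

-- 'independent_varname' names an array-valued column:
SELECT madlib.linregr_train('source_table', 'out_table_a',
                            'dependent_varname', 'independent_varname');

-- 'independent_varname' assembles the array from columns x1, x2, x3:
SELECT madlib.linregr_train('source_table', 'out_table_b',
                            'dependent_varname', 'array[1, x1, x2, x3]');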

Output is stored in the out_table:

[ group_col_1 | group_col_2 | ... | ] coef | r2 | std_err | t_stats | p_values | condition_no [ | bp_stats | bp_p_value ]

The first part

[ group_col_1 | group_col_2 | ... | ]

is present only when group_cols is not Null. The last part

[ bp_stats | bp_p_value ]

is present only when heteroskedasticity_option is True.

When group_cols is given, the data are grouped by the given columns and a separate linear model is fit to each group. The output table then contains one row for each combination of values of the grouping columns, holding the regression results for that group.

When heteroskedasticity_option is True, the output has additional columns containing the Breusch–Pagan test [4] statistic and the corresponding $$p$$-value.
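
For orientation, a standard (studentized) form of the test statistic [4] regresses the squared residuals $$\hat u_i^2$$ of the fit on the independent variables; with $$R^2_{aux}$$ denoting the coefficient of determination of that auxiliary regression,

$BP = n \cdot R^2_{aux} \,,$

which is asymptotically $$\chi^2$$-distributed with $$k - 1$$ degrees of freedom under the null hypothesis of homoskedasticity. The module reports the statistic as bp_stats and its $$p$$-value as bp_p_value.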

Examples:

The following example is taken from http://www.stat.columbia.edu/~martin/W2110/SAS_7.pdf.

1. Create the sample data set:
sql> CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
size INT, lot INT);
sql> COPY houses FROM STDIN WITH DELIMITER '|';
1 |  590 |       2 |    1 |  50000 |  770 | 22100
2 | 1050 |       3 |    2 |  85000 | 1410 | 12000
3 |   20 |       3 |    1 |  22500 | 1060 |  3500
4 |  870 |       2 |    2 |  90000 | 1300 | 17500
5 | 1320 |       3 |    2 | 133000 | 1500 | 30000
6 | 1350 |       2 |    1 |  90500 |  820 | 25700
7 | 2790 |       3 |  2.5 | 260000 | 2130 | 25000
8 |  680 |       2 |    1 | 142500 | 1170 | 22000
9 | 1840 |       3 |    2 | 160000 | 1500 | 19000
10 | 3680 |       4 |    2 | 240000 | 2790 | 20000
11 | 1660 |       3 |    1 |  87000 | 1030 | 17500
12 | 1620 |       3 |    2 | 118600 | 1250 | 20000
13 | 3100 |       3 |    2 | 140000 | 1760 | 38000
14 | 2070 |       2 |    3 | 148000 | 1550 | 14000
15 |  650 |       3 |  1.5 |  65000 | 1450 | 12000
\.

2. You can call the linregr() function for an individual metric:
sql> SELECT (linregr(price, array[1, bedroom, bath, size])).coef FROM houses;
coef
------------------------------------------------------------------------
{27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).r2 FROM houses;
r2
-------------------
0.745374010140315
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).std_err FROM houses;
std_err
----------------------------------------------------------------------
{56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).t_stats FROM houses;
t_stats
------------------------------------------------------------------------
{0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).p_values FROM houses;
p_values
-----------------------------------------------------------------------------
{0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
(1 row)

3. Alternatively you can call the linregr() function for the full record:
sql> \x on
Expanded display is on.
sql> SELECT (r).* FROM (SELECT linregr(price, array[1, bedroom, bath, size])
AS r FROM houses) q;
-[ RECORD 1 ]+-----------------------------------------------------------------
coef         | {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
r2           | 0.745374010140315
std_err      | {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
t_stats      | {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
p_values     | {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
condition_no | 9783.018
4. You can call the linregr_train() function for more functionality:
sql> SELECT madlib.linregr_train('houses', 'result', 'price',
'array[1, tax, bath, size]',
'{bedroom}'::varchar[], True);

sql> SELECT * from result;
-[ RECORD 1]---------+-------------------------------------------------------
bedroom             | 2
coef                | {-84242.0345, 55.4430, -78966.9754, 225.6119}
r2                  | 0.9688
std_err             | {35019.00, 19.57, 23036.81, 49.04}
t_stats             | {-2.406, 2.833, -3.428, 4.600}
p_values            | {0.251, 0.216, 0.181, 0.136}
condition_no        | 10086.1
bp_stats            | 2.5451
bp_p_value          | 0.4672

-[ RECORD 2]---------+------------------------------------------------------
bedroom             | 3
coef                | {-88155.8292502747,27.1966436293179,41404.0293389239,62.6375210724027}
r2                  | 0.841699901312963
std_err             | {57867.9999699512,17.82723091538,43643.1321521931,70.8506824870639}
t_stats             | {-1.52339512850022,1.52556747362568,0.948695185179172,0.884077878626493}
p_values            | {0.18816143289241,0.187636685729725,0.38634003235866,0.417132778730133}
condition_no        | 11722.62
bp_stats            | 6.7538
bp_p_value          | 0.08017

-[ RECORD 3]---------+-------------------------------------------------------
bedroom             | 4
coef                | {0.0112536020318378,41.4132554771633,0.0225072040636757,31.3975496688276}
r2                  | 1
std_err             | {0,0,0,0}
t_stats             | {Infinity,Infinity,Infinity,Infinity}
p_values            | Null
condition_no        | Null
bp_stats            | Null
bp_p_value          | Null
Literature:

[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 21 October 2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/17/lecture-17.pdf

[2] Wikipedia: Condition Number, http://en.wikipedia.org/wiki/Condition_number

[3] Wikipedia: Spectral Norm, http://en.wikipedia.org/wiki/Spectral_norm#Spectral_norm

[4] Wikipedia: Breusch–Pagan test, http://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test

[5] Wikipedia: Heteroscedasticity-consistent standard errors, http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors