MADlib 1.0 User Documentation
Ordinary least-squares (OLS) linear regression refers to a stochastic model in which the conditional mean of the dependent variable (usually denoted \( Y \)) is an affine function of the vector of independent variables (usually denoted \( \boldsymbol x \)). That is,
\[ E[Y \mid \boldsymbol x] = \boldsymbol c^T \boldsymbol x \]
for some unknown vector of coefficients \( \boldsymbol c \). The assumption is that the residuals are i.i.d. Gaussian. That is, the (conditional) probability density of \( Y \) is given by
\[ f(y \mid \boldsymbol x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \exp\left(-\frac{1}{2 \sigma^2} \cdot (y - \boldsymbol x^T \boldsymbol c)^2 \right) \,. \]
OLS linear regression finds the vector of coefficients \( \boldsymbol c \) that maximizes the likelihood of the observations.
Let \( X = (\boldsymbol x_1, \dotsc, \boldsymbol x_n)^T \) denote the design matrix whose rows are the observed independent-variable vectors, and \( \boldsymbol y = (y_1, \dotsc, y_n)^T \) the vector of observed dependent variables.
Maximizing the likelihood is equivalent to maximizing the log-likelihood \( \sum_{i=1}^n \log f(y_i \mid \boldsymbol x_i) \), which simplifies to minimizing the residual sum of squares \( RSS \) (also called sum of squared residuals or sum of squared errors of prediction),
\[ RSS = \sum_{i=1}^n ( y_i - \boldsymbol c^T \boldsymbol x_i )^2 = (\boldsymbol y - X \boldsymbol c)^T (\boldsymbol y - X \boldsymbol c) \,. \]
The first-order conditions yield that the \( RSS \) is minimized at
\[ \boldsymbol c = (X^T X)^+ X^T \boldsymbol y \,. \]
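As an illustration only (not MADlib's implementation, which runs inside the database and uses the pseudoinverse for rank-deficient cases), the normal equations \( (X^T X) \boldsymbol c = X^T \boldsymbol y \) can be solved directly for the simple case of an intercept plus one predictor. The function name and data below are invented for this sketch:

```python
# Sketch: solve the 2x2 normal equations (X^T X) c = X^T y
# for a regression y = c0 + c1*x with an intercept column of ones.
# Hypothetical helper; all data are made up for illustration.

def ols_coefficients(xs, ys):
    """Fit y = c0 + c1*x by solving the normal equations."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]],  X^T y = [sy, sxy]
    det = n * sxx - sx * sx   # assumed nonsingular here; MADlib falls
                              # back on the pseudoinverse (X^T X)^+
    c0 = (sxx * sy - sx * sxy) / det
    c1 = (n * sxy - sx * sy) / det
    return c0, c1

# Points lying exactly on y = 1 + 2x recover c = (1, 2):
print(ols_coefficients([0, 1, 2], [1, 3, 5]))  # -> (1.0, 2.0)
```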
Computing the total sum of squares \( TSS \), the explained sum of squares \( ESS \) (also called the regression sum of squares), and the coefficient of determination \( R^2 \) is done according to the following formulas:
\begin{align*} ESS & = \boldsymbol y^T X \boldsymbol c - \frac{ \left( \sum_{i=1}^n y_i \right)^2 }{n} \\ TSS & = \sum_{i=1}^n y_i^2 - \frac{ \left( \sum_{i=1}^n y_i \right)^2 }{n} \\ R^2 & = \frac{ESS}{TSS} \end{align*}
Note: The last equality follows from the definition \( R^2 = 1 - \frac{RSS}{TSS} \) and the fact that for linear regression \( TSS = RSS + ESS \). A proof of the latter can be found, e.g., at: http://en.wikipedia.org/wiki/Sum_of_squares
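The decomposition \( TSS = RSS + ESS \) can be checked numerically on a toy data set. The following plain-Python sketch (all numbers invented) fits a line and evaluates the formulas above, including \( ESS = \boldsymbol y^T X \boldsymbol c - (\sum_i y_i)^2 / n \):

```python
# Sketch: verify TSS = RSS + ESS and R^2 = ESS/TSS on made-up data.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 1.0, 2.0]
n = len(xs)

# Fit y = c0 + c1*x via the normal equations.
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
det = n * sxx - sx * sx
c0 = (sxx * sy - sx * sxy) / det
c1 = (n * sxy - sx * sy) / det

fitted = [c0 + c1 * x for x in xs]
rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
tss = sum(y * y for y in ys) - sy ** 2 / n
# ESS via the formula above: y^T X c - (sum y)^2 / n,
# where y^T X = (sum y, sum x*y) for the design matrix [1, x].
ess = sy * c0 + sxy * c1 - sy ** 2 / n
r2 = ess / tss
# tss == rss + ess holds up to floating-point rounding
```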
We estimate the variance \( Var[Y - \boldsymbol c^T \boldsymbol x \mid \boldsymbol x] \) as
\[ \sigma^2 = \frac{RSS}{n - k} \,, \] where \( n \) is the number of observations and \( k \) the number of coefficients,
and compute the t-statistic for coefficient \( i \) as
\[ t_i = \frac{c_i}{\sqrt{\sigma^2 \cdot \left( (X^T X)^{-1} \right)_{ii} }} \,. \]
The \( p \)-value for coefficient \( i \) gives the probability of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \) ) is true. Letting \( F_\nu \) denote the cumulative distribution function of the Student's t-distribution with \( \nu \) degrees of freedom, the \( p \)-value for coefficient \( i \) is therefore
\[ p_i = \Pr(|T| \geq |t_i|) = 2 \cdot (1 - F_{n - k}( |t_i| )) \]
where \( T \) is a Student's t-distributed random variable with \( n - k \) degrees of freedom.
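These two formulas can be sketched together in plain Python. The data are made up, and the Student-t CDF is evaluated by crude trapezoidal integration of the density purely for illustration (a real implementation would use the incomplete beta function):

```python
import math

# Sketch: t-statistics and two-sided p-values for y = c0 + c1*x on toy data.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 1.0, 2.0]
n, k = len(xs), 2                       # k = number of coefficients

# Coefficients via the normal equations, then RSS and sigma^2 = RSS/(n-k).
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
det = n * sxx - sx * sx
c = [(sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det]
rss = sum((y - (c[0] + c[1] * x)) ** 2 for x, y in zip(xs, ys))
sigma2 = rss / (n - k)

# Diagonal of (X^T X)^{-1} for the 2x2 case, then t_i = c_i / stderr_i.
inv_diag = [sxx / det, n / det]
t_stats = [c[i] / math.sqrt(sigma2 * inv_diag[i]) for i in range(2)]

def t_sf(t, nu, steps=20000):
    """P(T >= t) for t >= 0: 0.5 minus the integral of the density on [0, t]."""
    norm = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    pdf = lambda x: norm * (1 + x * x / nu) ** (-(nu + 1) / 2)
    h = t / steps
    area = 0.5 * (pdf(0.0) + pdf(t)) * h + sum(pdf(i * h) for i in range(1, steps)) * h
    return 0.5 - area

# p_i = 2 * (1 - F_{n-k}(|t_i|))
p_values = [2 * t_sf(abs(t), n - k) for t in t_stats]
```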
The condition number [2] \( \kappa(X) = \|X\|_2 \cdot \|X^{-1}\|_2 \) is computed as the product of two spectral norms [3]. The spectral norm of a matrix \( X \) is its largest singular value, i.e., the square root of the largest eigenvalue of the positive-semidefinite matrix \( X^{*} X \):
\[ \|X\|_2 = \sqrt{\lambda_{\max}\left(X^{*}X\right)}\ , \]
where \(X^{*}\) is the conjugate transpose of \(X\). The condition number of a linear regression problem is a worst-case measure of how sensitive the result is to small perturbations of the input. A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
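For a real design matrix with two columns, the condition number reduces to the ratio of the largest and smallest singular values, which can be read off the eigenvalues of the 2x2 matrix \( X^T X \). A hedged sketch (hypothetical helper name, invented data):

```python
import math

# Sketch: kappa(X) = sigma_max / sigma_min for an n x 2 design matrix,
# via the closed-form eigenvalues of the symmetric 2x2 matrix X^T X.
# (For real X, the conjugate transpose X^* is just the transpose.)

def condition_number_2col(rows):
    """rows: list of (x0, x1) pairs forming the n x 2 matrix X."""
    a = sum(r[0] * r[0] for r in rows)   # (X^T X)_00
    b = sum(r[0] * r[1] for r in rows)   # (X^T X)_01 = (X^T X)_10
    d = sum(r[1] * r[1] for r in rows)   # (X^T X)_11
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    lam_max, lam_min = (tr + disc) / 2, (tr - disc) / 2
    return math.sqrt(lam_max / lam_min)  # ratio of singular values of X

# Design matrix [1, x] for x = 0..3:
X = [(1.0, x) for x in (0.0, 1.0, 2.0, 3.0)]
print(condition_number_2col(X))
```

Since the columns here (all-ones and 0..3) are far from collinear, the resulting condition number is small; near-duplicate columns would drive it up sharply.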
The training data is expected to be of the following form:
{TABLE|VIEW} sourceName ( ... dependentVariable FLOAT8, independentVariables FLOAT8[], ... )
(1) The Simple Interface
SELECT (madlib.linregr(dependentVariable, independentVariables)).* FROM sourceName;

Output:
 coef | r2 | std_err | t_stats | p_values | condition_no
------+----+---------+---------+----------+--------------
 ...
SELECT (madlib.linregr(dependentVariable, independentVariables)).coef FROM sourceName;
SELECT (lr).coef, (lr).r2, (lr).p_values FROM ( SELECT madlib.linregr(dependentVariable, independentVariables) AS lr FROM sourceName ) AS subq;
(2) The Full Interface
The full interface supports analysis of heteroskedasticity of the linear fit.
SELECT madlib.linregr_train (
    'source_table',                -- name of input table, VARCHAR
    'out_table',                   -- name of output table, VARCHAR
    'dependent_varname',           -- dependent variable, VARCHAR
    'independent_varname',         -- independent variable, VARCHAR
    [ group_cols,                  -- names of columns to group by, VARCHAR[].
                                   --   Default value: NULL
    [ heteroskedasticity_option ]] -- whether to analyze heteroskedasticity,
                                   --   BOOLEAN. Default value: FALSE
);
Here 'independent_varname' can be the name of a column containing an array of numeric values, or a string expression such as 'array[1, x1, x2, x3]', where x1, x2, and x3 are column names.
The output is stored in out_table with the following columns:

[ group_col_1 | group_col_2 | ... |] coef | r2 | std_err | t_stats | p_values | condition_no [| bp_stats | bp_p_value ]

The first part, [ group_col_1 | group_col_2 | ... |], is present only when group_cols is not NULL. The last part, [ bp_stats | bp_p_value ], is present only when heteroskedasticity_option is TRUE.
When group_cols is given, the data are grouped by those columns and a separate linear model is fitted to each group. The output contains one row per combination of group_cols values, showing the regression result for that group.
When heteroskedasticity_option is True, the output has additional columns: the function computes the Breusch–Pagan test [4] statistic and the corresponding \(p\)-value.
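The Breusch–Pagan statistic can be sketched in plain Python for the one-regressor case: regress the squared OLS residuals on the regressor and take \( n R^2 \) of that auxiliary regression; under homoskedasticity the statistic is asymptotically chi-squared with one degree of freedom. This is the studentized (Koenker) form of the test; MADlib's exact variant is not restated here, and all data below are invented:

```python
import math

# Sketch of the Breusch-Pagan test (Koenker n*R^2 form) on made-up data.

def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1*x via the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 4.0]
n = len(xs)

# Main regression and its squared residuals.
c0, c1 = fit_line(xs, ys)
u2 = [(y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys)]

# Auxiliary regression of u^2 on x; bp_stat = n * R^2 of this fit.
a0, a1 = fit_line(xs, u2)
mean_u2 = sum(u2) / n
tss = sum((u - mean_u2) ** 2 for u in u2)
rss = sum((u - (a0 + a1 * x)) ** 2 for x, u in zip(xs, u2))
bp_stat = n * (1 - rss / tss)

# Upper tail of chi^2 with 1 d.o.f.: 1 - CDF(x) = 1 - erf(sqrt(x/2)).
bp_p_value = 1 - math.erf(math.sqrt(bp_stat / 2))
```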
The following example is taken from http://www.stat.columbia.edu/~martin/W2110/SAS_7.pdf.
sql> CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT,
                          price INT, size INT, lot INT);
sql> COPY houses FROM STDIN WITH DELIMITER '|';
 1 |  590 | 2 | 1   |  50000 |  770 | 22100
 2 | 1050 | 3 | 2   |  85000 | 1410 | 12000
 3 |   20 | 3 | 1   |  22500 | 1060 |  3500
 4 |  870 | 2 | 2   |  90000 | 1300 | 17500
 5 | 1320 | 3 | 2   | 133000 | 1500 | 30000
 6 | 1350 | 2 | 1   |  90500 |  820 | 25700
 7 | 2790 | 3 | 2.5 | 260000 | 2130 | 25000
 8 |  680 | 2 | 1   | 142500 | 1170 | 22000
 9 | 1840 | 3 | 2   | 160000 | 1500 | 19000
10 | 3680 | 4 | 2   | 240000 | 2790 | 20000
11 | 1660 | 3 | 1   |  87000 | 1030 | 17500
12 | 1620 | 3 | 2   | 118600 | 1250 | 20000
13 | 3100 | 3 | 2   | 140000 | 1760 | 38000
14 | 2070 | 2 | 3   | 148000 | 1550 | 14000
15 |  650 | 3 | 1.5 |  65000 | 1450 | 12000
\.
sql> SELECT (linregr(price, array[1, bedroom, bath, size])).coef FROM houses;
                                  coef
------------------------------------------------------------------------
 {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).r2 FROM houses;
        r2
-------------------
 0.745374010140315
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).std_err FROM houses;
                               std_err
----------------------------------------------------------------------
 {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).t_stats FROM houses;
                                t_stats
------------------------------------------------------------------------
 {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
(1 row)

sql> SELECT (linregr(price, array[1, bedroom, bath, size])).p_values FROM houses;
                                  p_values
-----------------------------------------------------------------------------
 {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
(1 row)
sql> \x on
Expanded display is on.
sql> SELECT (r).* FROM (SELECT linregr(price, array[1, bedroom, bath, size]) AS r FROM houses) q;
-[ RECORD 1 ]+-----------------------------------------------------------------------------
coef         | {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
r2           | 0.745374010140315
std_err      | {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
t_stats     | {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
p_values     | {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
condition_no | 9783.018
sql> SELECT madlib.linregr_train('houses', 'result', 'price',
                                 'array[1, tax, bath, size]',
                                 '{bedroom}'::varchar[], True);
sql> SELECT * FROM result;
-[ RECORD 1 ]+-------------------------------------------------------------------------
bedroom      | 2
coef         | {-84242.0345, 55.4430, -78966.9754, 225.6119}
r2           | 0.9688
std_err      | {35019.00, 19.57, 23036.81, 49.04}
t_stats      | {-2.406, 2.833, -3.428, 4.600}
p_values     | {0.251, 0.216, 0.181, 0.136}
condition_no | 10086.1
bp_stats     | 2.5451
bp_p_value   | 0.4672
-[ RECORD 2 ]+-------------------------------------------------------------------------
bedroom      | 3
coef         | {-88155.8292502747,27.1966436293179,41404.0293389239,62.6375210724027}
r2           | 0.841699901312963
std_err      | {57867.9999699512,17.82723091538,43643.1321521931,70.8506824870639}
t_stats      | {-1.52339512850022,1.52556747362568,0.948695185179172,0.884077878626493}
p_values     | {0.18816143289241,0.187636685729725,0.38634003235866,0.417132778730133}
condition_no | 11722.62
bp_stats     | 6.7538
bp_p_value   | 0.08017
-[ RECORD 3 ]+-------------------------------------------------------------------------
bedroom      | 4
coef         | {0.0112536020318378,41.4132554771633,0.0225072040636757,31.3975496688276}
r2           | 1
std_err      | {0,0,0,0}
t_stats      | {Infinity,Infinity,Infinity,Infinity}
p_values     | Null
condition_no | Null
bp_stats     | Null
bp_p_value   | Null
[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 21 October 2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/17/lecture-17.pdf
[2] Wikipedia: Condition Number, http://en.wikipedia.org/wiki/Condition_number.
[3] Wikipedia: Spectral Norm, http://en.wikipedia.org/wiki/Spectral_norm#Spectral_norm
[4] Wikipedia: Breusch–Pagan test, http://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test
[5] Wikipedia: Heteroscedasticity-consistent standard errors, http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors