MADlib  1.3 A newer version is available User Documentation Linear Regression
Contents

Ordinary Least Squares Regression, also called Linear Regression, is a statistical model used to fit linear models.

It models a linear relationship of a scalar dependent variable $$y$$ to one or more explanatory independent variables $$x$$ to build a model of coefficients.

Training Function

The linear regression training function has the following syntax.

linregr_train( source_table,
out_table,
dependent_varname,
independent_varname,
input_group_cols,
heteroskedasticity_option
)


Arguments

source_table

TEXT. The name of the table containing the training data.

out_table

TEXT. Name of the generated table containing the output model.

The output table contains the following columns.

<...> Any grouping columns provided during training. Present only if the grouping option is used. FLOAT8[]. Vector of the coefficients of the regression. FLOAT8. R-squared coefficient of determination of the model. FLOAT8[]. Vector of the standard error of the coefficients. FLOAT8[]. Vector of the t-statistics of the coefficients. FLOAT8[]. Vector of the p-values of the coefficients. FLOAT8 array. The condition number of the $$X^{*}X$$ matrix. A high condition number is usually an indication that there may be some numeric instability in the result yielding a less reliable model. A high condition number often results when there is a significant amount of colinearity in the underlying design matrix, in which case other regression techniques, such as elastic net regression, may be more appropriate. FLOAT8. The Breush-Pagan statistic of heteroskedacity. Present only if the heteroskedacity argument was set to True when the model was trained. FLOAT8. The Breush-Pagan calculated p-value. Present only if the heteroskedacity parameter was set to True when the model was trained.

dependent_varname

TEXT. Expression to evaluate for the dependent variable.

independent_varname

TEXT. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list.

input_group_cols (optional)

TEXT, default: NULL. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is null, no grouping is used and a single result model is generated.

heteroskedasticity_option (optional)
BOOLEAN, default: FALSE. When TRUE, the heteroskedasticity of the model is also calculated and returned with the results.

Warning
The aggregate 'linregr' has been deprecated in favor of the function 'linregr_train'. If the aggregate 'linregr' is used to output the results of linear regression to a table, it is recommended to follow the general pattern shown below (replace text within '<...>' with the appropriate variable names).
CREATE TABLE <output table> AS
SELECT (r).*
FROM (
SELECT linregr(<dependent variable>, <independent variable>) as r
FROM <source table>
) q;


Prediction Function
linregr_predict( coef,
col_ind
)

Arguments
coef

FLOAT8[]. Vector of the coefficients of regression.

col_ind

FLOAT8[]. An array containing the independent variable column names.

Examples
1. Create an input data set.
CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
size INT, lot INT);
COPY houses FROM STDIN WITH DELIMITER '|';
1 |  590 |       2 |    1 |  50000 |  770 | 22100
2 | 1050 |       3 |    2 |  85000 | 1410 | 12000
3 |   20 |       3 |    1 |  22500 | 1060 |  3500
4 |  870 |       2 |    2 |  90000 | 1300 | 17500
5 | 1320 |       3 |    2 | 133000 | 1500 | 30000
6 | 1350 |       2 |    1 |  90500 |  820 | 25700
7 | 2790 |       3 |  2.5 | 260000 | 2130 | 25000
8 |  680 |       2 |    1 | 142500 | 1170 | 22000
9 | 1840 |       3 |    2 | 160000 | 1500 | 19000
10 | 3680 |       4 |    2 | 240000 | 2790 | 20000
11 | 1660 |       3 |    1 |  87000 | 1030 | 17500
12 | 1620 |       3 |    2 | 118600 | 1250 | 20000
13 | 3100 |       3 |    2 | 140000 | 1760 | 38000
14 | 2070 |       2 |    3 | 148000 | 1550 | 14000
15 |  650 |       3 |  1.5 |  65000 | 1450 | 12000
\.

2. Train a regression model. First, a single regression for all the data.
SELECT madlib.linregr_train( 'houses',
'houses_linregr',
'price',
'ARRAY[1, tax, bath, size]'
);

3. Generate three output models, one for each value of "bedroom".
SELECT madlib.linregr_train( 'houses',
'houses_linregr_bedroom',
'price',
'ARRAY[1, tax, bath, size]',
'bedroom'
);

4. Examine the resulting models.
-- Set extended display on for easier reading of output
\x ON
SELECT * FROM houses_linregr;

Result:
-[ RECORD 1 ]+---------------------------------------------------------------------------
coef         | {-12849.4168959872,28.9613922651765,10181.6290712648,50.516894915354}
r2           | 0.768577580597443
std_err      | {33453.0344331391,15.8992104963997,19437.7710925923,32.928023174087}
t_stats      | {-0.38410317968819,1.82156166004184,0.523806408809133,1.53416118083605}
p_values     | {0.708223134615422,0.0958005827189772,0.610804093526536,0.153235085548186}
condition_no | 9002.50457085737

5. View the results grouped by bedroom.
SELECT * FROM houses_linregr_bedroom;

Result:
-[ RECORD 1 ]+--------------------------------------------------------------------------
bedroom      | 2
coef         | {-84242.0345406597,55.4430144648696,-78966.9753675319,225.611910021192}
r2           | 0.968809546465313
std_err      | {35018.9991665742,19.5731125320686,23036.8071292552,49.0448678148784}
t_stats      | {-2.40560942761235,2.83261103077151,-3.42786111480046,4.60011251070697}
p_values     | {0.250804617665239,0.21605133377602,0.180704400437373,0.136272031474122}
condition_no | 10086.1048721726
-[ RECORD 2 ]+--------------------------------------------------------------------------
bedroom      | 4
coef         | {0.0112536020318378,41.4132554771633,0.0225072040636757,31.3975496688276}
r2           | 1
std_err      | {0,0,0,0}
t_stats      | {Infinity,Infinity,Infinity,Infinity}
p_values     |
condition_no | Infinity
-[ RECORD 3 ]+--------------------------------------------------------------------------
bedroom      | 3
coef         | {-88155.8292501601,27.1966436294429,41404.0293363612,62.637521075324}
r2           | 0.841699901311252
std_err      | {57867.9999702625,17.8272309154689,43643.1321511114,70.8506824863954}
t_stats      | {-1.52339512849005,1.52556747362508,0.948695185143966,0.884077878676067}
p_values     | {0.188161432894871,0.187636685729869,0.386340032374927,0.417132778705789}
condition_no | 11722.6225642147

Alternatively you can unnest the results for easier reading of output.
\x OFF
SELECT unnest(ARRAY['intercept','tax','bath','size']) as attribute,
unnest(coef) as coefficient,
unnest(std_err) as standard_error,
unnest(t_stats) as t_stat,
unnest(p_values) as pvalue
FROM houses_linregr;

6. Use the prediction function to evaluate residuals.
SELECT houses.*,
madlib.linregr_predict( ARRAY[1,tax,bath,size],
m.coef
) as predict,
price -
madlib.linregr_predict( ARRAY[1,tax,bath,size],
m.coef
) as residual
FROM houses, houses_linregr m;