User Documentation
 All Files Functions Groups
pca.sql_in
Go to the documentation of this file.
1 /* ----------------------------------------------------------------------- *//**
2  *
3  * @file pca.sql_in
4  *
5  * @brief Principal Component Analysis
6  *
7  * @sa For a brief introduction to Principal Component Analysis, see the module
8  * description \ref grp_pca.
9  *
10  *//* ----------------------------------------------------------------------- */
11 
12 m4_include(`SQLCommon.m4')
13 
14 /**
15 
16 @addtogroup grp_pca_train
17 
18 <div class ="toc"><b>Contents</b>
19 <ul>
20 <li class="level1"><a href="#pca_train">About</a></li>
21 <li class="level1"><a href="#help">Online Help</a></li>
22 <li class="level1"><a href="#train">Training Function</a></li>
23 <li class="level1"><a href="#output">Output Tables</a></li>
24 <li class="level1"><a href="#examples">Examples</a></li>
25 <li class="level1"><a href="#seealso">See Also</a></li>
26 <li class="level1"><a href="#background_pca">Technical Background</a></li>
27 <li class="level1"><a href="#literature">Literature</a></li>
28 </ul>
29 </div>
30 
31 @anchor pca_train
32 @about
33 Principal component analysis (PCA) is a mathematical procedure that uses an
34 orthogonal transformation to convert a set of observations of possibly
35 correlated variables into a set of values of linearly uncorrelated variables
36 called principal components. This transformation is defined in such a way that
37 the first principal component has the largest possible variance (i.e.,
38 accounts for as much of the variability in the data as possible), and each
39 succeeding component in turn has the highest variance possible under the
40 constraint that it be orthogonal to (i.e., uncorrelated with) the preceding
41 components.
42 
43 See the \ref background_pca "Technical Background" for an introduction to
44  principal component analysis and the implementation notes.
45 
46 @anchor help
47 @par Online Help
48 View short help messages using the following statements:
49 @verbatim
50 -- Summary of PCA projection
51 madlib.pca_train()
52 madlib.pca_train('?')
53 madlib.pca_train('help')
54 
55 -- Training function syntax and output table format
56 madlib.pca_train('usage')
57 
58 -- Summary of PCA projection with sparse matrices
59 madlib.pca_sparse_train()
60 madlib.pca_sparse_train('?')
61 madlib.pca_sparse_train('help')
62 
63 -- Training function syntax and output table format
64 madlib.pca_sparse_train('usage')
65 @endverbatim
66 
67 @anchor train
68 @par Training Function
69 The training functions have the following formats:
70 @verbatim
71 pca_project( source_table, out_table, row_id,
72  k, grouping_cols:= NULL,
73  lanczos_iter := min(k+40, <smallest_matrix_dimension>),
74  use_correlation := False, result_summary_table := NULL)
75 @endverbatim
76 and
77 @verbatim
78 pca_sparse_project(source_table, out_table,
79  row_id, col_id, val_id, row_dim, col_dim, k,
80  grouping_cols := NULL,
81  lanczos_iter := min(k+40, <smallest_matrix_dimension>),
82  use_correlation := False, result_summary_table := NULL)
83 @endverbatim
84 
85 \note Because of the centering step in PCA (see
86 \ref background_pca "Technical Background"), sparse matrices almost always
87 become dense during the training process. Thus, this implementation
88 automatically densifies sparse matrix input, and there should be no expected
89  performance improvement in using sparse matrix input over dense matrix input.
90 
91 @par Arguments
92 \par
93 <DL class="arglist">
94 <DT>source_table</DT>
95 <DD>Text value. Name of the input table containing the data for PCA training.
96 The input data matrix should have \f$ N \f$ rows
97 and \f$ M \f$ columns, where \f$ N \f$ is the number of data points, and \f$ M
98 \f$ is the number of features for each data point.
99 
100 A dense input table is expected to be in the one of the
101 two standard MADlib dense matrix formats, and a sparse input table
102  should be in the standard MADlib sparse matrix format.
103 
104 The two standard MADlib dense matrix formats are
105 <pre>{TABLE|VIEW} <em>source_table</em> (
106  <em>row_id</em> INTEGER,
107  row_vec FLOAT8[],
108 )</pre>
109 and
110 <pre>{TABLE|VIEW} <em>source_table</em> (
111  <em>row_id</em> INTEGER,
112  col1 FLOAT8,
113  col2 FLOAT8,
114  ...
115 )</pre>
116 
117 Note that the column name <em>row_id</em> is taken as an input parameter,
118  and should contain a list of row indices (starting at 0) for the input matrix.
119 
120 The input table for sparse PCA is expected to be in the form:
121 
122 <pre>{TABLE|VIEW} <em>source_table</em> (
123  ...
124  <em>row_id</em> INTEGER,
125  <em>col_id</em> INTEGER,
126  <em>val_id</em> FLOAT8,
127  ...
128 )</pre>
129 
130 The <em>row_id</em> and <em>col_id</em> columns specify which entries
131 in the matrix are nonzero, and the <em>val_id</em> column defines the values
132 of the nonzero entries.
133 </DD>
134 
135 <DT>out_table</DT>
136 <DD>Text value. Name of the table that will contain the principal components of the input data.</DD>
137 
138 <DT>row_id</DT>
139 <DD>Text value. Column name containing the row IDs in the input source table.</DD>
140 
141 <DT>col_id</DT>
142 <DD>Text value. Name of 'col_id' column in sparse matrix representation (sparse matrices only). </DD>
143 
144 <DT>val_id</DT>
145 <DD>Text value. Name of 'val_id' column in sparse matrix representation (sparse matrices only). </DD>
146 
147 <DT>row_dim</DT>
148 <DD>Integer value. The number of rows in the sparse matrix (sparse matrices only). </DD>
149 
150 <DT>col_dim</DT>
151 <DD>Integer value. The number of columns in the sparse matrix (sparse matrices only). </DD>
152 
153 <DT>k</DT>
154 <DD>Integer value. The number of principal components to calculate from the input data. </DD>
155 
156 <DT>grouping_cols</DT>
157 <DD>Text value. Currently <em>grouping_cols</em> is present as a placeholder for forward
158  compatibility. The parameter is planned to be implemented as a comma-separated
159  list of column names, with the source data grouped using the combination of all the columns.
160  An independent PCA model will be computed for each combination of the grouping columns. Default: NULL.</DD>
161 
162 <DT>lanczos_iter</DT>
163 <DD>Integer value. The number of Lanczos iterations for the SVD calculation.
164 The Lanczos iteration number roughly corresponds to the accuracy of the SVD
165 calculation, and a higher iteration number corresponds to greater accuracy
166 but longer computation time. The number of iterations must be at least as
167 large as the value of <em>k</em>, but no larger than the smallest dimension
168  of the matrix. If the iteration number is given as zero, then the default
169  number of iterations is used.
170  Default: minimum of {k+40, smallest matrix dimension}.</DD>
171 
172 <DT>use_correlation</DT>
173 <DD>Boolean value. Whether to use the correlation matrix for calculating the principal components instead of the covariance matrix. Currently
174 <em>use_correlation</em> is a placeholder for forward compatibility, and
175 this value must be set to false. Default: False. </DD>
176 
177 <DT>result_summary_table</DT>
178 <DD>Text value. Name of the optional summary table. Default: NULL.</DD>
179 </DL>
180 
181 @anchor output
182 @par Output Tables
183 
184 The output is divided into three tables (one of which is optional).
185 The output table (<em>'out_table'</em> above) encodes the principal components with the
186 
187 <em>k</em> highest eigenvalues. The table has the following columns:
188 \par
189 <DL class="arglist">
190 <DT>row_id</DT>
191 <DD>Eigenvalue rank in descending order of the eigenvalue size.</DD>
192 
193 <DT>principal_components</DT>
194 <DD>Vectors containing elements of the principal components.</DD>
195 
196 <DT>eigen_values</DT>
197 <DD>The eigenvalues associated with each principal component.</DD>
198 </DL>
199 
200 In addition to the output table, a table containing the column means is also generated.
201 This table has the same name as the output table, with the string "_mean" appended to the end.
202 This table has only one column:
203 \par
204 <DL class="arglist">
205 <DT>column_mean</DT>
206 <DD> A vector containing the column means for the input matrix.</DD>
207 </DL>
208 
209 The optional summary table contains information about the performance of the PCA.
210 This table has the following columns:
211 \par
212 <DL class="arglist">
213 <DT>rows_used</DT>
214 <DD> Number of data points in the input.</DD>
215 <DT>exec_time (ms)</DT>
216 <DD>Number of milliseconds for the PCA calculation to run.</DD>
217 <DT>iter</DT>
218 <DD>Number of iterations used in the SVD calculation. </DD>
219 <DT>recon_error</DT>
220 <DD>The absolute error in the SVD approximation. </DD>
221 <DT>relative_recon_error</DT>
222 <DD>The relative error in the SVD approximation. </DD>
223 <DT>use_correlation</DT>
224 <DD>Indicates if the correlation matrix was used. </DD>
225 </DL>
226 
227 @anchor examples
228 @examp
229 -# Create the sample data.
230 @verbatim
231 sql> DROP TABLE IF EXISTS mat;
232 CREATE TABLE mat (
233  row_id integer,
234  row_vec double precision[]
235 );
236 
237 sql> COPY mat (row_id, row_vec) FROM stdin;
238 0 {1,2,3}
239 1 {2,1,2}
240 2 {3,2,1}
241 \.
242 
243 @endverbatim
244 -# Run the PCA function:
245 @verbatim
246 sql> drop table result_table;
247 sql> select pca_train(
248  'mat', -- name of the input table
249  'result_table', -- name of the output table
250  'row_id', -- column containing the matrix indices
251  3 -- Number of PCA components to compute
252 );
253 @endverbatim
254 -# View the PCA results:
255 @verbatim
256 sql> SELECT * from result_table;
257  row_id | principal_components | eigen_values
258 --------+--------------------------------------------------------------+----------------------
259  0 | {0.707106781186547,0.408248290459781,-0.577350269192513} | 2
260  2 | {-0.707106781186547,0.408248290459781,-0.577350269192512} | 1.26294130828989e-08
261  1 | {2.08166817117217e-17,-0.816496580931809,-0.577350269183852} | 0.816496580927726
262 
263 @endverbatim
264 
265 @anchor seealso
266 @sa File pca.sql_in documenting the SQL functions
267 @sa grp_pca_project
268 
269 @anchor background_pca
270 @par Technical Background
271 
272 The PCA implemented here uses an SVD decomposition implementation to recover
273 the principle components (as opposed to the directly computing the eigenvectors
274  of the covariance matrix). Let \f$ \boldsymbol X \f$ be the data matrix, and
275  let \f$ \hat{x} \f$ be a vector of the column averages of \f$ \boldsymbol{X}\f$.
276  PCA computes the matrix \f$ \hat{\boldsymbol X} \f$ as
277 \f[
278 \hat{\boldsymbol X} = {\boldsymbol X} - \vec{e} \hat{x}^T
279 \f]
280 where \f$ \vec{e} \f$ is the vector of all ones.
281 
282 PCA then computes the SVD matrix factorization
283  \f[
284 \hat{\boldsymbol X} = {\boldsymbol U}{\boldsymbol \Sigma}{\boldsymbol V}^T
285 \f]
286 where \f$ {\boldsymbol \Sigma} \f$ is a diagonal matrix. The eigenvalues are
287 recovered as the entries of \f$ {\boldsymbol \Sigma}/(\sqrt{N-1}) \f$, and the principle
288 components are the rows of \f$ {\boldsymbol V} \f$.
289 
290 
291 It is important to note that the PCA implementation assumes that the user will
292  use only the principle components that have non-zero eigenvalues. The SVD
293  calculation is done with the Lanczos method, with does not guarantee
294  correctness for singular vectors with zero-valued eigenvalues. Consequently,
295  principle components with zero-valued eigenvalues are not guaranteed to be correct.
296  Generally, this will not be problem unless the user wants to use the
297  principle components for the entire eigenspectrum.
298 
299 
300 
301 
302 @anchor literature
303 @literature
304 
305 [1] Principal Component Analysis. http://en.wikipedia.org/wiki/Principal_component_analysis
306 
307 [2] Shlens, Jonathon (2009), A Tutorial on Principal Component Analysis
308 
309 **/
310 
311 -- -----------------------------------------------------------------------
312 -- PCA for Dense matrices
313 -- -----------------------------------------------------------------------
314 /*
315 @brief Compute principal components for a dense matrix stored in a
316  database table
317 */
318 CREATE OR REPLACE FUNCTION
319 MADLIB_SCHEMA.pca_train(
320  source_table TEXT, -- Source table name (dense matrix)
321  pc_table TEXT, -- Output table name for the principal components
322  row_id TEXT, -- Column name for the ID for each row
323  k INTEGER, -- Number of principal components to compute
324  grouping_cols TEXT, -- Comma-separated list of grouping columns (Default: NULL)
325  lanczos_iter INTEGER, -- The number of Lanczos iterations for the SVD calculation (Default: min(k+40, smallest Matrix dimension))
326  use_correlation BOOLEAN, -- If True correlation matrix is used for principal components (Default: False)
327  result_summary_table TEXT -- Table name to store summary of results (Default: NULL)
328 )
329 RETURNS VOID AS $$
330 PythonFunction(pca, pca, pca)
331 $$ LANGUAGE plpythonu;
332 
333 -- Overloaded functions for optional parameters
334 -- -----------------------------------------------------------------------
335 
336 
337 CREATE OR REPLACE FUNCTION
338 MADLIB_SCHEMA.pca_train(
339  source_table TEXT, -- Source table name (dense matrix)
340  pc_table TEXT, -- Output table name for the principal components
341  row_id TEXT, -- Column name for the ID for each row
342  k INTEGER,-- Number of principal components to compute
343  grouping_cols TEXT, -- Comma-separated list of grouping columns
344  lanczos_iter INTEGER,-- The number of Lanczos iterations for the SVD calculation
345  use_correlation BOOLEAN -- If True correlation matrix is used for principal components
346 )
347 RETURNS VOID AS $$
348  SELECT MADLIB_SCHEMA.pca_train($1, $2, $3, $4, $5, $6, $7, NULL)
349 $$ LANGUAGE SQL;
350 
351 CREATE OR REPLACE FUNCTION
352 MADLIB_SCHEMA.pca_train(
353  source_table TEXT, -- Source table name (dense matrix)
354  pc_table TEXT, -- Output table name for the principal components
355  row_id TEXT, -- Column name for the ID for each row
356  k INTEGER,-- Number of principal components to compute
357  grouping_cols TEXT, -- Comma-separated list of grouping columns
358  lanczos_iter INTEGER -- The number of Lanczos iterations for the SVD calculation
359 )
360 RETURNS VOID AS $$
361  SELECT MADLIB_SCHEMA.pca_train($1, $2, $3, $4, $5, $6, False , NULL)
362 $$ LANGUAGE SQL;
363 
364 CREATE OR REPLACE FUNCTION
365 MADLIB_SCHEMA.pca_train(
366  source_table TEXT, -- Source table name (dense matrix)
367  pc_table TEXT, -- Output table name for the principal components
368  row_id TEXT, -- Column name for the ID for each row
369  k INTEGER,-- Number of principal components to compute
370  grouping_cols TEXT -- Comma-separated list of grouping columns
371 )
372 RETURNS VOID AS $$
373  SELECT MADLIB_SCHEMA.pca_train($1, $2, $3, $4, $5, 0, False , NULL)
374 $$ LANGUAGE SQL;
375 
376 
377 CREATE OR REPLACE FUNCTION
378 MADLIB_SCHEMA.pca_train(
379  source_table TEXT, -- Source table name (dense matrix)
380  pc_table TEXT, -- Output table name for the principal components
381  row_id TEXT, -- Column name for the ID for each row
382  k INTEGER -- Number of principal components to compute
383 )
384 RETURNS VOID AS $$
385  SELECT MADLIB_SCHEMA.pca_train($1, $2, $3, $4, NULL, 0, False, NULL)
386 $$ LANGUAGE SQL;
387 
388 
389 -- Information Functions
390 -- -----------------------------------------------------------------------
391 
392 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.pca_train(
393  usage_string VARCHAR -- usage string
394 )
395 RETURNS TEXT AS $$
396 PythonFunctionBodyOnly(`pca', `pca')
397  return pca.pca_help_message(schema_madlib, usage_string)
398 $$ LANGUAGE plpythonu;
399 
400 
401 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.pca_train()
402 RETURNS VARCHAR AS $$
403 BEGIN
404  RETURN MADLIB_SCHEMA.pca_train('');
405 END;
406 $$ LANGUAGE plpgsql VOLATILE;
407 
408 -- -----------------------------------------------------------------------
409 -- PCA for Sparse matrices
410 -- -----------------------------------------------------------------------
411 /*
412 @brief Compute principal components for a sparse matrix stored in a
413  database table
414 */
415 CREATE OR REPLACE FUNCTION
416 MADLIB_SCHEMA.pca_sparse_train(
417  source_table TEXT, -- Source table name (dense matrix)
418  pc_table TEXT, -- Output table name for the principal components
419  row_id TEXT, -- Name of 'row_id' column in sparse matrix representation
420  col_id TEXT, -- Name of 'col_id' column in sparse matrix representation
421  val_id TEXT, -- Name of 'val_id' column in sparse matrix representation
422  row_dim INTEGER, -- Number of rows in the sparse matrix
423  col_dim INTEGER, -- Number of columns in the sparse matrix
424  k INTEGER, -- Number of eigenvectors with dominant eigenvalues, sorted decreasingly
425  grouping_cols TEXT, -- Comma-separated list of grouping columns (Default: NULL)
426  lanczos_iter INTEGER, -- The number of Lanczos iterations for the SVD calculation (Default: min(k+40, smallest Matrix dimension))
427  use_correlation BOOLEAN, -- If True correlation matrix is used for principal components (Default: False)
428  result_summary_table TEXT -- Table name to store summary of results (Default: NULL)
429 )
430 RETURNS VOID AS $$
431 PythonFunction(pca, pca, pca_sparse)
432 $$ LANGUAGE plpythonu;
433 
434 
435 -- Overloaded functions for optional parameters
436 -- -----------------------------------------------------------------------
437 CREATE OR REPLACE FUNCTION
438 MADLIB_SCHEMA.pca_sparse_train(
439  source_table TEXT, -- Source table name (dense matrix)
440  pc_table TEXT, -- Output table name for the principal components
441  row_id TEXT, -- Column name for the ID for each row
442  col_id TEXT, -- Name of 'col_id' column in sparse matrix representation
443  val_id TEXT, -- Name of 'val_id' column in sparse matrix representation
444  row_dim INTEGER, -- Number of rows in the sparse matrix
445  col_dim INTEGER, -- Number of columns in the sparse matrix
446  k INTEGER, -- Number of principal components to compute
447  grouping_cols TEXT, -- Comma-separated list of grouping columns
448  lanczos_iter INTEGER, -- The number of Lanczos iterations for the SVD calculation
449  use_correlation BOOLEAN -- If True correlation matrix is used for principal components
450 )
451 RETURNS VOID AS $$
452  SELECT MADLIB_SCHEMA.pca_sparse_train($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, NULL)
453 $$ LANGUAGE SQL;
454 
455 CREATE OR REPLACE FUNCTION
456 MADLIB_SCHEMA.pca_sparse_train(
457  source_table TEXT, -- Source table name (dense matrix)
458  pc_table TEXT, -- Output table name for the principal components
459  row_id TEXT, -- Column name for the ID for each row
460  col_id TEXT, -- Name of 'col_id' column in sparse matrix representation
461  val_id TEXT, -- Name of 'val_id' column in sparse matrix representation
462  row_dim INTEGER, -- Number of rows in the sparse matrix
463  col_dim INTEGER, -- Number of columns in the sparse matrix
464  k INTEGER, -- Number of principal components to compute
465  grouping_cols TEXT, -- Comma-separated list of grouping columns
466  lanczos_iter INTEGER -- The number of Lanczos iterations for the SVD calculation
467 )
468 RETURNS VOID AS $$
469  SELECT MADLIB_SCHEMA.pca_sparse_train($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, False , NULL)
470 $$ LANGUAGE SQL;
471 
472 CREATE OR REPLACE FUNCTION
473 MADLIB_SCHEMA.pca_sparse_train(
474  source_table TEXT, -- Source table name (dense matrix)
475  pc_table TEXT, -- Output table name for the principal components
476  row_id TEXT, -- Column name for the ID for each row
477  col_id TEXT, -- Name of 'col_id' column in sparse matrix representation
478  val_id TEXT, -- Name of 'val_id' column in sparse matrix representation
479  row_dim INTEGER, -- Number of rows in the sparse matrix
480  col_dim INTEGER, -- Number of columns in the sparse matrix
481  k INTEGER, -- Number of principal components to compute
482  grouping_cols TEXT -- Comma-separated list of grouping columns
483 )
484 RETURNS VOID AS $$
485  SELECT MADLIB_SCHEMA.pca_sparse_train($1, $2, $3, $4, $5, $6, $7, $8, $9, 0, False , NULL)
486 $$ LANGUAGE SQL;
487 
488 CREATE OR REPLACE FUNCTION
489 MADLIB_SCHEMA.pca_sparse_train(
490  source_table TEXT, -- Source table name (dense matrix)
491  pc_table TEXT, -- Output table name for the principal components
492  row_id TEXT, -- Column name for the ID for each row
493  col_id TEXT, -- Name of 'col_id' column in sparse matrix representation
494  val_id TEXT, -- Name of 'val_id' column in sparse matrix representation
495  row_dim INTEGER, -- Number of rows in the sparse matrix
496  col_dim INTEGER, -- Number of columns in the sparse matrix
497  k INTEGER -- Number of principal components to compute
498 )
499 RETURNS VOID AS $$
500  SELECT MADLIB_SCHEMA.pca_sparse_train($1, $2, $3, $4, $5, $6, $7, $8, NULL, 0, False, NULL)
501 $$ LANGUAGE SQL;
502 
503 
504 -- -----------------------------------------------------------------------
505 -- Information Functions
506 -- -----------------------------------------------------------------------
507 
508 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.pca_sparse_train(
509  usage_string VARCHAR -- usage string
510 )
511 RETURNS TEXT AS $$
512 PythonFunctionBodyOnly(`pca', `pca')
513  return pca.pca_sparse_help_message(schema_madlib, usage_string)
514 $$ LANGUAGE plpythonu;
515 
516 
517 CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.pca_sparse_train()
518 RETURNS TEXT AS $$
519 BEGIN
520  RETURN MADLIB_SCHEMA.pca_sparse_train('');
521 END;
522 $$ LANGUAGE plpgsql VOLATILE;