MADlib
0.7 A newer version is available
User Documentation
|
'summary' is a generic function used to produce summary statistics of any data table. The function invokes particular 'methods' from the MADlib library to provide an overview of the data.
SELECT MADlib.summary ( source_table TEXT, -- Source table name (Required) output_table TEXT, -- Output table name (Required) target_cols TEXT, -- Comma separated columns for which summary is desired -- (Default: NULL - produces result for all columns) grouping_cols TEXT, -- Comma separated columns on which to group results -- (Default: NULL - provides summary on complete table) get_distinct BOOLEAN, -- Are distinct values required? -- (Default: True) get_quartiles BOOLEAN, -- Are quartiles required? -- (Default: True) ntile_array FLOAT8[], -- Array of quantile values to compute -- (Default: NULL - Quantile array not included) how_many_mfv INTEGER, -- How many most-frequent-values to compute? -- (Default: 10) get_estimates BOOLEAN -- Should we produce exact or estimated values? ) -- (Default: True)
Note:
The output of the function is a composite type containing: ouput_table TEXT, -- Name of the output table row_count INT4, -- Number of rows in the output table duration FLOAT8 -- Time taken (in seconds) to compute the summary
The summary stastics are stored in the 'output_table' relation provided in the arguments. The relation 'output_table' can contain the following table (presence of some columns depends on the argument values)
- group_by_column : Group-by column names (NULL if none provided) - group_by_value : Values of the Group-by column (NULL if no grouping) - target_column : Targeted column values for which summary is requested - column_number : Physical column number for the target column, as described in pg_attribute - data_type : Data type of target column. Standard GPDB descriptors will be displayed - row_count : Number of rows for the target column - distinct_values : Number of distinct values in the target column - missing_values : Number of missing values in the target column - blank_values : Number of blank values (blanks are defined by the regular expression '^\w*$') - fraction_missing : Percentage of total rows that are missing. Will be expressed as a decimal (e.g. 0.3) - fraction_blank : Percentage of total rows that are blank. Will be expressed as a decimal (e.g. 0.3) - mean : Mean value of target column (if target is numeric, else NULL) - variance : Variance of target columns (if target is numeric, else NULL for strings) - min : Min value of target column (for strings this is the length of the shortest string) - max : Max value of target column (for strings this is the length of the longest string) - first_quartile : First quartile (25th percentile, only for numeric columns) - median : Median value of target column (if target is numeric, else NULL) - third_quartile : Third quartile (25th percentile, only for numeric columns) - quantile_array : Percentile values corresponding to ntile_array - most_frequent_values : Most frequent values - mfv_frequencies : Frequency of the most frequent values
The output can be obtained as
sql> SELECT * FROM 'output_table';
The usage information can be obtained at any time directly from the function using
sql> SELECT summary('usage');