The MADlib summary() function produces summary statistics for any data table. The function invokes various methods from the MADlib library to provide the data overview.
summary ( source_table, output_table, target_cols, grouping_cols, get_distinct, get_quartiles, ntile_array, how_many_mfv, get_estimates, n_cols_per_run )
The summary() function returns a composite type containing three fields:
output_table | TEXT. The name of the output table. |
---|---|
num_col_summarized | INTEGER. The number of columns from the source table that have been summarized. |
duration | FLOAT8. The time taken (in seconds) to compute the summary. |
Arguments
TEXT. Name of the table containing the input data.
TEXT. Name of the table for the output summary statistics. This table contains the following columns:
group_by | Group-by column name. NULL if none provided. |
---|---|
group_by_value | Value of the group-by column. NULL if there is no grouping. |
target_column | Targeted column values for which summary is requested. |
column_number | Physical column number for the target column, as described in pg_attribute catalog. |
data_type | Data type of the target column. Standard GPDB type descriptors are displayed. |
row_count | Number of rows for the target column. |
distinct_values | Number of distinct values in the target column. If the summary() function is called with the get_estimates argument set to TRUE (default), then this is an estimated statistic based on the Flajolet-Martin distinct count estimator. If the get_estimates argument set to FALSE, will use PostgreSQL COUNT DISTINCT. |
missing_values | Number of missing values in the target column. |
blank_values | Number of blank values. Blanks are defined by this regular expression:'^\w*$' |
fraction_missing | Percentage of total rows that are missing, as a decimal value, e.g. 0.3. |
fraction_blank | Percentage of total rows that are blank, as a decimal value, e.g. 0.3. |
positive_values | Number of positive values in the target column if target is numeric, otherwise NULL. |
negative_values | Number of negative values in the target column if target is numeric, otherwise NULL. |
zero_values | Number of zero values in the target column if target is numeric, otherwise NULL. Note that we are reporting exact equality to 0.0 here, so even if you have a float value that is extremely small (say due to rounding), it will not be reported as a zero value. |
mean | Mean value of target column if target is numeric, otherwise NULL. |
variance | Variance of target column if target is numeric, otherwise NULL. |
confidence_interval | Confidence interval (95% using z-score) of the mean value for the target column if target is numeric, otherwise NULL. Presented as an array of two elements in the form {lower bound, upper bound}. |
min | Minimum value of target column. For strings this is the length of the shortest string. |
max | Maximum value of target column. For strings this is the length of the longest string. |
first_quartile | First quartile (25th percentile), only for numeric columns. (Unavailable for PostgreSQL 9.3 or lower.) |
median | Median value of target column, if target is numeric, otherwise NULL. (Unavailable for PostgreSQL 9.3 or lower.) |
third_quartile | Third quartile (25th percentile), only for numeric columns. (Unavailable for PostgreSQL 9.3 or lower.) |
quantile_array | Percentile values corresponding to ntile_array. (Unavailable for PostgreSQL 9.3 or lower.) |
most_frequent_values | An array containing the most frequently occurring values. The how_many_mfv argument determines the length of the array, which is 10 by default. If the summary() function is called with the get_estimates argument set to TRUE (default), the frequent values computation is performed using a parallel aggregation method that is faster, but in some cases may fail to detect the exact most frequent values. |
mfv_frequencies | Array containing the frequency count for each of the most frequent values. |
TEXT, default NULL. A comma-separated list of columns to summarize. If NULL, summaries are produced for all columns.
BOOLEAN, default TRUE. If true, distinct values are counted. The method for computing distinct values depends on the setting of the 'get_estimates' parameter below.
BOOLEAN, default TRUE. If TRUE, quartiles are computed.
INTEGER, default: 10. The number of most-frequent-values to compute. The method for computing MFV depends on the setting of the 'get_estimates' parameter below.
BOOLEAN, default TRUE. If TRUE, estimated values are produced for distinct values and most frequent values. If FALSE, exact values are calculated which will take longer to run, with the impact depending on data size.
SELECT * FROM madlib.summary();
DROP TABLE IF EXISTS iris; CREATE TABLE iris (id INT, sepal_length FLOAT, sepal_width FLOAT, petal_length FLOAT, petal_width FLOAT, class_name text); INSERT INTO iris VALUES (1,5.1,3.5,1.4,0.2,'Iris-setosa'), (2,4.9,3.0,1.4,0.2,'Iris-setosa'), (3,4.7,3.2,1.3,0.2,'Iris-setosa'), (4,4.6,3.1,1.5,0.2,'Iris-setosa'), (5,5.0,3.6,1.4,0.2,'Iris-setosa'), (6,5.4,3.9,1.7,0.4,'Iris-setosa'), (7,4.6,3.4,1.4,0.3,'Iris-setosa'), (8,5.0,3.4,1.5,0.2,'Iris-setosa'), (9,4.4,2.9,1.4,0.2,'Iris-setosa'), (10,4.9,3.1,1.5,0.1,'Iris-setosa'), (11,7.0,3.2,4.7,1.4,'Iris-versicolor'), (12,6.4,3.2,4.5,1.5,'Iris-versicolor'), (13,6.9,3.1,4.9,1.5,'Iris-versicolor'), (14,5.5,2.3,4.0,1.3,'Iris-versicolor'), (15,6.5,2.8,4.6,1.5,'Iris-versicolor'), (16,5.7,2.8,4.5,1.3,'Iris-versicolor'), (17,6.3,3.3,4.7,1.6,'Iris-versicolor'), (18,4.9,2.4,3.3,1.0,'Iris-versicolor'), (19,6.6,2.9,4.6,1.3,'Iris-versicolor'), (20,5.2,2.7,3.9,1.4,'Iris-versicolor'), (21,6.3,3.3,6.0,2.5,'Iris-virginica'), (22,5.8,2.7,5.1,1.9,'Iris-virginica'), (23,7.1,3.0,5.9,2.1,'Iris-virginica'), (24,6.3,2.9,5.6,1.8,'Iris-virginica'), (25,6.5,3.0,5.8,2.2,'Iris-virginica'), (26,7.6,3.0,6.6,2.1,'Iris-virginica'), (27,4.9,2.5,4.5,1.7,'Iris-virginica'), (28,7.3,2.9,6.3,1.8,'Iris-virginica'), (29,6.7,2.5,5.8,1.8,'Iris-virginica'), (30,7.2,3.6,6.1,2.5,'Iris-virginica');
DROP TABLE IF EXISTS iris_summary; SELECT * FROM madlib.summary( 'iris', -- Source table 'iris_summary' -- Output table );Result:
output_table | num_col_summarized | duration --------------+--------------------+------------------- iris_summary | 6 | 0.574938058853149 (1 row)View the summary data.
-- Turn on expanded display for readability. \x on SELECT * FROM iris_summary;Result (partial):
... -[ RECORD 2 ]--------+--------------------------------------------- group_by | group_by_value | target_column | sepal_length column_number | 2 data_type | float8 row_count | 30 distinct_values | 22 missing_values | 0 blank_values | fraction_missing | 0 fraction_blank | positive_values | 30 negative_values | 0 zero_values | 0 mean | 5.84333333333333 variance | 0.929436781609188 confidence_interval | {5.49834423494374,6.18832243172292} min | 4.4 max | 7.6 first_quartile | 4.925 median | 5.75 third_quartile | 6.575 most_frequent_values | {4.9,6.3,5,6.5,4.6,7.2,5.5,5.7,7.3,6.7} mfv_frequencies | {4,3,2,2,2,1,1,1,1,1} ... -[ RECORD 6 ]--------+--------------------------------------------- group_by | group_by_value | target_column | class_name column_number | 6 data_type | text row_count | 30 distinct_values | 3 missing_values | 0 blank_values | 0 fraction_missing | 0 fraction_blank | 0 positive_values | negative_values | zero_values | mean | variance | confidence_interval | min | 11 max | 15 first_quartile | median | third_quartile | most_frequent_values | {Iris-setosa,Iris-versicolor,Iris-virginica} mfv_frequencies | {10,10,10}Note that for the text column in record 6, some statistics are n/a, and the min and max values represent the length of the shortest and longest strings respectively.
DROP TABLE IF EXISTS iris_summary; SELECT * FROM madlib.summary( 'iris', -- Source table 'iris_summary', -- Output table 'sepal_length, sepal_width', -- Columns to summarize 'class_name' -- Grouping column ); SELECT * FROM iris_summary;Result (partial):
-[ RECORD 1 ]--------+---------------------------------------- group_by | class_name group_by_value | Iris-setosa target_column | sepal_length column_number | 2 data_type | float8 row_count | 10 distinct_values | 7 missing_values | 0 blank_values | fraction_missing | 0 fraction_blank | positive_values | 10 negative_values | 0 zero_values | 0 mean | 4.86 variance | 0.0848888888888875 confidence_interval | {4.67941507384182,5.04058492615818} min | 4.4 max | 5.4 first_quartile | 4.625 median | 4.9 third_quartile | 5 most_frequent_values | {4.9,5,4.6,5.1,4.7,5.4,4.4} mfv_frequencies | {2,2,2,1,1,1,1} ... -[ RECORD 3 ]--------+---------------------------------------- group_by | class_name group_by_value | Iris-versicolor target_column | sepal_length column_number | 2 data_type | float8 row_count | 10 distinct_values | 10 missing_values | 0 blank_values | fraction_missing | 0 fraction_blank | positive_values | 10 negative_values | 0 zero_values | 0 mean | 6.1 variance | 0.528888888888893 confidence_interval | {5.64924734548141,6.55075265451859} min | 4.9 max | 7 first_quartile | 5.55 median | 6.35 third_quartile | 6.575 most_frequent_values | {6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,7,6.4} mfv_frequencies | {1,1,1,1,1,1,1,1,1,1} ...
DROP TABLE IF EXISTS iris_summary; SELECT * FROM madlib.summary( 'iris', -- Source table 'iris_summary', -- Output table 'sepal_length, sepal_width', -- Columns to summarize NULL, -- No grouping TRUE, -- Get distinct values FALSE, -- Dont get quartiles ARRAY[0.33, 0.66], -- Get ntiles 3, -- Number of MFV to compute FALSE -- Get exact values ); SELECT * FROM iris_summary;Result:
-[ RECORD 1 ]--------+------------------------------------ group_by | group_by_value | target_column | sepal_length column_number | 2 data_type | float8 row_count | 30 distinct_values | 22 missing_values | 0 blank_values | fraction_missing | 0 fraction_blank | positive_values | 30 negative_values | 0 zero_values | 0 mean | 5.84333333333333 variance | 0.929436781609175 confidence_interval | {5.49834423494375,6.18832243172292} min | 4.4 max | 7.6 quantile_array | {5.057,6.414} most_frequent_values | {4.9,6.3,6.5} mfv_frequencies | {4,3,2} -[ RECORD 2 ]--------+------------------------------------ group_by | group_by_value | target_column | sepal_width column_number | 3 data_type | float8 row_count | 30 distinct_values | 14 missing_values | 0 blank_values | fraction_missing | 0 fraction_blank | positive_values | 30 negative_values | 0 zero_values | 0 mean | 3.04 variance | 0.13903448275862 confidence_interval | {2.90656901047539,3.17343098952461} min | 2.3 max | 3.9 quantile_array | {2.9,3.2} most_frequent_values | {2.9,3,3.2} mfv_frequencies | {4,4,3}
FM (Flajolet-Martin)
MFV (Most Frequent Values)
CountMin (Cormode-Muthukrishnan)