2.1.0
User Documentation for Apache MADlib

XGBoost is an optimized distributed gradient boosting library [1]. The MADlib implementation is designed to train multiple models with different parameters in order to achieve parallelism on Greenplum systems (PostgreSQL is also supported).

The main use case supported is classification. Regression is not currently supported.

Please note that the XGBoost implementation has the following limitations:

XGBoost requires the following Python libraries to be available on every host machine: xgboost (v0.82, the currently supported version), pandas, and scikit-learn.

The recommended versions are chosen so that Python 2-compatible releases of these libraries can be obtained. Other versions of these libraries might work just as well.
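A quick way to confirm that these libraries are importable from the database's Python runtime is a small PL/Python function. This is a minimal sketch, assuming the plpythonu (Python 2) language is installed; the function name is illustrative, and a plain SELECT exercises only the host it runs on:

CREATE OR REPLACE FUNCTION xgb_dependency_check()
RETURNS TEXT AS $$
    # Import the libraries the XGBoost module relies on and report their versions
    import xgboost, pandas, sklearn
    return 'xgboost=%s, pandas=%s, scikit-learn=%s' % (
        xgboost.__version__, pandas.__version__, sklearn.__version__)
$$ LANGUAGE plpythonu;

SELECT xgb_dependency_check();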

Training Function
SELECT xgboost(
    source_table,
    output_table,
    id_column,
    dependent_variable,
    list_of_features,
    list_of_features_to_exclude,
    params_str,
    sample_weights,
    train_set_size,
    train_set_split_var
)

Arguments

source_table

TEXT. Name of the table containing the training data.

output_table

TEXT. Name of the generated table containing the model. If a table with the same name already exists, an error will be returned. A summary table named <output_table>_summary is also created.

id_column

TEXT. Name of the column containing id information in the training data. This is a mandatory argument and the values are expected to be unique for each row. Suggested column types are INTEGER, BIGINT, and VARCHAR.

dependent_variable

TEXT. Name of the column that contains the output (response) for training. Boolean, integer and text types are accepted for classification.

list_of_features

TEXT. Comma-separated string of column names or expressions to use as predictors. Can also be '*', meaning all columns are used as predictors (except those listed in the next argument, list_of_features_to_exclude).

list_of_features_to_exclude (optional)

TEXT[]. Text array of column names to exclude from the predictors list, for example, ARRAY['height', 'rings']. The names in this parameter must be identical to the names used in the table and quoted appropriately. The id_column and dependent_variable are excluded by default.

params_str (optional)

TEXT. Comma-separated string of key-value pairs used for initializing XGBClassifier. Each parameter can be assigned an array of values. These values are expanded into multiple parameter sets for grid search.

Note that these key-value pairs are passed to XGBoost as-is. Because XGBClassifier accepts arbitrary keyword arguments (kwargs), a mistyped parameter name is silently ignored by XGBoost rather than reported as an error.

eval_metric is a special key in this dictionary. In XGBoost v0.82, this parameter is passed to the fit function rather than the initializer like the rest; from v1.6.0 on, it moved to the initializer. The MADlib interface keeps these parameters together so that the same interface can be maintained when the supported version of XGBoost is upgraded.

Default values set by MADlib: learning_rate: 0.3, max_depth: 6, n_estimators: 100, eval_metric: 'auc'
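For instance, the call below is a sketch of how value lists expand during grid search, reusing the abalone table from the examples later on this page (the output table name xgb_grid_out is illustrative). Two candidate values each for learning_rate and max_depth expand into 2 x 2 = 4 parameter sets, so four models are trained:

SELECT xgboost(
    'abalone',       -- Training table
    'xgb_grid_out',  -- Output table (illustrative name)
    'id', 'sex', '*', NULL,
    $$ { 'learning_rate': [0.01, 0.1], 'max_depth': [6, 9] } $$,
    NULL,  -- no sample weights
    0.8,   -- train on 80% of the rows
    NULL   -- no precomputed split column
);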

sample_weights (optional)

TEXT. Name of the column containing numerical weights, passed as the sample_weight parameter to the XGBClassifier fit function.

train_set_size (optional)

DOUBLE PRECISION. The proportion of the dataset to be used in training. The remaining rows are held out for testing, and their ids are recorded in the test_ids column of the summary table.

train_set_split_var (optional)

TEXT. Name of the column indicating whether the associated row is used for training or testing. If this parameter is set, train_set_size is ignored.
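A minimal sketch of preparing such a column follows; the column name split_col and the 1/0 encoding (1 marking training rows) are illustrative assumptions:

-- Tag roughly 80% of the rows for training, the rest for testing
ALTER TABLE abalone ADD COLUMN split_col INTEGER;
UPDATE abalone SET split_col = (random() < 0.8)::INTEGER;
-- Then pass 'split_col' as train_set_split_var in the xgboost() call.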

Output

The model table produced by the training function contains the following columns:

model BYTEA8. Trained XGBoost model stored in binary format (not human readable).
features TEXT[]. Ordered levels (values) of categorical variables corresponding to the categorical features. Used to help interpret the trained model.
params_index INTEGER. The index of the model. During grid search, each parameter set is given an index to distinguish different models.

A summary table named <output_table>_summary is also created at the same time, which has the following columns:

mdl_train_ts TEXT. The timestamp for the initiation of the xgboost command.
mdl_name TEXT. The name of the model, set to <source_table>_xgboost.
features TEXT[]. The list of features input to XGBoost.
params TEXT. The XGBClassifier parameters used for this particular model.
fnames TEXT[]. The list of features in sorted order; used to interpret the importance column.
importance DOUBLE PRECISION[]. Importance values for each feature, in the same sorted order.
precision DOUBLE PRECISION[]. Precision values for each class.
recall DOUBLE PRECISION[]. Recall values for each class.
fscore DOUBLE PRECISION[]. fscore values for each class.
support DOUBLE PRECISION[]. support values for each class.
test_ids INTEGER[]. The ids of the rows held out to evaluate this particular model.
params_index INTEGER. The index of the parameter set used for this model.
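Once training finishes, the summary table can be queried like any other table; for example, to compare grid-search candidates side by side (using the xgb_out_summary table from the examples below):

SELECT params_index, params, fscore
FROM xgb_out_summary
ORDER BY params_index;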

Prediction Function
SELECT xgboost_predict(
    test_table,
    model_table,
    predict_output_table,
    id_column,
    class_label,
    model_filters
);

Arguments

test_table

TEXT. Name of the table containing the test data.

model_table

TEXT. Name of the table containing the xgboost train output.

predict_output_table

TEXT. Name of the generated table containing the prediction. If a table with the same name already exists, an error will be returned. A metrics table named <predict_output_table>_metrics and an ROC curve table named <predict_output_table>_roc_curve are also created if class_label is not NULL.

id_column

TEXT. Name of the column containing id information in the test data. This is a mandatory argument and the values are expected to be unique for each row.

class_label (optional)

TEXT. Name of the column containing the true class labels, used for the metrics and ROC curve calculations.

model_filters (optional)

TEXT. A filter on the model_table, for when the user wants to use only a subset of the models from the grid search; for example, 'params_index=2' selects the second parameter set (see example 6 below).

Output

The XGBoost prediction function creates up to three tables: <predict_output_table>, <predict_output_table>_metrics and <predict_output_table>_roc_curve. The latter two tables, as well as the label-dependent fields of the metrics table, are available only if a class_label column is provided.

<predict_output_table>

id_column INTEGER. The id of each row, under the same column name as in the test data.
<class_label>_predicted TEXT. The predicted class for the given row id.
<class_label>_proba_predicted DOUBLE PRECISION[]. The predicted class probabilities for the given row id.

<predict_output_table>_metrics

precision DOUBLE PRECISION[]. Precision values for each class in the sorted order.
recall DOUBLE PRECISION[]. Recall values for each class.
fscore DOUBLE PRECISION[]. fscore values for each class.
support DOUBLE PRECISION[]. support values for each class.
roc_auc_scores DOUBLE PRECISION[]. roc_auc scores for each class.
feature_names TEXT[]. The list of features in sorted order; used to interpret the feature importance scores.
feature_importance_scores DOUBLE PRECISION[]. Importance values for each feature in the sorted order.

<predict_output_table>_roc_curve

fpr DOUBLE PRECISION[]. False positive rate for the roc curve.
tpr DOUBLE PRECISION[]. True positive rate for the roc curve.
thresholds DOUBLE PRECISION[]. Threshold values for the roc curve.
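Each row of the ROC curve table holds one point per class, with array position corresponding to class order. As a sketch, the curve for the first class can be extracted from the xgb_score_out_roc_curve table of the examples below:

SELECT fpr[1] AS fpr, tpr[1] AS tpr, thresholds[1] AS threshold
FROM xgb_score_out_roc_curve;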

Examples


  1. Show the input data set.
    SELECT * FROM abalone LIMIT 10;
    
     id | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings
    ----+-----+--------+----------+--------+--------------+----------------+----------------+--------------+-------
      1 | M   |  0.455 |    0.365 |  0.095 |        0.514 |         0.2245 |          0.101 |         0.15 |    15
     12 | M   |   0.43 |     0.35 |   0.11 |        0.406 |         0.1675 |          0.081 |        0.135 |    10
     15 | F   |   0.47 |    0.355 |    0.1 |       0.4755 |         0.1675 |         0.0805 |        0.185 |    10
     20 | M   |   0.45 |     0.32 |    0.1 |        0.381 |         0.1705 |          0.075 |        0.115 |     9
     23 | F   |  0.565 |     0.44 |  0.155 |       0.9395 |         0.4275 |          0.214 |         0.27 |    12
     26 | F   |   0.56 |     0.44 |   0.14 |       0.9285 |         0.3825 |          0.188 |          0.3 |    11
     30 | M   |  0.575 |    0.425 |   0.14 |       0.8635 |          0.393 |          0.227 |          0.2 |    11
     31 | M   |   0.58 |     0.47 |  0.165 |       0.9975 |         0.3935 |          0.242 |         0.33 |    10
     35 | F   |  0.705 |     0.55 |    0.2 |       1.7095 |          0.633 |         0.4115 |         0.49 |    13
     36 | M   |  0.465 |    0.355 |  0.105 |       0.4795 |          0.227 |          0.124 |        0.125 |     8
    
  2. Run XGBoost for a single parameter set and show the summary table. Note that the arguments follow the order of the training function signature above.
    DROP TABLE IF EXISTS xgb_out, xgb_out_summary;
    SELECT xgboost(
        'abalone',  -- Training table
        'xgb_out',  -- Output table for grid search results
        'id',       -- Id column
        'sex',      -- Class label column
        '*',        -- Independent variables
        NULL,       -- Columns to exclude from features
        $$
        {
            'learning_rate': [0.01], #Regularization on weights (eta). For smaller values, increase n_estimators
            'max_depth': [9],#Larger values could lead to overfitting
            'subsample': [0.85],#introduce randomness in samples picked to prevent overfitting
            'colsample_bytree': [0.85],#introduce randomness in features picked to prevent overfitting
            'min_child_weight': [10],#larger values will prevent over-fitting
            'n_estimators':[100] #More estimators, lesser variance (better fit on test set)
        }
        $$,         -- XGBoost grid search parameters
        '',         -- Sample weights
        0.8,        -- Training set size ratio
        NULL        -- Variable used to do the test/train split
    );
    \x on
    SELECT * FROM xgb_out_summary;
    
    -[ RECORD 1 ]+---------------------------------------------
    mdl_train_ts | 2022-05-16 17:42:45.033789+03
    mdl_name     | abalone_xgboost
    features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
    params       | ('colsample_bytree=0.85', 'learning_rate=0.01', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=9')
    fnames       | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
    importance   | {1206,1189,1184,867,766,622,589,526}
    precision    | {0.4295774647887324,0.6858006042296072,0.4318181818181818}
    recall       | {0.46387832699619774,0.8021201413427562,0.328719723183391}
    fscore       | {0.4460694698354662,0.739413680781759,0.3732809430255403}
    support      | {263.0,283.0,289.0}
    test_ids     | {486,3627,432,2766,132,2397,3313,2346...}
    
  3. Run XGBoost prediction. For this example we are using the same abalone table for prediction as well.
    DROP TABLE IF EXISTS xgb_score_out, xgb_score_out_metrics, xgb_score_out_roc_curve;
    SELECT xgboost_predict(
        'abalone',          -- test_table
        'xgb_out',          -- model_table
        'xgb_score_out',    -- predict_output_table
        'id',               -- id_column
        'sex'               -- class_label
    );
    \x off
    SELECT * FROM xgb_score_out LIMIT 10;
    
     id | sex_predicted |              sex_proba_predicted
    ----+---------------+------------------------------------------------
      5 | I             | {0.168330997229,0.632858633995,0.198810324073}
      6 | I             | {0.202547758818,0.574552714825,0.222899526358}
      9 | I             | {0.255484640598,0.44379144907,0.300723910332}
     10 | M             | {0.347418963909,0.242429286242,0.410151779652}
     11 | F             | {0.4157371521,0.316571623087,0.267691165209}
     13 | F             | {0.338543832302,0.32665386796,0.334802359343}
     14 | F             | {0.400314897299,0.293721526861,0.305963635445}
     17 | I             | {0.175603896379,0.608917593956,0.215478509665}
     21 | M             | {0.280931055546,0.333337903023,0.385731071234}
     25 | F             | {0.498989373446,0.185877665877,0.315133005381}
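    Since the predictions carry the row ids, they can be joined back to the labels. A sketch of computing overall accuracy from this output:
    
    SELECT avg((t.sex = p.sex_predicted)::INTEGER) AS accuracy
    FROM abalone t
    JOIN xgb_score_out p USING (id);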
    
  4. Show roc curve and metrics tables
    SELECT * FROM xgb_score_out_roc_curve limit 10;
                fpr            |            tpr            |        thresholds
    ---------------------------+---------------------------+---------------------------
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
     {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
    
    \x on
    SELECT * FROM xgb_score_out_metrics;
    
    -[ RECORD 1 ]-------------+---------------------------------------------------------------------------------------
    precision                 | {0.549047282992237,0.70062893081761,0.598290598290598}
    recall                    | {0.595256312165264,0.830104321907601,0.458115183246073}
    fscore                    | {0.571218795888399,0.759890859481583,0.518902891030393}
    support                   | {1307,1342,1528}
    roc_auc_scores            | {0.77659,0.9091,0.74816}
    feature_names             | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
    feature_importance_scores | {1206,1189,1184,867,766,622,589,526}
    
  5. Run XGBoost grid search with multiple parameter options.
    DROP TABLE IF EXISTS xgb_out, xgb_out_summary;
    SELECT xgboost(
        'abalone',  -- Training table
        'xgb_out',  -- Output table for grid search results
        'id',       -- Id column
        'sex',      -- Class label column
        '*',        -- Independent variables
        NULL,       -- Columns to exclude from features
        $$
        {
            'learning_rate': [0.01,0.1], #Regularization on weights (eta). For smaller values, increase n_estimators
            'max_depth': [9,12],#Larger values could lead to overfitting
            'subsample': [0.85],#introduce randomness in samples picked to prevent overfitting
            'colsample_bytree': [0.85],#introduce randomness in features picked to prevent overfitting
            'min_child_weight': [10],#larger values will prevent over-fitting
            'n_estimators':[100] #More estimators, lesser variance (better fit on test set)
        }
        $$,         -- XGBoost grid search parameters
        '',         -- Sample weights
        0.8,        -- Training set size ratio
        NULL        -- Variable used to do the test/train split
    );
    \x on
    SELECT * FROM xgb_out_summary;
    
    -[ RECORD 1 ]+---------------------------------------------
    mdl_train_ts | 2022-05-16 17:56:11.488767+03
    mdl_name     | abalone_xgboost
    features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
    params       | ('colsample_bytree=0.85', 'learning_rate=0.01', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=12')
    fnames       | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
    importance   | {1276,1205,1200,906,864,822,580,415}
    precision    | {0.4090909090909091,0.752411575562701,0.5}
    recall       | {0.45188284518828453,0.8153310104529616,0.42071197411003236}
    fscore       | {0.42942345924453285,0.782608695652174,0.45694200351493847}
    support      | {239.0,287.0,309.0}
    test_ids     | {3975,35,1759,1469,3437,3951...}
    -[ RECORD 2 ]+---------------------------------------------
    mdl_train_ts | 2022-05-16 17:56:11.488767+03
    mdl_name     | abalone_xgboost
    features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
    params       | ('colsample_bytree=0.85', 'learning_rate=0.01', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=9')
    fnames       | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
    importance   | {1268,1196,1182,860,832,668,595,461}
    precision    | {0.46096654275092935,0.6566265060240963,0.5213675213675214}
    recall       | {0.4901185770750988,0.8288973384030418,0.3824451410658307}
    fscore       | {0.475095785440613,0.7327731092436974,0.4412296564195299}
    support      | {253.0,263.0,319.0}
    test_ids     | {2988,2303,2034,3085,2465,2887...}
    -[ RECORD 3 ]+---------------------------------------------
    mdl_train_ts | 2022-05-16 17:56:11.488767+03
    mdl_name     | abalone_xgboost
    features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
    params       | ('colsample_bytree=0.85', 'learning_rate=0.1', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=9')
    fnames       | {shucked_weight,whole_weight,viscera_weight,shell_weight,length,diameter,height,rings}
    importance   | {998,948,862,765,629,489,441,383}
    precision    | {0.4635036496350365,0.6986301369863014,0.45724907063197023}
    recall       | {0.4847328244274809,0.75,0.40863787375415284}
    fscore       | {0.47388059701492535,0.7234042553191489,0.43157894736842106}
    support      | {262.0,272.0,301.0}
    test_ids     | {396,2150,809,2846,1108,1841...}
    -[ RECORD 4 ]+---------------------------------------------
    mdl_train_ts | 2022-05-16 17:56:11.488767+03
    mdl_name     | abalone_xgboost
    features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
    params       | ('colsample_bytree=0.85', 'learning_rate=0.1', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=12')
    fnames       | {shucked_weight,viscera_weight,whole_weight,shell_weight,length,diameter,height,rings}
    importance   | {1101,1056,1045,958,680,621,458,458}
    precision    | {0.40484429065743943,0.7016949152542373,0.43824701195219123}
    recall       | {0.4517374517374517,0.75,0.36666666666666664}
    fscore       | {0.42700729927007297,0.7250437828371278,0.3992740471869329}
    support      | {259.0,276.0,300.0}
    test_ids     | {1740,2777,1907,581,3525,1022...}
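    
    One way to choose among the four candidates is to rank them by their mean per-class fscore. A sketch over the summary table just shown:
    
    SELECT params_index,
           (SELECT avg(f) FROM unnest(fscore) AS f) AS mean_fscore
    FROM xgb_out_summary
    ORDER BY mean_fscore DESC;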
    
  6. Run XGBoost prediction using only the model trained with the second parameter set, by passing the filter 'params_index=2' as model_filters.
    DROP TABLE IF EXISTS xgb_score_out, xgb_score_out_metrics, xgb_score_out_roc_curve;
    SELECT xgboost_predict(
        'abalone',          -- test_table
        'xgb_out',          -- model_table
        'xgb_score_out',    -- predict_output_table
        'id',               -- id_column
        'sex',              -- class_label
        'params_index=2'    -- model_filters
    );
    \x off
    SELECT * FROM xgb_score_out LIMIT 10;
    
     id | sex_predicted |              sex_proba_predicted
    ----+---------------+------------------------------------------------
      5 | I             | {0.178420484066,0.599636971951,0.221942588687}
      6 | I             | {0.185853347182,0.602131128311,0.212015494704}
      9 | I             | {0.253592431545,0.440728843212,0.30567869544}
     10 | M             | {0.374555230141,0.249226689339,0.376218110323}
     11 | F             | {0.402999937534,0.336779236794,0.260220855474}
     13 | I             | {0.338803291321,0.36541134119,0.295785397291}
     14 | F             | {0.377499818802,0.301990658045,0.320509523153}
     17 | I             | {0.179169252515,0.602536559105,0.218294218183}
     21 | M             | {0.27216938138,0.33275142312,0.395079195499}
     25 | F             | {0.449239164591,0.187060594559,0.36370024085}
    

Literature

[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785