XGBoost [1] is an optimized distributed gradient boosting library. The MADlib implementation achieves parallelism by training multiple models, each with a different parameter set, on Greenplum systems (PostgreSQL is also supported).
Please note that the XGBoost implementation has the following limitation: the main use case supported is classification; regression is not currently supported.
XGBoost requires the following Python libraries to be available on every host machine:
Version numbers are provided to ensure that Python 2 compatible libraries can be obtained. Other versions of these libraries may work just as well.
```sql
SELECT xgboost(
    source_table,
    output_table,
    id_column,
    dependent_variable,
    list_of_features,
    list_of_features_to_exclude,
    params_str,
    sample_weights,
    train_set_size,
    train_set_split_var
);
```
Arguments
source_table: TEXT. Name of the table containing the training data.
output_table: TEXT. Name of the generated table containing the model. If a table with the same name already exists, an error is returned. A summary table named <output_table>_summary is also created.
id_column: TEXT. Name of the column containing id information in the training data. This is a mandatory argument and its values are expected to be unique for each row. Suggested column types are INTEGER, BIGINT, and VARCHAR.
dependent_variable: TEXT. Name of the column that contains the output (response) for training. Boolean, integer, and text types are accepted for classification.
list_of_features: TEXT. Comma-separated string of column names or expressions to use as predictors. Can also be '*', meaning all columns are used as predictors (except those listed in the next argument).
list_of_features_to_exclude: TEXT[]. Array of column names to exclude from the predictor list. The names in this argument must match the names used in the table, quoted appropriately. The id_column and dependent_variable are excluded by default.
params_str: TEXT. Comma-separated string of key-value pairs used to initialize XGBClassifier. Each parameter can be assigned an array of values; these arrays are expanded into the cross product of parameter sets for grid search, as illustrated below.
Note that these key-value pairs are passed to XGBoost as-is. Because XGBClassifier accepts arbitrary keyword arguments (**kwargs), a mistyped parameter name is silently ignored rather than raising an error.
eval_metric is a special key in this dictionary. In XGBoost v0.82, this parameter is passed to the fit function rather than to the initializer like the rest; as of v1.6.0 it was moved to the initializer. The MADlib interface keeps these parameters together so that the same interface can be maintained when the supported XGBoost version is upgraded.
Default values set by MADlib:
- learning_rate: 0.3
- max_depth: 6
- n_estimators: 100
- eval_metric: 'auc'
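For illustration, a hypothetical params_str such as the following expands into 2 × 2 = 4 parameter sets, and one model is trained per set (the grid search example below shows this end to end):

```sql
$$
{
    'learning_rate': [0.01, 0.1],  # two candidate values
    'max_depth': [9, 12]           # two candidate values
}
$$
```

This trains one model each for (0.01, 9), (0.01, 12), (0.1, 9), and (0.1, 12), distinguished by params_index in the output tables.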
sample_weights: TEXT. Name of the column containing numerical weights, passed as the sample_weight parameter to the XGBClassifier fit function.
train_set_size: DOUBLE PRECISION. The proportion of the dataset to be used for training; the remaining rows are held out as the test set, and their ids are recorded in the test_ids column of the summary table.
train_set_split_var: TEXT. Name of the column used to do the test/train split.
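A minimal training call might look like the following sketch. The table and column names (my_table, my_model, label) are hypothetical, and a single-value grid is used so that exactly one model is trained; parameters not listed fall back to the MADlib defaults above.

```sql
SELECT xgboost(
    'my_table',                -- source_table (hypothetical)
    'my_model',                -- output_table
    'id',                      -- id_column
    'label',                   -- dependent_variable (hypothetical)
    '*',                       -- list_of_features: all remaining columns
    NULL,                      -- list_of_features_to_exclude
    $$ {'max_depth': [6]} $$,  -- params_str: one value per key, so one model
    '',                        -- sample_weights: none
    0.8,                       -- train_set_size: 80% train, 20% test
    NULL                       -- train_set_split_var
);
```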
Output
The model table produced by the training function contains the following columns:
| Column | Description |
|---|---|
| model | BYTEA8. Trained XGBoost model stored in binary format (not human readable). |
| features | TEXT[]. Ordered levels (values) of categorical variables corresponding to the categorical features. Used to help interpret the trained model. |
| params_index | INTEGER. The index of the model. During grid search, each parameter set is given an index to distinguish the resulting models. |
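The model column is a binary blob, so it is usually inspected indirectly. As a sketch, assuming the xgb_out output table used in the examples below, the following lists each grid search model with the size of its serialized form:

```sql
SELECT params_index,
       pg_column_size(model) AS model_bytes
FROM xgb_out
ORDER BY params_index;
```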
A summary table named <output_table>_summary is also created at the same time, which has the following columns:
| Column | Description |
|---|---|
| mdl_train_ts | TEXT. The timestamp at which the xgboost command was initiated. |
| mdl_name | TEXT. The name of the model, set to <source_table>_xgboost. |
| features | TEXT[]. The list of features passed to XGBoost. |
| params | TEXT. The XGBClassifier parameters used for this particular model. |
| fnames | TEXT[]. The list of features sorted by decreasing importance; paired with the importance column. |
| importance | DOUBLE PRECISION[]. Importance value for each feature in fnames, in the same order. |
| precision | DOUBLE PRECISION[]. Precision values for each class. |
| recall | DOUBLE PRECISION[]. Recall values for each class. |
| fscore | DOUBLE PRECISION[]. F-score values for each class. |
| support | DOUBLE PRECISION[]. Support values for each class. |
| test_ids | INTEGER[]. The ids of the rows used as the test set for this particular model. |
| params_index | INTEGER. The index of the parameter set used for this model. |
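Because the summary table holds one row per parameter set, it is a convenient place to compare grid search results. A sketch, assuming the xgb_out_summary table from the examples below, ranking parameter sets by their mean F-score across classes:

```sql
SELECT params_index,
       params,
       (SELECT avg(f) FROM unnest(fscore) AS f) AS mean_fscore
FROM xgb_out_summary
ORDER BY mean_fscore DESC;
```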
```sql
SELECT xgboost_predict(
    test_table,
    model_table,
    predict_output_table,
    id_column,
    class_label,
    params_index
);
```
Arguments
test_table: TEXT. Name of the table containing the test data.
model_table: TEXT. Name of the table containing the output of xgboost training.
predict_output_table: TEXT. Name of the generated table containing the predictions. If a table with the same name already exists, an error is returned. A metrics table named <predict_output_table>_metrics and an ROC curve table named <predict_output_table>_roc_curve are also created if class_label is not NULL.
id_column: TEXT. Name of the column containing id information in the test data. This is a mandatory argument and its values are expected to be unique for each row.
class_label: TEXT. Name of the column containing the true class labels, used for the metrics and ROC curve calculations.
params_index: INTEGER. The index of the parameter set identifying which grid search model to use for prediction (see the grid search example below).
Output
The XGBoost prediction function creates three tables: <predict_output_table>, <predict_output_table>_metrics, and <predict_output_table>_roc_curve. Some fields of the metrics table, as well as the roc_curve table, are available only if a class_label column is provided.
<predict_output_table>
| Column | Description |
|---|---|
| <id_column> | The id column from the test data, under its original name and type. |
| <class_label>_predicted | TEXT. The predicted class for the given id. |
| <class_label>_proba_predicted | DOUBLE PRECISION[]. The prediction probability for each class for the given id. |
<predict_output_table>_metrics
| Column | Description |
|---|---|
| precision | DOUBLE PRECISION[]. Precision values for each class, in sorted class order. |
| recall | DOUBLE PRECISION[]. Recall values for each class. |
| fscore | DOUBLE PRECISION[]. F-score values for each class. |
| support | DOUBLE PRECISION[]. Support values for each class. |
| roc_auc_scores | DOUBLE PRECISION[]. ROC AUC scores for each class. |
| feature_names | TEXT[]. The list of features sorted by decreasing importance; paired with feature_importance_scores. |
| feature_importance_scores | DOUBLE PRECISION[]. Importance value for each feature in feature_names, in the same order. |
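The per-class and per-feature metrics are stored as parallel arrays, so unnest can pair them up. A sketch, assuming the xgb_score_out_metrics table from the examples below:

```sql
-- Pair each feature with its importance score (both arrays share the same order)
SELECT unnest(feature_names) AS feature,
       unnest(feature_importance_scores) AS importance
FROM xgb_score_out_metrics;
```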
<predict_output_table>_roc_curve
| Column | Description |
|---|---|
| fpr | DOUBLE PRECISION[]. False positive rates for the ROC curve. |
| tpr | DOUBLE PRECISION[]. True positive rates for the ROC curve. |
| thresholds | DOUBLE PRECISION[]. Threshold values for the ROC curve. |
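Because the prediction table keeps the test table's id column, predictions can be joined back to the source data for ad hoc checks. A sketch, assuming the abalone example tables below, computing the overall accuracy of the predicted labels:

```sql
SELECT avg((p.sex_predicted = a.sex)::int) AS accuracy
FROM xgb_score_out p
JOIN abalone a USING (id);
```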
The examples below use the abalone dataset. First, view a sample of the training data:
```sql
SELECT * FROM abalone LIMIT 10;
```

```
 id | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings
----+-----+--------+----------+--------+--------------+----------------+----------------+--------------+-------
  1 | M   |  0.455 |    0.365 |  0.095 |        0.514 |         0.2245 |          0.101 |         0.15 |    15
 12 | M   |   0.43 |     0.35 |   0.11 |        0.406 |         0.1675 |          0.081 |        0.135 |    10
 15 | F   |   0.47 |    0.355 |    0.1 |       0.4755 |         0.1675 |         0.0805 |        0.185 |    10
 20 | M   |   0.45 |     0.32 |    0.1 |        0.381 |         0.1705 |          0.075 |        0.115 |     9
 23 | F   |  0.565 |     0.44 |  0.155 |       0.9395 |         0.4275 |          0.214 |         0.27 |    12
 26 | F   |   0.56 |     0.44 |   0.14 |       0.9285 |         0.3825 |          0.188 |          0.3 |    11
 30 | M   |  0.575 |    0.425 |   0.14 |       0.8635 |          0.393 |          0.227 |          0.2 |    11
 31 | M   |   0.58 |     0.47 |  0.165 |       0.9975 |         0.3935 |          0.242 |         0.33 |    10
 35 | F   |  0.705 |     0.55 |    0.2 |       1.7095 |          0.633 |         0.4115 |         0.49 |    13
 36 | M   |  0.465 |    0.355 |  0.105 |       0.4795 |          0.227 |          0.124 |        0.125 |     8
```
Train a single model (note that the output table name is the second argument, matching the signature above):

```sql
DROP TABLE IF EXISTS xgb_out, xgb_out_summary;
SELECT xgboost(
    'abalone',  -- Training table
    'xgb_out',  -- Output table (grid search results)
    'id',       -- Id column
    'sex',      -- Class label column
    '*',        -- Independent variables
    NULL,       -- Columns to exclude from features
    $$
    {
        'learning_rate': [0.01],     # Regularization on weights (eta). For smaller values, increase n_estimators
        'max_depth': [9],            # Larger values could lead to overfitting
        'subsample': [0.85],         # Introduce randomness in samples picked to prevent overfitting
        'colsample_bytree': [0.85],  # Introduce randomness in features picked to prevent overfitting
        'min_child_weight': [10],    # Larger values will prevent over-fitting
        'n_estimators': [100]        # More estimators, lesser variance (better fit on test set)
    }
    $$,         -- XGBoost grid search parameters
    '',         -- Sample weights
    0.8,        -- Training set size ratio
    NULL        -- Variable used to do the test/train split
);
\x on
SELECT * FROM xgb_out_summary;
```
```
-[ RECORD 1 ]+---------------------------------------------
mdl_train_ts | 2022-05-16 17:42:45.033789+03
mdl_name     | abalone_xgboost
features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
params       | ('colsample_bytree=0.85', 'learning_rate=0.01', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=9')
fnames       | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
importance   | {1206,1189,1184,867,766,622,589,526}
precision    | {0.4295774647887324,0.6858006042296072,0.4318181818181818}
recall       | {0.46387832699619774,0.8021201413427562,0.328719723183391}
fscore       | {0.4460694698354662,0.739413680781759,0.3732809430255403}
support      | {263.0,283.0,289.0}
test_ids     | {486,3627,432,2766,132,2397,3313,2346...}
```
Score with the trained model; for simplicity, the training table is reused as the test table:

```sql
DROP TABLE IF EXISTS xgb_score_out, xgb_score_out_metrics, xgb_score_out_roc_curve;
SELECT xgboost_predict(
    'abalone',        -- test_table
    'xgb_out',        -- model_table
    'xgb_score_out',  -- predict_output_table
    'id',             -- id_column
    'sex'             -- class_label
);
\x off
SELECT * FROM xgb_score_out LIMIT 10;
```
```
 id | sex_predicted | sex_proba_predicted
----+---------------+------------------------------------------------
  5 | I             | {0.168330997229,0.632858633995,0.198810324073}
  6 | I             | {0.202547758818,0.574552714825,0.222899526358}
  9 | I             | {0.255484640598,0.44379144907,0.300723910332}
 10 | M             | {0.347418963909,0.242429286242,0.410151779652}
 11 | F             | {0.4157371521,0.316571623087,0.267691165209}
 13 | F             | {0.338543832302,0.32665386796,0.334802359343}
 14 | F             | {0.400314897299,0.293721526861,0.305963635445}
 17 | I             | {0.175603896379,0.608917593956,0.215478509665}
 21 | M             | {0.280931055546,0.333337903023,0.385731071234}
 25 | F             | {0.498989373446,0.185877665877,0.315133005381}
```
```sql
SELECT * FROM xgb_score_out_roc_curve LIMIT 10;
```
```
            fpr            |            tpr            |        thresholds
---------------------------+---------------------------+---------------------------
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
 {0.00000,0.00000,0.00000} | {0.00077,0.00075,0.00065} | {0.54791,0.65354,0.49352}
```
```sql
\x on
SELECT * FROM xgb_score_out_metrics;
```
```
-[ RECORD 1 ]-------------+---------------------------------------------------------------------------------------
precision                 | {0.549047282992237,0.70062893081761,0.598290598290598}
recall                    | {0.595256312165264,0.830104321907601,0.458115183246073}
fscore                    | {0.571218795888399,0.759890859481583,0.518902891030393}
support                   | {1307,1342,1528}
roc_auc_scores            | {0.77659,0.9091,0.74816}
feature_names             | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
feature_importance_scores | {1206,1189,1184,867,766,622,589,526}
```
Next, run a grid search over multiple parameter combinations:

```sql
DROP TABLE IF EXISTS xgb_out, xgb_out_summary;
SELECT xgboost(
    'abalone',  -- Training table
    'xgb_out',  -- Output table (grid search results)
    'id',       -- Id column
    'sex',      -- Class label column
    '*',        -- Independent variables
    NULL,       -- Columns to exclude from features
    $$
    {
        'learning_rate': [0.01, 0.1],  # Regularization on weights (eta). For smaller values, increase n_estimators
        'max_depth': [9, 12],          # Larger values could lead to overfitting
        'subsample': [0.85],           # Introduce randomness in samples picked to prevent overfitting
        'colsample_bytree': [0.85],    # Introduce randomness in features picked to prevent overfitting
        'min_child_weight': [10],      # Larger values will prevent over-fitting
        'n_estimators': [100]          # More estimators, lesser variance (better fit on test set)
    }
    $$,         -- XGBoost grid search parameters
    '',         -- Sample weights
    0.8,        -- Training set size ratio
    NULL        -- Variable used to do the test/train split
);
\x on
SELECT * FROM xgb_out_summary;
```
```
-[ RECORD 1 ]+---------------------------------------------
mdl_train_ts | 2022-05-16 17:56:11.488767+03
mdl_name     | abalone_xgboost
features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
params       | ('colsample_bytree=0.85', 'learning_rate=0.01', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=12')
fnames       | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
importance   | {1276,1205,1200,906,864,822,580,415}
precision    | {0.4090909090909091,0.752411575562701,0.5}
recall       | {0.45188284518828453,0.8153310104529616,0.42071197411003236}
fscore       | {0.42942345924453285,0.782608695652174,0.45694200351493847}
support      | {239.0,287.0,309.0}
test_ids     | {3975,35,1759,1469,3437,3951...}
-[ RECORD 2 ]+---------------------------------------------
mdl_train_ts | 2022-05-16 17:56:11.488767+03
mdl_name     | abalone_xgboost
features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
params       | ('colsample_bytree=0.85', 'learning_rate=0.01', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=9')
fnames       | {viscera_weight,whole_weight,shucked_weight,shell_weight,length,rings,diameter,height}
importance   | {1268,1196,1182,860,832,668,595,461}
precision    | {0.46096654275092935,0.6566265060240963,0.5213675213675214}
recall       | {0.4901185770750988,0.8288973384030418,0.3824451410658307}
fscore       | {0.475095785440613,0.7327731092436974,0.4412296564195299}
support      | {253.0,263.0,319.0}
test_ids     | {2988,2303,2034,3085,2465,2887...}
-[ RECORD 3 ]+---------------------------------------------
mdl_train_ts | 2022-05-16 17:56:11.488767+03
mdl_name     | abalone_xgboost
features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
params       | ('colsample_bytree=0.85', 'learning_rate=0.1', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=9')
fnames       | {shucked_weight,whole_weight,viscera_weight,shell_weight,length,diameter,height,rings}
importance   | {998,948,862,765,629,489,441,383}
precision    | {0.4635036496350365,0.6986301369863014,0.45724907063197023}
recall       | {0.4847328244274809,0.75,0.40863787375415284}
fscore       | {0.47388059701492535,0.7234042553191489,0.43157894736842106}
support      | {262.0,272.0,301.0}
test_ids     | {396,2150,809,2846,1108,1841...}
-[ RECORD 4 ]+---------------------------------------------
mdl_train_ts | 2022-05-16 17:56:11.488767+03
mdl_name     | abalone_xgboost
features     | {length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings}
params       | ('colsample_bytree=0.85', 'learning_rate=0.1', 'min_child_weight=10', 'n_estimators=100', 'subsample=0.85', 'max_depth=12')
fnames       | {shucked_weight,viscera_weight,whole_weight,shell_weight,length,diameter,height,rings}
importance   | {1101,1056,1045,958,680,621,458,458}
precision    | {0.40484429065743943,0.7016949152542373,0.43824701195219123}
recall       | {0.4517374517374517,0.75,0.36666666666666664}
fscore       | {0.42700729927007297,0.7250437828371278,0.3992740471869329}
support      | {259.0,276.0,300.0}
test_ids     | {1740,2777,1907,581,3525,1022...}
```
Use params_index to select which of the grid search models to score with:

```sql
DROP TABLE IF EXISTS xgb_score_out, xgb_score_out_metrics, xgb_score_out_roc_curve;
SELECT xgboost_predict(
    'abalone',        -- test_table
    'xgb_out',        -- model_table
    'xgb_score_out',  -- predict_output_table
    'id',             -- id_column
    'sex',            -- class_label
    2                 -- params_index
);
\x off
SELECT * FROM xgb_score_out LIMIT 10;
```
```
 id | sex_predicted | sex_proba_predicted
----+---------------+------------------------------------------------
  5 | I             | {0.178420484066,0.599636971951,0.221942588687}
  6 | I             | {0.185853347182,0.602131128311,0.212015494704}
  9 | I             | {0.253592431545,0.440728843212,0.30567869544}
 10 | M             | {0.374555230141,0.249226689339,0.376218110323}
 11 | F             | {0.402999937534,0.336779236794,0.260220855474}
 13 | I             | {0.338803291321,0.36541134119,0.295785397291}
 14 | F             | {0.377499818802,0.301990658045,0.320509523153}
 17 | I             | {0.179169252515,0.602536559105,0.218294218183}
 21 | M             | {0.27216938138,0.33275142312,0.395079195499}
 25 | F             | {0.449239164591,0.187060594559,0.36370024085}
```
[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785