▼Data Types and Transformations | |
►Arrays and Matrices | Mathematical operations for arrays and matrices |
Array Operations | Provides fast array operations supporting other MADlib modules |
Matrix Operations | Provides fast matrix operations supporting other MADlib modules |
►Matrix Factorization | Linear algebra methods that factorize a matrix into a product of matrices |
Low-Rank Matrix Factorization | Performs low-rank matrix factorization for an incomplete matrix |
Singular Value Decomposition | Performs factorization of dense and sparse matrices |
Norms and Distance Functions | Provides utility functions for basic linear algebra operations |
Sparse Vectors | Implements a sparse vector data type that provides compressed storage of vectors that may have many duplicate elements |
Encoding Categorical Variables | Functions to encode categorical variables to prepare data for input into predictive algorithms |
Path | A function to perform complex pattern matching across rows and extract useful information about the matches |
Pivot | Pivoting and data summarization tools for preparing data for modeling operations |
Sessionize | Session reconstruction of data consisting of a time-stamped sequence of events |
Stemming | Provides Porter stemmer operations supporting other MADlib modules |
▼Deep Learning | A collection of modules for deep learning |
►Model Preparation | Prepare models and data for deep learning |
Preprocess Data | Prepare training data for use by deep learning modules |
Define Model Architectures | Function to load model architectures and weights into a table |
Define Custom Functions | Function to load serialized Python objects into a table |
Train Single Model | Fit, evaluate and predict for one model |
►Train Multiple Models | Train multiple deep learning models at the same time for model architecture search and hyperparameter selection |
Define Model Configurations | Generate configurations for model architecture search and hyperparameter tuning |
Train Model Configurations | Explore network architectures and hyperparameters by training many models at a time |
AutoML | Functions to run automated machine learning (autoML) methods for model architecture search and hyperparameter tuning |
►Utilities for Deep Learning | Utilities specific to deep learning workflows |
Show GPU Configuration | Utility function to report number and type of GPUs in the database cluster |
▼Graph | Graph algorithms and measures associated with graphs |
All Pairs Shortest Path | Finds the shortest paths between every vertex pair in a given graph |
Breadth-First Search | Finds the nodes reachable from a given source vertex using a breadth-first approach |
HITS | Finds the HITS scores (authority and hub) of all vertices in a directed graph |
►Measures | A collection of metrics computed on a graph |
Average Path Length | Computes the average shortest-path length of a graph |
Closeness | Computes the closeness centrality value of each node in the graph |
Graph Diameter | Computes the diameter of a graph |
In-Out Degree | Computes the degrees for each vertex |
PageRank | Finds the PageRank of all vertices in a directed graph |
Single Source Shortest Path | Finds the shortest path from a single source vertex to every other vertex in a given graph |
Weakly Connected Components | Finds all weakly connected components of a graph |
▼Model Selection | Functions for model selection and model evaluation |
Cross Validation | Estimates the fit of a predictive model given a data set and specifications for the training, prediction, and error estimation functions |
Prediction Metrics | Provides various prediction accuracy metrics |
Train-Test Split | A method for splitting a data set into separate training and testing sets |
▼Sampling | A collection of methods for sampling from a population |
Balanced Sampling | A method to independently sample classes to produce a balanced data set. This is commonly used when classes are imbalanced, to ensure that subclasses are adequately represented in the sample |
Stratified Sampling | A method for independently sampling subpopulations (strata) |
▼Statistics | A collection of probability and statistics modules |
►Descriptive Statistics | Methods to compute descriptive statistics of a dataset |
►Cardinality Estimators | Methods to estimate the number of unique values contained in data |
CountMin (Cormode-Muthukrishnan) | Implements Cormode-Muthukrishnan CountMin sketches on integer values as a user-defined aggregate |
FM (Flajolet-Martin) | Implements Flajolet-Martin's distinct count estimation as a user-defined aggregate |
MFV (Most Frequent Values) | Implements the most frequent values variant of the CountMin sketch as a user-defined aggregate |
Covariance and Correlation | Generates a covariance or Pearson correlation matrix for pairs of numeric columns in a table |
Summary | Calculates general descriptive statistics for any data table |
►Inferential Statistics | Methods to compute inferential statistics of a dataset |
Hypothesis Tests | Provides functions to perform statistical hypothesis tests |
Probability Functions | Provides cumulative distribution, density/mass, and quantile functions for a wide range of probability distributions |
▼Supervised Learning | Methods to perform a variety of supervised learning tasks |
Conditional Random Field | Constructs a Conditional Random Fields (CRF) model for labeling sequential data |
k-Nearest Neighbors | Finds the \(k\) nearest data points to a given data point and outputs the majority vote of their classes for classification, or the average of their target values for regression |
Neural Network | Solves classification and regression problems with several fully connected layers and non-linear activation functions |
►Regression Models | A collection of methods for modeling conditional expectation of a response variable |
Clustered Variance | Calculates clustered variance for linear, logistic, and multinomial logistic regression models, and Cox proportional hazards models |
Cox-Proportional Hazards Regression | Models the relationship between one or more independent predictor variables and the amount of time before an event occurs |
Elastic Net Regularization | Generates a regularized regression model for variable selection in linear and logistic regression problems, combining the L1 and L2 penalties of the lasso and ridge methods |
Generalized Linear Models | Estimates a generalized linear model (GLM). GLM is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value |
Linear Regression | Also called Ordinary Least Squares Regression, models the linear relationship between a dependent variable and one or more independent variables |
Logistic Regression | Models the relationship between one or more predictor variables and a binary categorical dependent variable by predicting the probability of the dependent variable using a logistic function |
Marginal Effects | Calculates marginal effects for the coefficients in regression problems |
Multinomial Regression | Models the conditional distribution of a multinomial response variable using a linear combination of predictors |
Ordinal Regression | Regression to model data with an ordinal response variable |
Robust Variance | Calculates Huber-White variance estimates for linear, logistic, and multinomial regression models, and for Cox proportional hazards models |
Support Vector Machines | Solves classification and regression problems by separating data with a hyperplane or other nonlinear decision boundary |
►Tree Methods | A collection of recursive partitioning (tree) methods |
Decision Tree | Decision trees are tree-based supervised learning methods that can be used for classification and regression |
Random Forest | Random forest is an ensemble learning method for classification and regression that constructs a multitude of decision trees at training time, then outputs the mean (regression) or mode (classification) of the predictions produced by the individual trees |
▼Time Series Analysis | A collection of methods to analyze time series data |
ARIMA | Generates a model with autoregressive, moving average, and integrated components for a time series dataset |
▼Unsupervised Learning | A collection of methods for unsupervised learning tasks |
►Association Rules | Methods used to discover patterns in transactional datasets |
Apriori Algorithm | Computes association rules for a given set of data |
►Clustering | Methods for clustering data |
k-Means Clustering | Partitions a set of observations into clusters by finding centroids that minimize the sum of observations' distances from their closest centroid |
►Dimensionality Reduction | Methods for reducing the number of variables in a dataset to obtain a set of principal variables |
Principal Component Analysis | Produces a model that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components |
Principal Component Projection | Projects a higher dimensional data point to a lower dimensional subspace spanned by principal components learned through the PCA training procedure |
►Topic Modelling | A collection of methods to uncover abstract topics in a document corpus |
Latent Dirichlet Allocation | Generates a Latent Dirichlet Allocation predictive model for a collection of documents |
▼Utilities | |
Columns to Vector | Create a new table with all feature columns inserted into a single column as an array |
Database Functions | Provides a collection of user-defined functions for performing common tasks in the database |
►Linear Solvers | Methods that implement solutions for systems of consistent linear equations |
Dense Linear Systems | Implements solution methods for large dense linear systems. Currently restricted to problems that fit in memory |
Sparse Linear Systems | Implements solution methods for linear systems with sparse matrix input. Currently restricted to problems that fit in memory |
Mini-Batch Preprocessor | Utility that prepares input data for use by models that support mini-batch as an optimization option |
PMML Export | Implements the PMML XML standard to describe and exchange models produced by data mining and machine learning algorithms |
Term Frequency | Provides a collection of functions for performing common tasks related to text analytics |
Vector to Columns | Converts a feature array in a single column of an output table into multiple columns |
▼Early Stage Development | |
Conjugate Gradient | Finds the solution to the equation \( \boldsymbol A \boldsymbol x = \boldsymbol b \), where \( \boldsymbol A \) is a symmetric, positive-definite matrix and \( \boldsymbol x \) and \( \boldsymbol b \) are vectors |
DBSCAN | Partitions a set of observations into clusters of arbitrary shape based on the density of nearby neighbors |
Naive Bayes Classification | Constructs a classification model from a dataset where each attribute independently contributes to the probability that a data point belongs to a category |
Random Sampling | Provides utility functions for sampling operations |
XGBoost | This module allows you to use SQL to build gradient boosted tree models designed in XGBoost [1] |
▼Deprecated Modules | |
Create Indicator Variables | Provides utility functions helpful for data preparation before modeling |
Multinomial Logistic Regression | Also called softmax regression, models the relationship between one or more independent variables and a categorical dependent variable |