1.20.0 User Documentation for Apache MADlib
▼Data Types and Transformations
  ►Arrays and Matrices: Mathematical operations for arrays and matrices
  Encoding Categorical Variables: Functions to encode categorical variables to prepare data for input into predictive algorithms
  Path: A function to perform complex pattern matching across rows and extract useful information about the matches
  Pivot: Pivoting and data summarization tools for preparing data for modeling operations
  Sessionize: Session reconstruction of data consisting of a time-stamped sequence of events
  Stemming: Provides Porter stemmer operations supporting other MADlib modules
▼Deep Learning: A collection of modules for deep learning
  ►Model Preparation: Prepare models and data for deep learning
  Train Single Model: Fit, evaluate and predict for one model
  ►Train Multiple Models: Train multiple deep learning models at the same time for model architecture search and hyperparameter selection
  ►Utilities for Deep Learning: Utilities specific to deep learning workflows
▼Graph: Graph algorithms and measures associated with graphs
  All Pairs Shortest Path: Finds the shortest paths between every vertex pair in a given graph
  Breadth-First Search: Finds the nodes reachable from a given source vertex using a breadth-first approach
  HITS: Finds the HITS scores (authority and hub) of all vertices in a directed graph
  ►Measures: A collection of metrics computed on a graph
  PageRank: Finds the PageRank of all vertices in a directed graph
  Single Source Shortest Path: Finds the shortest path from a single source vertex to every other vertex in a given graph
  Weakly Connected Components: Finds all weakly connected components of a graph
▼Model Selection: Functions for model selection and model evaluation
  Cross Validation: Estimates the fit of a predictive model given a data set and specifications for the training, prediction, and error estimation functions
  Prediction Metrics: Provides various prediction accuracy metrics
  Train-Test Split: A method for splitting a data set into separate training and testing sets
▼Sampling: A collection of methods for sampling from a population
  Balanced Sampling: A method to independently sample classes to produce a balanced data set. This is commonly used when classes are imbalanced, to ensure that subclasses are adequately represented in the sample
  Stratified Sampling: A method for independently sampling subpopulations (strata)
▼Statistics: A collection of probability and statistics modules
  ►Descriptive Statistics: Methods to compute descriptive statistics of a dataset
  ►Inferential Statistics: Methods to compute inferential statistics of a dataset
  Probability Functions: Provides cumulative distribution, density/mass, and quantile functions for a wide range of probability distributions
▼Supervised Learning: Methods to perform a variety of supervised learning tasks
  Conditional Random Field: Constructs a Conditional Random Fields (CRF) model for labeling sequential data
  k-Nearest Neighbors: Finds the $$k$$ nearest data points to a given data point and outputs the majority vote of their classes for classification, or the average of their target values for regression
  Neural Network: Solves classification and regression problems with several fully connected layers and non-linear activation functions
  ►Regression Models: A collection of methods for modeling conditional expectation of a response variable
  Support Vector Machines: Solves classification and regression problems by separating data with a hyperplane or other nonlinear decision boundary
  ►Tree Methods: A collection of recursive partitioning (tree) methods
▼Time Series Analysis: A collection of methods to analyze time series data
  ARIMA: Generates a model with autoregressive, moving average, and integrated components for a time series dataset
▼Unsupervised Learning: A collection of methods for unsupervised learning tasks
  ►Association Rules: Methods used to discover patterns in transactional datasets
  ►Clustering: Methods for clustering data
  ►Dimensionality Reduction: Methods for reducing the number of variables in a dataset to obtain a set of principal variables
  ►Topic Modelling: A collection of methods to uncover abstract topics in a document corpus
▼Utilities
  Columns to Vector: Creates a new table with all feature columns inserted into a single column as an array
  Database Functions: Provides a collection of user-defined functions for performing common tasks in the database
  ►Linear Solvers: Methods that implement solutions for systems of consistent linear equations
  Mini-Batch Preprocessor: Utility that prepares input data for use by models that support mini-batch as an optimization option
  PMML Export: Implements the PMML XML standard to describe and exchange models produced by data mining and machine learning algorithms
  Term Frequency: Provides a collection of functions for performing common tasks related to text analytics
  Vector to Columns: Converts a feature array in a single column of an output table into multiple columns
▼Early Stage Development
  Conjugate Gradient: Finds the solution to the linear system $$A\boldsymbol{x} = \boldsymbol{b}$$, where $$A$$ is a symmetric, positive-definite matrix and $$\boldsymbol{x}$$ and $$\boldsymbol{b}$$ are vectors
  DBSCAN: Partitions a set of observations into clusters of arbitrary shape based on the density of nearby neighbors
  Naive Bayes Classification: Constructs a classification model from a dataset where each attribute independently contributes to the probability that a data point belongs to a category
  Random Sampling: Provides utility functions for sampling operations
  XGBoost: Lets you use SQL to build gradient boosted tree models designed in XGBoost [1]
▼Deprecated Modules
  Create Indicator Variables: Provides utility functions helpful for data preparation before modeling
  Multinomial Logistic Regression: Also called softmax regression; models the relationship between one or more independent variables and a categorical dependent variable