2.1.0
User Documentation for Apache MADlib
Modules
Here is a list of all modules:
 Data Types and Transformations
 Arrays and Matrices: Mathematical operations for arrays and matrices
 Encoding Categorical Variables: Functions to encode categorical variables to prepare data for input into predictive algorithms
 Path: A function to perform complex pattern matching across rows and extract useful information about the matches
 Pivot: Pivoting and data summarization tools for preparing data for modeling operations
 Sessionize: Session reconstruction of data consisting of a time-stamped sequence of events
 Stemming: Provides Porter stemmer operations supporting other MADlib modules
 Deep Learning: A collection of modules for deep learning
 Model Preparation: Prepare models and data for deep learning
 Train Single Model: Fit, evaluate and predict for one model
 Train Multiple Models: Train multiple deep learning models at the same time for model architecture search and hyperparameter selection
 Utilities for Deep Learning: Utilities specific to deep learning workflows
 Graph: Graph algorithms and measures associated with graphs
 All Pairs Shortest Path: Finds the shortest paths between every vertex pair in a given graph
 Breadth-First Search: Finds the nodes reachable from a given source vertex using a breadth-first approach
 HITS: Finds the HITS scores (authority and hub) of all vertices in a directed graph
 Measures: A collection of metrics computed on a graph
 PageRank: Finds the PageRank of all vertices in a directed graph
 Single Source Shortest Path: Finds the shortest path from a single source vertex to every other vertex in a given graph
 Weakly Connected Components: Finds all weakly connected components of a graph
 Model Selection: Functions for model selection and model evaluation
 Cross Validation: Estimates the fit of a predictive model given a data set and specifications for the training, prediction, and error estimation functions
 Prediction Metrics: Provides various prediction accuracy metrics
 Train-Test Split: A method for splitting a data set into separate training and testing sets
 Sampling: A collection of methods for sampling from a population
 Balanced Sampling: A method to independently sample classes to produce a balanced data set. This is commonly used when classes are imbalanced, to ensure that subclasses are adequately represented in the sample
 Stratified Sampling: A method for independently sampling subpopulations (strata)
 Statistics: A collection of probability and statistics modules
 Descriptive Statistics: Methods to compute descriptive statistics of a dataset
 Inferential Statistics: Methods to compute inferential statistics of a dataset
 Probability Functions: Provides cumulative distribution, density/mass, and quantile functions for a wide range of probability distributions
 Supervised Learning: Methods to perform a variety of supervised learning tasks
 Conditional Random Field: Constructs a Conditional Random Field (CRF) model for labeling sequential data
 k-Nearest Neighbors: Finds the \(k\) nearest data points to a given data point and outputs the majority vote of their classes for classification, or the average of their target values for regression
 Neural Network: Solves classification and regression problems with several fully connected layers and non-linear activation functions
 Regression Models: A collection of methods for modeling the conditional expectation of a response variable
 Support Vector Machines: Solves classification and regression problems by separating data with a hyperplane or other nonlinear decision boundary
 Tree Methods: A collection of recursive partitioning (tree) methods
 Time Series Analysis: A collection of methods to analyze time series data
 ARIMA: Generates a model with autoregressive, moving average, and integrated components for a time series dataset
 Unsupervised Learning: A collection of methods for unsupervised learning tasks
 Association Rules: Methods used to discover patterns in transactional datasets
 Clustering: Methods for clustering data
 Dimensionality Reduction: Methods for reducing the number of variables in a dataset to obtain a set of principal variables
 Topic Modelling: A collection of methods to uncover abstract topics in a document corpus
 Utilities
 Columns to Vector: Creates a new table with all feature columns inserted into a single column as an array
 Database Functions: Provides a collection of user-defined functions for performing common tasks in the database
 Linear Solvers: Methods that implement solutions for consistent systems of linear equations
 Mini-Batch Preprocessor: Utility that prepares input data for use by models that support mini-batch as an optimization option
 PMML Export: Implements the PMML XML standard to describe and exchange models produced by data mining and machine learning algorithms
 Term Frequency: Provides a collection of functions for performing common tasks related to text analytics
 Vector to Columns: Converts a feature array in a single column of an output table into multiple columns
 Early Stage Development
 Conjugate Gradient: Finds the solution to the linear system \( A \boldsymbol{x} = \boldsymbol{b} \), where \(A\) is a symmetric, positive-definite matrix and \( \boldsymbol{x} \) and \( \boldsymbol{b} \) are vectors
 DBSCAN: Partitions a set of observations into clusters of arbitrary shape based on the density of nearby neighbors
 Naive Bayes Classification: Constructs a classification model from a dataset where each attribute independently contributes to the probability that a data point belongs to a category
 Random Sampling: Provides utility functions for sampling operations
 XGBoost: This module allows you to use SQL to build gradient boosted tree models designed in XGBoost [1]
 Deprecated Modules
 Create Indicator Variables: Provides utility functions helpful for data preparation before modeling
 Multinomial Logistic Regression: Also called softmax regression; models the relationship between one or more independent variables and a categorical dependent variable
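
To make the "softmax regression" name in the Multinomial Logistic Regression entry concrete, here is a minimal Python sketch of how such a model turns linear scores into category probabilities. It is an illustration only and not MADlib's API or implementation: the function names `softmax` and `predict_proba` and the coefficient values are hypothetical, chosen for this example.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum score avoids overflow and leaves the result unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(coefficients, features):
    """Return one probability per category for a single observation.

    coefficients: one list of weights per category (hypothetical values,
                  not fitted by MADlib).
    features:     independent variables for one row, same length as each
                  weight list.
    """
    scores = [
        sum(w * x for w, x in zip(weights, features))
        for weights in coefficients
    ]
    return softmax(scores)


# Hypothetical weights for three categories; the first feature acts as an intercept.
coefficients = [[0.0, 0.0], [0.5, 1.0], [-0.5, 2.0]]
probabilities = predict_proba(coefficients, [1.0, 0.5])

# The predicted category is the one with the highest probability.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
```

With these made-up weights the linear scores are 0.0, 1.0 and 0.5, so the second category (index 1) receives the largest probability. A fitted model would estimate the weights from data rather than fixing them by hand.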