Documentation

Latest User Guide

The primary reference documentation, with detailed information on the functions and algorithms in MADlib, along with background theory and references to the literature.

Installation Guide

Information on initial installation and deployment of MADlib into a database instance.

Quick Start Guide for Users

Introduction to themes and concepts in MADlib. The guide walks the user through an initial data load, training a model, inspecting a model, and scoring a model.

Quick Start Guide for Developers

For developers who are interested in contributing to MADlib. Includes instructions for an available Docker image with necessary dependencies to compile and test MADlib.

Jupyter Notebooks for Getting Started

Notebooks covering many of the algorithms commonly used by data scientists.

Community Portal

Additional material for individuals looking to contribute to the project is available on our community portal.

Example Use Cases

Linear Regression

Linear regression is used to model the linear relationship of a scalar dependent variable to one or more explanatory independent variables.
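
As a rough illustration of the idea (not MADlib's own API), the single-variable case can be sketched in plain Python. The function name `fit_line` and the sample numbers are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With more than one explanatory variable the same least-squares criterion applies, but the fit is usually computed with matrix methods rather than these scalar sums.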

Latent Dirichlet Allocation

Latent Dirichlet Allocation is a topic modeling function used to identify recurring themes in a large document corpus.

Summary

The summary function provides summary statistics for any data table. These statistics include: number of distinct values, number of missing values, mean, variance, min, max, most frequent values, quantiles, etc.
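
To make the list of statistics concrete, here is a minimal sketch for a single column using only the Python standard library. It is not MADlib's summary function; `summarize` and its output keys are hypothetical names, and `None` stands in for a missing value.

```python
import statistics
from collections import Counter


def summarize(values):
    """Summary statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),  # sample variance
        "min": min(present),
        "max": max(present),
        "most_frequent": Counter(present).most_common(1)[0][0],
        "median": statistics.median(present),
    }


# Invented column with one missing entry
column = [2, 4, 4, None, 6, 4]
stats = summarize(column)
print(stats)
```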

Logistic Regression

Logistic regression is used to predict a binary outcome of a dependent variable from one or more explanatory independent variables.
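
The following is a small sketch of the technique with one explanatory variable, fitted by plain gradient descent. The optimizer, learning rate, iteration count, and data are all assumptions chosen for illustration; MADlib's implementation and optimizers may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Residuals between predicted probability and observed 0/1 label
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Invented data: larger x makes the outcome 1 more likely, with some overlap
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.0 + b))
```

The fitted model returns a probability; a binary prediction comes from applying a threshold such as 0.5.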

Elastic Net Regularization

Elastic Net regularization is a technique that can be applied to either linear or logistic regression to build a more robust model when there are large numbers of explanatory independent variables.
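
As a sketch of what the regularization term looks like, the function below computes a blended L1 and L2 penalty on a coefficient vector. The parameterization (`lambda_value` for overall strength, `alpha` for the L1 share, and the factor of one half on the L2 term) is one common convention and is assumed here, not taken from this page.

```python
def elastic_net_penalty(coefficients, lambda_value, alpha):
    """Penalty added to the regression loss.

    alpha = 1 gives a pure L1 (lasso) penalty; alpha = 0 gives pure L2 (ridge).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    l1 = sum(abs(c) for c in coefficients)
    l2_squared = sum(c * c for c in coefficients)
    return lambda_value * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


# Invented coefficient vector
coefficients = [3.0, -4.0, 0.0]
print(elastic_net_penalty(coefficients, lambda_value=0.1, alpha=0.5))
```

The L1 part pushes some coefficients to exactly zero, which is why the technique helps when many explanatory variables are present.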

Principal Component Analysis

Principal Component Analysis is a dimensionality reduction technique that can be used to transform a high dimensional space into a lower dimensional space.
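
A minimal sketch of the idea, assuming power iteration on the sample covariance matrix to find only the first principal component. This is a teaching simplification with a hypothetical function name; production implementations generally use a full matrix decomposition.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the leading direction and the data projected onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; the all-ones start is an assumption and would fail
    # if it happened to be orthogonal to the leading direction
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("data has no variance along the start vector")
        v = [x / norm for x in w]
    # Reduce each d-dimensional row to a single coordinate
    projected = [sum(c * x for c, x in zip(r, v)) for r in centered]
    return v, projected


# Invented 2-D data lying on the line y = 2x
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projected = first_principal_component(rows)
print(direction, projected)
```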

Apriori

Apriori is a technique for evaluating frequent item-sets, which allows analysis of what events tend to occur together. For example, which items do customers frequently purchase together in a single transaction?
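
The level-wise search behind this technique can be sketched in a few lines of Python. The function name, the use of an absolute support count as the threshold, and the shopping baskets are all invented for this example; this is not MADlib's interface.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Map each item-set found in at least min_support transactions to its count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-sets into candidate k-sets
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


# Invented shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = frequent_itemsets(baskets, min_support=3)
print(frequent)
```

The pruning step relies on the fact that an item-set can only be frequent if all of its subsets are frequent, which keeps the candidate list small.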

k-Means

k-Means is a clustering method used to identify regions of similarity within a dataset. It can be used for many types of analysis including customer segmentation.
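
A minimal sketch of the standard assign-then-update loop (Lloyd's algorithm), assuming squared Euclidean distance and caller-supplied starting centroids so that the result is deterministic. The function name and data are invented; MADlib offers its own seeding and distance options that are not shown here.

```python
def kmeans(points, centroids, iterations=10):
    """Return final centroids and the points assigned to each one."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Invented 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.0), (8.0, 10.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```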

Shortest Path in a Graph

Finds the shortest path from a source vertex to every other vertex in the graph. Example uses include vehicle routing and navigation, degrees of separation in a social network, and minimum delay paths in a telecommunications network.
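
For illustration, the single-source problem can be sketched with Dijkstra's algorithm, which assumes non-negative edge weights. Dijkstra is used here only because it is compact; this page does not say which algorithm MADlib uses, and the graph below is invented.

```python
import heapq


def shortest_paths(graph, source):
    """Return the minimum total weight from source to each reachable vertex.

    graph maps a vertex to a list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry; a shorter route to u was already found
        for v, weight in graph.get(u, []):
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Invented directed graph with weighted edges
graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
distances = shortest_paths(graph, "A")
print(distances)
```

Vertices that cannot be reached from the source are simply absent from the returned dictionary.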