Apache MADlib: Big Data Machine Learning in SQL

  • Open source, commercially friendly Apache license
  • For PostgreSQL and Greenplum Database®
  • Powerful machine learning, graph, statistics and analytics for data scientists

Getting Started with Apache MADlib using Jupyter Notebooks

We have created a library of Jupyter Notebooks to help you get started quickly with MADlib. It covers many of the algorithms most commonly used by data scientists.

MADlib 1.14 Release (GA)

On May 1, 2018, MADlib completed its third release as an Apache Software Foundation Top Level Project.

New features include: balanced datasets, personalized PageRank, a mini-batch optimizer for multilayer perceptron neural networks (with an associated pre-processor function), and PostgreSQL 10.2 support.

Improvements:

  • K-nearest neighbors - Added weighted averaging/voting by distance.

  • Summary - Added more statistics including number of positive, negative, zero values and 95% confidence intervals.

  • Multilayer perceptron - Added support for one-hot encoded categorical dependent variable for classification.
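
As a rough illustration of what voting weighted by distance means for k-nearest neighbors, the sketch below is plain Python, not MADlib's SQL interface. The function name `weighted_knn_vote`, the inverse-distance weighting, and the toy data are all assumptions made for illustration; the release notes do not specify the weighting scheme.

```python
import math
from collections import defaultdict


def weighted_knn_vote(train, query, k=3):
    """Classify `query` by distance-weighted voting among its k nearest points.

    `train` is a list of (features, label) pairs. Each neighbor's vote is
    weighted by 1 / distance (an assumed scheme, chosen for illustration).
    """
    # Sort training points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted((math.dist(x, query), label) for x, label in train)[:k]

    votes = defaultdict(float)
    for dist, label in neighbors:
        if dist == 0.0:
            # An exact match decides the class outright.
            return label
        votes[label] += 1.0 / dist
    return max(votes, key=votes.get)


# Hypothetical toy data: one nearby "a" point and two more distant "b" points.
train = [((0.0, 0.0), "a"), ((4.0, 0.0), "b"), ((5.0, 0.0), "b")]
prediction = weighted_knn_vote(train, (1.0, 0.0), k=3)
```

With these points, a plain majority vote over three neighbors would favor "b" (two votes to one), while the weighted vote favors the single, much closer "a" point.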

You are invited to download the 1.14 release and review the release notes.

MADlib 1.13 Release (GA)

On Dec 22, 2017, MADlib completed its second release as an Apache Software Foundation Top Level Project.

New feature: Hyperlink-Induced Topic Search (HITS) link analysis algorithm.

Improvements:

  • K-nearest neighbors (kNN) - Added more distance metrics and a list of neighbors in the output table.

  • Multilayer perceptron (MLP) - Now supports grouping.

  • Cross validation - Improved the stats reporting in the output table.

  • Correlation - Improved quality of results by only ignoring a NULL value and not the whole row containing the NULL.
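
The correlation change can be pictured as the difference between dropping a whole row when any column is NULL and dropping only the pairs that actually involve the NULL. The sketch below is plain Python, not MADlib code; the helper names and the sample rows are hypothetical, and `None` stands in for SQL NULL.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def corr_dropping_whole_rows(rows, i, j):
    """Discard any row that has a NULL in any column, then correlate columns i and j."""
    complete = [r for r in rows if None not in r]
    return pearson([r[i] for r in complete], [r[j] for r in complete]), len(complete)


def corr_ignoring_only_nulls(rows, i, j):
    """Discard a row only if column i or column j itself is NULL."""
    usable = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in usable], [r[j] for r in usable]), len(usable)


# Hypothetical data: columns 0 and 1 are complete, column 2 has two NULLs.
rows = [(1, 2, None), (2, 4, 1), (3, 6, None), (4, 8, 3), (5, 9, 2)]

r_rows, n_rows = corr_dropping_whole_rows(rows, 0, 1)
r_pairs, n_pairs = corr_ignoring_only_nulls(rows, 0, 1)
```

Here the correlation between columns 0 and 1 is computed from all five rows when only the NULL values are ignored, but from just three rows when whole rows are dropped, even though neither of those two columns contains a NULL.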

You are invited to download the 1.13 release and review the release notes.

MADlib 1.12 Release (GA)

On Aug 29, 2017, MADlib completed its first release as an Apache Software Foundation Top Level Project.

New features include: All Pairs Shortest Path, Weakly Connected Components, Breadth First Search, Multiple Graph Measures, Stratified Sampling, Train-test split, Multilayer Perceptron, and various updates for Apache Top Level Project.

Improvements:

  • Decision tree and random forest - Allow expressions in feature list, allow array input for features, filter NULL dependent values in OOB, add option to treat NULL as a category.

  • Summary - Allow user to determine the number of columns per run, improve computation time by ~35%.

  • Sketch - Promote cardinality estimators to top level module from early stage.

You are invited to download the 1.12 release and review the release notes.

MADlib Graduates to Apache Top Level Project

On July 19, 2017, the ASF board established Apache MADlib as a Top Level Project, which was approved by unanimous vote of the directors present. Please see the associated press release from the ASF.

MADlib entered incubation in the fall of 2015 and made five releases as an incubating project. Along the way, the MADlib community has worked hard to ensure that the project is being developed according to the principles of The Apache Way. We will continue to do so in the future as a TLP, to the best of our ability.

Thank you to all who have contributed to the project so far. We look forward to more innovation in machine learning in the future as a TLP!

MADlib User Survey Results

In October 2016, we ran a survey asking MADlib users about a wide range of topics pertaining to this open source project, including desired new features. Thank you to all who responded.

You are welcome to view the survey results and make any comments or suggestions on the user mailing list.